CN114386569A - Novel image description generation algorithm using capsule network - Google Patents

Novel image description generation algorithm using capsule network

Info

Publication number
CN114386569A
CN114386569A (application CN202111572920.7A)
Authority
CN
China
Prior art keywords
image
capsule network
level
vector
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111572920.7A
Other languages
Chinese (zh)
Other versions
CN114386569B (en)
Inventor
于红
刘晗
刘元秋
刘雨欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202111572920.7A priority Critical patent/CN114386569B/en
Publication of CN114386569A publication Critical patent/CN114386569A/en
Application granted granted Critical
Publication of CN114386569B publication Critical patent/CN114386569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

A novel image description generation algorithm using a capsule network first processes region-level image features with a multi-channel bilinear pooling attention module: the region-level features pass through a bilinear pooling attention mechanism and a squeeze-and-excitation operation to yield multi-channel attentive visual features. The multi-channel features are then input into a capsule network, each dimension of the region-level features serving as the activity vector of a bottom-level capsule, and dynamic routing aggregates the region-level features into a global-level image feature. Finally, during decoding, the LSTM hidden vector, the image features, and the word vector of the word generated at the previous time step are used as the input of the next time step, and a bilinear pooling algorithm updates the features and the hidden state to generate the corresponding word. Over multiple LSTM time steps, the generated words compose the corresponding description. The invention uses the capsule network to capture relative position relations during image description generation and thereby produces better image descriptions.

Description

Novel image description generation algorithm using capsule network
Technical Field
The invention belongs to the field of artificial intelligence, and mainly relates to a novel image description generation algorithm using a capsule network.
Background
The image description generation task connects two major directions in artificial intelligence: computer vision and natural language processing. People effortlessly relate visual information such as scenes and objects in an image and perceive its high-level semantics, but a computer cannot understand and organize this information the way the human brain does; the image description generation task aims to convert image features into semantic information and thereby help computers better understand image content. To realize the conversion from image to text, early work started from two directions, templates and retrieval: either filling detected object names into a language template to generate a description, or retrieving a similar picture and modifying its description. Both approaches have clear drawbacks: template-based methods produce descriptions of fixed length and rigid format, while retrieval-based methods depend on the dataset, cannot adapt to new pictures, and struggle to generate high-quality image descriptions.
Current classical image description generation methods share an encoder-decoder framework, with research focused mainly on image feature processing and the application of attention mechanisms. Image feature processing concerns the encoder: features of different regions and levels are extracted from the picture and then processed, improving the quality of the description. For example, the SCA-CNN method analyzes the spatial, multi-layer, and multi-channel characteristics of convolutional neural networks and achieves better results by combining channel attention with spatial attention. The Bottom-Up method selectively extracts picture region features through object detection and entity recognition, generating more accurate and complete picture descriptions. The X-Linear method uses spatial and channel bilinear pooling to obtain second-order interactions of image features, enhancing the expressive power of the model.
Work on attention mechanisms mainly aims to strengthen the correlation between image regions and words so as to obtain more semantic detail. The visual sentinel method lets the attention mechanism decide on its own whether to attend to the image or the sentence, generating the corresponding entity word or preposition. The look-back and predict-forward method extends attention to a window of two words, making the description more coherent and closer to human language habits. Scene-graph methods make the algorithm attend more to the entities, attributes, and inter-entity relationships in the picture, improving the accuracy of the description. Since the Transformer model was applied, the attention mechanism has been improved more deeply and the results have become better still.
At present, processing image features to obtain deeper information is a general direction and can serve as a front-end to the attention mechanism; fusing the two parts yields higher-quality image feature representations. Existing visual attention mechanisms can focus on different positions of the picture while generating the text sequence so as to select corresponding words, but this attention shifting cannot attend to the relative spatial positions of objects in the picture. The present invention uses a capsule network to improve the attention mechanism so that the spatial information conveyed in the image is fully exploited to generate more accurate and detailed descriptions.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a novel image description generation algorithm using a capsule network: relative spatial position relations are captured through transformation matrices, solving the problem that the traditional visual attention mechanism cannot fully capture spatial relations.
The technical scheme of the invention is as follows:
a novel image description generation algorithm using a capsule network, characterized by the steps of:
(1) Processing region-level image features using a multi-channel bilinear pooling attention module.

Take the region-level image feature matrix F and embed the feature vectors Q_E, K_E, V_E: K_E and V_E are both initialized to F, and Q_E is initialized to the average pooling of all region-level image features:

Q_E = (1/N) Σ_{i=1}^{N} f_i

where f_i is the i-th feature of F, N is the number of region-level image features, and Q_E, K_E, V_E are the query vector, relevance vector, and queried vector of the attention mechanism.

First, the low-rank bilinear pooling algorithm is applied to Q_E and K_E to obtain the intermediate representation k'_i of the i-th dimension:

k'_i = σ(W_k k_i) ⊙ σ(W_q^k Q_E)

Bilinear pooling of Q_E and V_E likewise yields the i-th dimension v'_i:

v'_i = σ(W_v v_i) ⊙ σ(W_q^v Q_E)

where W_k, W_q^k, W_v, W_q^v are the embedding matrices of k_i, Q_E, v_i, Q_E respectively, ⊙ is element-wise multiplication between matrices, and σ is a nonlinear activation function.

For the intermediate representations k'_i, the squeeze-and-excitation operation performs global average pooling to obtain k̄ and captures the channel dependence α_c:

k̄ = (1/N) Σ_{i=1}^{N} k'_i

α_c = sigmoid(W_B σ(W_A k̄))

where W_A, W_B are embedding matrices, N is the number of intermediate representations k'_i, and σ is a nonlinear activation function.

From α_c and the v'_i, the multi-channel visual representation v̂ is obtained:

v̂_i = α_c ⊙ v'_i, i = 1, ..., N

where v̂_i is the i-th dimension of v̂.
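Below is a minimal PyTorch sketch of step (1), illustrative rather than the patent's implementation: the 2048-dimensional region features, 1024-dimensional embeddings, and CELU activation follow the embodiment, while the class name, the two-layer squeeze-and-excitation form (W_A, W_B), and all variable names are assumptions.

    import torch
    import torch.nn as nn

    class MultiChannelBilinearAttention(nn.Module):
        def __init__(self, feat_dim=2048, embed_dim=1024):
            super().__init__()
            self.act = nn.CELU()                        # sigma: CELU per claim 2
            self.W_k  = nn.Linear(feat_dim, embed_dim)  # embedding matrix of k_i
            self.W_qk = nn.Linear(feat_dim, embed_dim)  # embedding of Q_E (key branch)
            self.W_v  = nn.Linear(feat_dim, embed_dim)  # embedding matrix of v_i
            self.W_qv = nn.Linear(feat_dim, embed_dim)  # embedding of Q_E (value branch)
            self.W_A  = nn.Linear(embed_dim, embed_dim) # squeeze-and-excitation, layer 1 (assumed)
            self.W_B  = nn.Linear(embed_dim, embed_dim) # squeeze-and-excitation, layer 2 (assumed)

        def forward(self, F):                           # F: (N, feat_dim) region features
            Q_E = F.mean(dim=0, keepdim=True)           # average pooling of all regions
            K_E, V_E = F, F
            # low-rank bilinear pooling: element-wise products of embedded pairs
            k_prime = self.act(self.W_k(K_E)) * self.act(self.W_qk(Q_E))  # (N, embed_dim)
            v_prime = self.act(self.W_v(V_E)) * self.act(self.W_qv(Q_E))  # (N, embed_dim)
            k_bar = k_prime.mean(dim=0)                 # squeeze: global average pooling
            alpha_c = torch.sigmoid(self.W_B(self.act(self.W_A(k_bar))))  # channel dependence
            return alpha_c * v_prime                    # multi-channel representation v_hat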
(2) Extracting an image-level visual representation using the capsule network.

The multi-channel visual representation v̂ is input into the capsule network for 2-4 rounds of dynamic routing, the capsule network parameters being updated after each round, to obtain the final image-level visual representation v_g. The capsule network operation formulas are:

û_i = W_i^f μ_i

v_g = squash(s), s = Σ_i c_i û_i, squash(s) = (‖s‖^2 / (1 + ‖s‖^2)) · (s / ‖s‖)

where μ_i is the activity vector of the i-th bottom-level capsule (the i-th dimension v̂_i of the multi-channel visual representation), W_i^f is the transformation matrix of μ_i, and c_i is the coupling coefficient corresponding to μ_i in the capsule network.

The coupling coefficient updating formulas of the capsule network are:

c_i = exp(b_i) / Σ_j exp(b_j)

b_i ← b_i + û_i · v_g

where b_i, b_j are dimensions i, j of the routing matrix in the capsule network; b_i self-updates by accumulating the product of the transformed μ_i and v_g.
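A matching sketch of the dynamic routing of step (2): the squash function and the routing-by-agreement update follow Sabour et al.'s capsule formulation, which the patent's formulas mirror; the class name, the parameter initialization, and the single-output-capsule reading are assumptions.

    import torch
    import torch.nn as nn

    def squash(s, eps=1e-8):
        n2 = (s * s).sum(dim=-1, keepdim=True)
        return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

    class CapsuleAggregator(nn.Module):
        def __init__(self, n_regions, in_dim=1024, out_dim=1024, iters=3):
            super().__init__()
            # one transformation matrix W_i^f per bottom-level capsule (learned by backprop)
            self.W = nn.Parameter(0.01 * torch.randn(n_regions, in_dim, out_dim))
            self.iters = iters                           # 2-4 routing rounds per the patent

        def forward(self, mu):                           # mu: (N, in_dim) capsule activity vectors
            u_hat = torch.einsum('ni,nio->no', mu, self.W)      # transformed capsules u_hat_i
            b = torch.zeros(mu.size(0), device=mu.device)       # routing logits b_i
            for _ in range(self.iters):
                c = torch.softmax(b, dim=0)                     # coupling coefficients c_i
                v_g = squash((c.unsqueeze(-1) * u_hat).sum(dim=0))  # image-level vector
                b = b + (u_hat * v_g).sum(dim=-1)               # b_i += u_hat_i . v_g
            return v_g                                          # (out_dim,)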
(3) Decoding the image-level visual representation v_g using the LSTM and bilinear pooling modules to obtain the image description.

The decoder comprises one LSTM layer; at each time step a word is generated through the LSTM layer and the bilinear pooling module, and after looping over T time steps a description sentence of length T is obtained, where T is at most 17. At time t, the average pooling F̄ of the region-level image feature matrix and the image-level visual representation v_g form the joint representation ṽ; the context vector c_{t-1} computed at time t-1 and the word vector s_{t-1} generated at time t-1 are concatenated into x_t, which is input to the LSTM to obtain the hidden vector h_t; h_t is output to the bilinear pooling module and the GLU module to obtain the context vector c_t, and the word s_t is generated after a softmax operation.

The LSTM input x_t is computed as:

ṽ = W_F [F̄; v_g]

x_t = W_x [ṽ; c_{t-1}; s_{t-1}]

where W_F, W_x are embedding matrices and [;] denotes concatenation.

The hidden vector h_t is computed as:

h_t = LSTM(x_t, h_{t-1})

where h_{t-1} is the hidden state matrix of the LSTM at time t-1.

The context vector c_t is computed as:

c_t = GLU(F_X-Linear(K_D, V_D, Q_D))

where F_X-Linear is the computation function of the bilinear pooling module, and K_D, V_D, Q_D are the relevance vector, queried vector, and query vector in the bilinear pooling module; K_D is initialized to the LSTM hidden state h_t, and V_D, Q_D are initialized to the visual joint representation ṽ.

The word generation formula is:

s_t = softmax(W_c c_t)

where s_t is the word generated at time t, and W_c is the embedding matrix of c_t.
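A sketch of one decoding time step of step (3). The F_X-Linear block is abbreviated here to a gated bilinear interaction (nn.Bilinear followed by a GLU), a stand-in rather than the patent's exact module; the vocabulary size, layer widths, and greedy word choice are assumptions.

    import torch
    import torch.nn as nn

    class DecoderStep(nn.Module):
        def __init__(self, feat_dim=1024, hid=1024, vocab=10000):
            super().__init__()
            self.W_F = nn.Linear(2 * feat_dim, feat_dim)       # joint representation v_tilde
            self.W_x = nn.Linear(feat_dim + 2 * hid, hid)      # x_t from [v~; c_{t-1}; s_{t-1}]
            self.lstm = nn.LSTMCell(hid, hid)
            self.bilinear = nn.Bilinear(hid, feat_dim, 2 * hid)  # stand-in for F_X-Linear
            self.glu = nn.GLU(dim=-1)                          # halves 2*hid -> hid
            self.W_c = nn.Linear(hid, vocab)                   # word classifier
            self.embed = nn.Embedding(vocab, hid)              # word vectors s_t

        def forward(self, F_bar, v_g, c_prev, s_prev, state):
            v_joint = self.W_F(torch.cat([F_bar, v_g], dim=-1))
            x_t = self.W_x(torch.cat([v_joint, c_prev, self.embed(s_prev)], dim=-1))
            h_t, m_t = self.lstm(x_t, state)                   # hidden vector h_t
            c_t = self.glu(self.bilinear(h_t, v_joint))        # context vector c_t
            logits = self.W_c(c_t)
            s_t = logits.softmax(dim=-1).argmax(dim=-1)        # word s_t (greedy choice)
            return s_t, c_t, (h_t, m_t)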
The nonlinear activation function is a CELU activation function.
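For reference, the patent names CELU without restating it; the standard definition (available in PyTorch as torch.nn.CELU) is:

CELU(x) = max(0, x) + min(0, α · (exp(x/α) − 1)), with α = 1 by default.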
The invention has the beneficial effects that: the invention provides a novel image description generation algorithm using a capsule network. Multi-channel bilinear pooling attention achieves second-order interaction of image features across channels, and the capsule network captures the relative spatial positions of regional features, guiding the decoder to notice the positional relations between entities in the sentence, to accurately generate words reflecting those relations, and thus to produce higher-quality image descriptions.
Drawings
FIG. 1 is a flow chart of an image description generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the capsule network effect according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an algorithm framework according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
FIG. 1 is a flowchart of an image description generation method according to an embodiment of the present invention. As shown in FIG. 1, an embodiment of the present invention provides an improved image description generation method using a capsule network, comprising:
Step 101, processing region-level image features using a multi-channel bilinear pooling attention module;
Step 102, extracting an image-level visual representation using a capsule network;
Step 103, decoding the image-level visual representation v_g using the LSTM and bilinear pooling modules, and obtaining the image description after looping over several time steps.
In the embodiment of the present invention, the region-level features of the picture correspond to N vectors of length 2048, each representing the features of one of N sub-regions of the picture; these N regions cover the main objects in the picture. The region-level image features are used to initialize the inputs Q_E, K_E, V_E of the multi-channel bilinear pooling attention module: Q_E is set to the average pooling matrix of the region-level features, and K_E and V_E are set to the region-level features themselves, which are then input into the module for computation.
Low-rank bilinear pooling yields a regularized image feature matrix of dimension N x 1024 that reflects the region-level image features. The attention capsule network takes each dimension of the region-level features as the activity vector of a bottom-level capsule; through the dynamic routing computation, the spatial relation between each salient region and the whole image is retained in the routing's transformation matrices, so that the region-level features are aggregated into a global-level image feature. FIG. 2 illustrates the capsule network effect: a vector undergoes different pose transformations (translation, rotation, scaling, and the like) under different transformation matrices. In the same way, each dimension of the region features represents one salient region, and during the capsule network computation the relative position relations among regions are retained in the continuously updated transformation matrices, which in turn update the single global image feature.
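The following toy example illustrates FIG. 2's intuition: the same activity vector, written in homogeneous coordinates, undergoes different pose changes under different transformation matrices, which is the property dynamic routing exploits to retain relative positions; the specific matrices are illustrative only.

    import numpy as np

    p = np.array([1.0, 0.0, 1.0])               # activity vector (x, y, 1)
    theta = np.pi / 2
    rotate    = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                          [np.sin(theta),  np.cos(theta), 0.0],
                          [0.0, 0.0, 1.0]])
    scale     = np.diag([2.0, 2.0, 1.0])
    translate = np.array([[1.0, 0.0,  3.0],
                          [0.0, 1.0, -1.0],
                          [0.0, 0.0,  1.0]])
    for name, W in [("rotate", rotate), ("scale", scale), ("translate", translate)]:
        print(name, W @ p)    # rotate -> (0, 1); scale -> (2, 0); translate -> (4, -1)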
In the decoding stage, step 103 uses the image-level visual representation generated in step 102, together with the LSTM hidden state and the word vector of the word generated at the previous time step, to update the hidden state and obtain the output vector that generates the word at the current time step. Fusing the previously generated word vector into the input guides the decoder toward sentences that better conform to human language conventions. The decoding stage still requires the bilinear pooling attention module, whose inputs are the hidden state and the region-level features: the hidden state reflects the direction of attention transfer, and the result of the bilinear pooling computation lets the hidden state highlight the attended content by fusing in the region-level features, promoting the generation of the next word. Running the LSTM model over multiple time steps, the decoder finally generates the description sentence corresponding to the picture.
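A sketch of this decoding loop, reusing the hypothetical DecoderStep from the step (3) sketch above; the start-token id of 0 and greedy decoding are assumptions, while T = 17 is the embodiment's maximum caption length.

    import torch

    def generate_caption(step, F, v_g, hid=1024, T=17, bos=0):
        B = 1                                             # single picture
        F_bar = F.mean(dim=0, keepdim=True)               # mean-pooled region features
        state = (torch.zeros(B, hid), torch.zeros(B, hid))
        c_prev = torch.zeros(B, hid)                      # context vector c_0
        s_prev = torch.full((B,), bos, dtype=torch.long)  # assumed start token
        words = []
        for _ in range(T):                                # loop over T time steps
            s_prev, c_prev, state = step(F_bar, v_g, c_prev, s_prev, state)
            words.append(int(s_prev))
        return words                                      # token ids of the description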
FIG. 3 is the overall framework diagram of the algorithm and reflects the complete process from a picture to its generated description. After an image is input, the region-level image features are first processed by the multi-channel bilinear pooling attention block: bilinear pooling and the squeeze-and-excitation operation turn the region-level features into multi-channel visual features. The multi-channel features are then input into the attention capsule network, each dimension serving as a bottom-level capsule, and dynamic routing aggregates the region-level features into a global-level image feature. Finally, during decoding, the LSTM hidden vector, the image features, and the word vector of the previously generated word serve as the input of the next time step, and bilinear pooling updates the features and the hidden state to generate the corresponding word. Over multiple LSTM time steps, the generated words compose the corresponding description. In summary: the invention provides an improved image description generation method using a capsule network. Multi-channel low-rank bilinear pooling captures second-order interactions of image features, and the capsule network extracts the relative position relations among regional features, guiding the decoder to notice the positional and spatial relations between entities in the sentence and to generate words reflecting them, yielding higher-quality image descriptions.
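Finally, the three sketches compose into the end-to-end flow of FIG. 3; random tensors stand in for a real detector's output, N = 36 regions is an assumption, and for simplicity the pooled multi-channel features stand in for the patent's F̄.

    import torch
    # assumes MultiChannelBilinearAttention, CapsuleAggregator, DecoderStep,
    # and generate_caption from the sketches above are in scope

    N, feat_dim, embed_dim = 36, 2048, 1024           # N = 36 is an assumption
    attn = MultiChannelBilinearAttention(feat_dim, embed_dim)
    caps = CapsuleAggregator(n_regions=N, in_dim=embed_dim, out_dim=embed_dim)
    step = DecoderStep(feat_dim=embed_dim, hid=embed_dim, vocab=10000)

    F = torch.randn(N, feat_dim)                      # region features from a detector
    v_hat = attn(F)                                   # (N, 1024) multi-channel representation
    v_g = caps(v_hat).unsqueeze(0)                    # (1, 1024) image-level representation
    caption_ids = generate_caption(step, v_hat, v_g, hid=embed_dim)
    print(caption_ids)                                # token ids; map through a vocabulary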
The foregoing shows and describes the general principles, principal features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which are presented in the specification and drawings only to illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (2)

1. A novel image description generation algorithm using a capsule network, characterized by the steps of:
(1) processing region-level image features using a multi-channel bilinear pooling attention module;

taking the region-level image feature matrix F and embedding the feature vectors Q_E, K_E, V_E, wherein K_E and V_E are both initialized to F and Q_E is initialized to the average pooling of all region-level image features:

Q_E = (1/N) Σ_{i=1}^{N} f_i

wherein f_i is the i-th feature of F, N is the number of region-level image features, and Q_E, K_E, V_E are the query vector, relevance vector, and queried vector of the attention mechanism;

first, the low-rank bilinear pooling algorithm is applied to Q_E and K_E to obtain the intermediate representation k'_i of the i-th dimension:

k'_i = σ(W_k k_i) ⊙ σ(W_q^k Q_E)

bilinear pooling of Q_E and V_E likewise yields the i-th dimension v'_i:

v'_i = σ(W_v v_i) ⊙ σ(W_q^v Q_E)

wherein W_k, W_q^k, W_v, W_q^v are the embedding matrices of k_i, Q_E, v_i, Q_E respectively, ⊙ is element-wise multiplication between matrices, and σ is a nonlinear activation function;

for the intermediate representations k'_i, the squeeze-and-excitation operation performs global average pooling to obtain k̄ and captures the channel dependence α_c:

k̄ = (1/N) Σ_{i=1}^{N} k'_i

α_c = sigmoid(W_B σ(W_A k̄))

wherein W_A, W_B are embedding matrices, N is the number of intermediate representations k'_i, and σ is a nonlinear activation function;

from α_c and the v'_i, the multi-channel visual representation v̂ is obtained:

v̂_i = α_c ⊙ v'_i, i = 1, ..., N

wherein v̂_i is the i-th dimension of v̂;

(2) extracting an image-level visual representation using a capsule network;

the multi-channel visual representation v̂ is input into the capsule network for 2-4 rounds of dynamic routing, the capsule network parameters being updated after each round, to obtain the final image-level visual representation v_g; the capsule network operation formulas are:

û_i = W_i^f μ_i

v_g = squash(s), s = Σ_i c_i û_i, squash(s) = (‖s‖^2 / (1 + ‖s‖^2)) · (s / ‖s‖)

wherein μ_i is the activity vector of the i-th bottom-level capsule, namely the i-th dimension v̂_i of the multi-channel visual representation, W_i^f is the transformation matrix of μ_i, and c_i is the coupling coefficient corresponding to μ_i in the capsule network;

the coupling coefficient updating formulas of the capsule network are:

c_i = exp(b_i) / Σ_j exp(b_j)

b_i ← b_i + û_i · v_g

wherein b_i, b_j are dimensions i, j of the routing matrix in the capsule network, and b_i self-updates by accumulating the product of the transformed μ_i and v_g;

(3) decoding the image-level visual representation v_g using the LSTM and bilinear pooling modules to obtain the image description;

the decoder comprises one LSTM layer; at each time step a word is generated through the LSTM layer and the bilinear pooling module, and after looping over T time steps a description sentence of length T is obtained, where T is at most 17; at time t, the average pooling F̄ of the region-level image feature matrix and the image-level visual representation v_g form the joint representation ṽ; the context vector c_{t-1} computed at time t-1 and the word vector s_{t-1} generated at time t-1 are concatenated into x_t, which is input to the LSTM to obtain the hidden vector h_t; h_t is output to the bilinear pooling module and the GLU module to obtain the context vector c_t, and the word s_t is generated after a softmax operation;

the LSTM input x_t is computed as:

ṽ = W_F [F̄; v_g]

x_t = W_x [ṽ; c_{t-1}; s_{t-1}]

wherein W_F, W_x are embedding matrices and [;] denotes concatenation;

the hidden vector h_t is computed as:

h_t = LSTM(x_t, h_{t-1})

wherein h_{t-1} is the hidden state matrix of the LSTM at time t-1;

the context vector c_t is computed as:

c_t = GLU(F_X-Linear(K_D, V_D, Q_D))

wherein F_X-Linear is the computation function of the bilinear pooling module, K_D, V_D, Q_D are the relevance vector, queried vector, and query vector in the bilinear pooling module, K_D is initialized to the LSTM hidden state h_t, and V_D, Q_D are initialized to the visual joint representation ṽ;

the word generation formula is:

s_t = softmax(W_c c_t)

wherein s_t is the word generated at time t, and W_c is the embedding matrix of c_t.
2. The novel image description generation algorithm using a capsule network according to claim 1, characterized in that the nonlinear activation function is a CELU activation function.
CN202111572920.7A 2021-12-21 2021-12-21 Novel image description generation method using capsule network Active CN114386569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111572920.7A CN114386569B (en) 2021-12-21 2021-12-21 Novel image description generation method using capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111572920.7A CN114386569B (en) 2021-12-21 2021-12-21 Novel image description generation method using capsule network

Publications (2)

Publication Number Publication Date
CN114386569A true CN114386569A (en) 2022-04-22
CN114386569B CN114386569B (en) 2024-08-23

Family

ID=81197925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111572920.7A Active CN114386569B (en) 2021-12-21 2021-12-21 Novel image description generation method using capsule network

Country Status (1)

Country Link
CN (1) CN114386569B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229162A (en) * 2023-02-20 2023-06-06 北京邮电大学 Semi-autoregressive image description method based on capsule network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711463A (en) * 2018-12-25 2019-05-03 广东顺德西安交通大学研究院 Important object detection method based on attention
US20210042579A1 (en) * 2018-11-30 2021-02-11 Tencent Technology (Shenzhen) Company Limited Image description information generation method and apparatus, and electronic device
US20210142081A1 (en) * 2019-11-11 2021-05-13 Coretronic Corporation Image recognition method and device
CN113535950A (en) * 2021-06-15 2021-10-22 杭州电子科技大学 Small sample intention recognition method based on knowledge graph and capsule network
CN113569932A (en) * 2021-07-18 2021-10-29 湖北工业大学 Image description generation method based on text hierarchical structure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210042579A1 (en) * 2018-11-30 2021-02-11 Tencent Technology (Shenzhen) Company Limited Image description information generation method and apparatus, and electronic device
CN109711463A (en) * 2018-12-25 2019-05-03 广东顺德西安交通大学研究院 Important object detection method based on attention
US20210142081A1 (en) * 2019-11-11 2021-05-13 Coretronic Corporation Image recognition method and device
CN113535950A (en) * 2021-06-15 2021-10-22 杭州电子科技大学 Small sample intention recognition method based on knowledge graph and capsule network
CN113569932A (en) * 2021-07-18 2021-10-29 湖北工业大学 Image description generation method based on text hierarchical structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张亚彤 (Zhang Yatong) et al.: "A multi-channel relation extraction model combining attention and capsule networks", Journal of Chinese Computer Systems (《小型微型计算机系统》), 13 April 2021 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229162A (en) * 2023-02-20 2023-06-06 北京邮电大学 Semi-autoregressive image description method based on capsule network
CN116229162B (en) * 2023-02-20 2024-07-30 北京邮电大学 Semi-autoregressive image description method based on capsule network

Also Published As

Publication number Publication date
CN114386569B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
CN108388900B (en) Video description method based on combination of multi-feature fusion and space-time attention mechanism
US11657230B2 (en) Referring image segmentation
CN109711463B (en) Attention-based important object detection method
CN111079532B (en) Video content description method based on text self-encoder
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN108829677B (en) Multi-modal attention-based automatic image title generation method
Liu et al. A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection
JP2022509299A (en) How to generate video captions, appliances, devices and computer programs
US11868738B2 (en) Method and apparatus for generating natural language description information
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
Pu et al. Adaptive feature abstraction for translating video to text
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN114896434A (en) Hash code generation method and device based on center similarity learning
CN114386569A (en) Novel image description generation algorithm using capsule network
CN113420179B (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN113763232B (en) Image processing method, device, equipment and computer readable storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN113657200A (en) Video behavior action identification method and system based on mask R-CNN
CN117635275A (en) Intelligent electronic commerce operation commodity management platform and method based on big data
KR102198480B1 (en) Video summarization apparatus and method via recursive graph modeling
CN110826397B (en) Video description method based on high-order low-rank multi-modal attention mechanism
CN117173715A (en) Attention visual question-answering method and device, electronic equipment and storage medium
Arif et al. Video representation by dense trajectories motion map applied to human activity recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant