CN114386569A - Novel image description generation algorithm using capsule network - Google Patents
- Publication number
- CN114386569A (application CN202111572920.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- capsule network
- level
- vector
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/02 — Neural networks
- G06N3/04 — Architecture, e.g. interconnection topology
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
A novel image description generation algorithm using a capsule network first processes region-level image features with a multi-channel bilinear pooling attention module: the region-level features pass through a bilinear pooling attention mechanism and a squeeze-and-excitation operation to yield multi-channel attention visual features. The multi-channel features are then input into a capsule network, each dimension of the region-level features serving as an activity vector of a bottom-level capsule, and dynamic routing calculation aggregates the region-level features into a global-level image feature. Finally, during decoding, the LSTM hidden vector, the image feature, and the word vector of the word generated at the previous time step are used as the input of the next time step, and a bilinear pooling algorithm updates the features and the hidden state to generate the corresponding word. Over multiple LSTM time steps, the generated words compose the corresponding description. The invention enables a capsule network to capture relative position relations during image description generation and to produce better image descriptions.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and mainly relates to a novel image description generation algorithm using a capsule network.
Background
The image description generation task connects two major directions in the field of artificial intelligence: computer vision and natural language processing. In real life, people automatically establish relations among visual features such as scenes and objects in an image and perceive its high-level semantic information, but a computer cannot understand and organize this information as the human brain does. The image description generation task aims to convert image features into semantic information, helping computers better understand image content. To realize the conversion from image to text, early work mainly started from two directions, templates and retrieval: either filling detected object names into a language template to generate a description, or retrieving a similar picture and modifying its description to produce one for the new picture. However, both approaches have drawbacks: descriptions generated by template-based methods are limited to a fixed length and an invariable format, while retrieval-based methods depend on the dataset, cannot adapt to new pictures, and struggle to generate high-quality image descriptions.
The framework of current classical image description generation methods is the encoder-decoder structure, and research mainly focuses on image feature processing and the application of attention mechanisms. Image feature processing mainly concerns the encoder: features of different regions and different levels are extracted from a picture and then processed to improve description quality. For example, the SCA-CNN method analyzes the spatial, multi-layer, and multi-channel characteristics of convolutional neural networks and achieves better results by combining channel attention and spatial attention. The Bottom-Up method selectively extracts picture region features through object detection and entity recognition, generating more accurate and complete descriptions. The X-Linear method uses spatial and channel bilinear pooling to obtain second-order interactions of image features, enhancing the model's expressive power.
The main goal of work on attention mechanisms is to strengthen the correlation between image regions and words so as to obtain more semantic detail. The visual sentinel method lets the attention mechanism decide by itself whether to attend to the image or the sentence, generating a corresponding entity word or preposition. The backtracking and prediction method extends attention to a window of two words, making descriptions more coherent and closer to human language habits. Scene-graph methods make the algorithm attend more to the entities, attributes, and relations between entities in the picture, improving description accuracy. With the application of the Transformer model, attention mechanisms have been improved further still, with better results.
At present, processing image features to obtain deeper information is a general direction; it can serve as a front-end to an attention mechanism, and fusing the two parts can produce higher-quality image feature representations. Existing visual attention mechanisms can focus on different positions of a picture while generating the text sequence so as to select corresponding words, but these attention shifts cannot capture the relative spatial positions of objects in the picture. The present invention uses a capsule network to improve the attention mechanism, making full use of the spatial information conveyed in the image to generate more accurate and detailed descriptions.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a novel image description generation algorithm using a capsule network, capturing spatial relative position relations through transformation matrices and thereby solving the problem that traditional visual attention mechanisms cannot fully capture spatial relations.
The technical scheme of the invention is as follows:
a novel image description generation algorithm using a capsule network, characterized by the steps of:
(1) processing region-level image features using a bilinear pooling attention module with multiple channels;
Take the region-level image feature matrix F and the embedded feature vectors Q_E, K_E, V_E. K_E and V_E are both initialized to F, and Q_E is initialized to the average pooling of all region-level image features:

Q_E = (1/N) ∑_{i=1}^{N} f_i

where f_i is the i-th dimension of F, N is the number of region-level image features, and Q_E, K_E, V_E are the query vector, the relevance vector, and the queried vector of the attention mechanism;
First, compute the low-rank bilinear pooling of Q_E with the i-th dimension k_i of K_E to obtain an intermediate representation B_i^k, and the bilinear pooling of Q_E with the i-th dimension v_i of V_E to obtain B_i^v:

B_i^k = σ(W_k k_i) ⊙ σ(W_q Q_E)
B_i^v = σ(W_v v_i) ⊙ σ(W'_q Q_E)

where W_k, W_q, W_v, W'_q are the embedding matrices of k_i, Q_E, v_i, Q_E respectively, ⊙ is element-wise multiplication between matrices, and σ is a nonlinear activation function;
A squeeze-and-excitation operation is then applied to the intermediate representation: global average pooling squeezes it into a channel descriptor, and the excitation step captures its channel dependence α_c;
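Step (1) can be sketched in NumPy as follows. The weight matrices (W_k, W_q, W_v, w_b, W_e), the intermediate width d_mid, and the random initialization are illustrative assumptions for the sketch, not the patent's trained parameters, and the squeeze-and-excitation gate here follows the standard SE formulation:

```python
import numpy as np

def celu(x, alpha=1.0):
    # CELU activation, the nonlinearity named in the patent
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0) / alpha) - 1.0))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_pool_attention(F, d_mid=64, seed=0):
    """Sketch of the multi-channel low-rank bilinear pooling attention step.

    F: (N, D) region-level features. Weight names and shapes are
    illustrative; a trained model would learn them.
    """
    rng = np.random.default_rng(seed)
    N, D = F.shape
    q = F.mean(axis=0)                       # Q_E: average pooling of regions
    W_k = rng.normal(0, 0.02, (d_mid, D))
    W_q = rng.normal(0, 0.02, (d_mid, D))
    W_v = rng.normal(0, 0.02, (d_mid, D))
    w_b = rng.normal(0, 0.02, (d_mid,))
    W_e = rng.normal(0, 0.02, (d_mid, d_mid))

    # low-rank bilinear interaction of each key/value with the query
    Bk = celu(F @ W_k.T) * celu(q @ W_q.T)   # (N, d_mid) intermediate B_i^k
    Bv = celu(F @ W_v.T) * celu(q @ W_q.T)   # (N, d_mid) intermediate B_i^v

    beta = softmax(Bk @ w_b)                 # spatial attention over regions

    # squeeze-and-excitation: global average pool, then channel gates alpha_c
    squeezed = Bk.mean(axis=0)               # (d_mid,)
    alpha_c = 1.0 / (1.0 + np.exp(-(W_e @ squeezed)))

    v_hat = alpha_c * (beta @ Bv)            # channel-attended visual feature
    return v_hat

F = np.random.default_rng(1).normal(size=(36, 2048))  # 36 region features
v = bilinear_pool_attention(F)
print(v.shape)  # (64,)
```

The spatial attention weights beta and the channel gates alpha_c together realize the "multi-channel" attention: one weighting over regions, one over feature channels.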
(2) extracting an image-level visual representation using a capsule network;
Input the multi-channel visual representation into the capsule network and perform 2-4 dynamic routing operations, updating the capsule network parameters after each operation, to obtain the final image-level visual representation v^f;
The capsule network operation formula is as follows:

v^f = squash( ∑_i c_i^f W_i^f μ_i )

where v^f is the image-level visual representation output by the capsule network, W_i^f is the transformation matrix of μ_i, and c_i^f is the coupling coefficient corresponding to μ_i in the capsule network;
The coupling coefficient updating formula of the capsule network is as follows:

c_i^f = exp(b_i) / ∑_j exp(b_j),  b_i ← b_i + (W_i^f μ_i) · v^f

where b_i, b_j are dimensions i, j of the routing matrix in the capsule network, and b_i updates itself by accumulating the product of the transformed μ_i and the output v^f;
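A minimal sketch of the dynamic routing of step (2), using the standard capsule-network squash nonlinearity and agreement update. The capsule sizes and the per-capsule transformation matrices W_i^f are random placeholders here; in the patent they are learned and retain relative-position information:

```python
import numpy as np

def squash(s, eps=1e-9):
    # capsule "squash": keeps the vector's direction, bounds its norm below 1
    n2 = np.sum(s * s)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(mu, W, iters=3):
    """Aggregate low-level capsules mu into one global capsule.

    mu: (N, d_in) activity vectors, one per region-feature dimension.
    W:  (N, d_out, d_in) transformation matrices W_i^f.
    iters=3 lies in the 2-4 routing iterations the text specifies.
    """
    u_hat = np.einsum('noi,ni->no', W, mu)   # predictions W_i^f mu_i
    b = np.zeros(len(mu))                    # routing logits b_i
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum()      # coupling coefficients c_i^f
        v = squash((c[:, None] * u_hat).sum(axis=0))
        b = b + u_hat @ v                    # accumulate agreement (W_i^f mu_i) . v
    return v

rng = np.random.default_rng(0)
mu = rng.normal(size=(36, 16))               # 36 low-level capsules
W = rng.normal(0, 0.1, size=(36, 8, 16))
v_f = dynamic_routing(mu, W)
print(v_f.shape)  # (8,)
```

Capsules that agree with the emerging output vector receive larger coupling coefficients on the next iteration, which is how the routing aggregates region-level features into one global representation.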
(3) decoding the image-level visual representation v^f using the LSTM and bilinear pooling modules to obtain the image description;
the decoder comprises a layer of LSTM, a word is generated through the LSTM layer and the bilinear pooling module at each time step, T time steps are circulated, and a description sentence with the length of T is finally obtained, wherein the longest length of T is 17; at time t, the average of the region level image feature matrix is pooledAnd image-level visual representationBy a joint representation ofContext vector c calculated at time t-1t-1And a word vector s generated at time t-1t-1Splicing is xtInputting LSTM to obtain a hidden layer vector htAnd output to bilinear pooling module and GLU module to obtain context vector ctGenerating a word st after softmax operation;
the LSTM input xtThe calculation formula of (a) is as follows:
wherein, WF、WxIs an embedded matrix;
The hidden vector h_t is calculated as follows:

h_t = LSTM(x_t, h_{t-1})

where h_{t-1} is the hidden state matrix of the LSTM at time t-1;
The context vector c_t is calculated as follows:

c_t = GLU(F_X-Linear(K_D, V_D, Q_D))

where F_X-Linear is the calculation function of the bilinear pooling module; K_D, V_D, Q_D are the relevance vector, queried vector, and query vector in the bilinear pooling module; K_D is initialized to the LSTM hidden state h_t, and V_D, Q_D are initialized to the visual joint representation;
The formula for generating words is as follows:

s_t = softmax(W_c c_t)

where s_t is the word generated at time t and W_c is the embedding matrix of c_t.
The nonlinear activation function is a CELU activation function.
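The decoding loop of step (3) can be sketched as follows. The LSTM cell, GLU gate, and greedy softmax choice follow the formulas above, but the bilinear pooling module F_X-Linear is stubbed with a plain linear map (W_attn), and all sizes and randomly initialized parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 10            # illustrative sizes; a real model uses far larger ones

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W):
    # minimal LSTM cell: W maps [x; h] to the four gate pre-activations
    i, f, o, g = np.split(W @ np.concatenate([x, h]), 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def glu(x):
    # gated linear unit: first half gated by sigmoid of the second half
    a, b = np.split(x, 2)
    return a * sigmoid(b)

# illustrative parameters (random; a trained model would learn these)
x_dim = 3 * d                                    # [joint; ctx_prev; word_prev]
W_lstm = rng.normal(0, 0.1, (4 * d, x_dim + d))  # LSTM gate weights
W_attn = rng.normal(0, 0.1, (2 * d, 2 * d))      # stub for F_X-Linear
W_c = rng.normal(0, 0.1, (vocab, d))             # output projection
E = rng.normal(0, 0.1, (vocab, d))               # word embedding table

f_bar = rng.normal(size=d)                       # average region feature
v_hat = rng.normal(size=d)                       # capsule-network output v^f
joint = f_bar + v_hat                            # joint visual representation

h = np.zeros(d); c = np.zeros(d)
ctx = np.zeros(d); word = 0
words = []
for t in range(5):                               # T = 5 steps for the demo
    x_t = np.concatenate([joint, ctx, E[word]])          # splice inputs
    h, c = lstm_cell(x_t, h, c, W_lstm)                  # h_t = LSTM(x_t, h_{t-1})
    ctx = glu(W_attn @ np.concatenate([h, joint]))       # c_t = GLU(F(K_D,V_D,Q_D))
    logits = W_c @ ctx                                   # s_t = softmax(W_c c_t)
    word = int(np.argmax(logits))                        # greedy word choice
    words.append(word)
print(words)
```

Feeding the previously generated word back into x_t is what lets the decoder condition each word on the sentence so far, as the description section explains.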
Beneficial effects of the invention: the invention provides a novel image description generation algorithm using a capsule network. Multi-channel bilinear pooling attention achieves second-order interaction of image features across channels, and the capsule network captures the relative spatial positions of regional features, guiding the decoder to notice the positional relations between entities in a sentence, to accurately generate words reflecting those relations, and thus to produce higher-quality image descriptions.
Drawings
FIG. 1 is a flow chart of an image description generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the capsule network effect according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an algorithm framework according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Fig. 1 is a flowchart of an image description generation method according to an embodiment of the present invention. As shown in Fig. 1, an embodiment of the present invention provides an improved image description generation method using a capsule network, including:
101, processing region level image characteristics by using a multi-channel bilinear pooling attention module;
In the embodiment of the present invention, the region-level features of the picture correspond to N vectors of length 2048, respectively representing the features of N sub-regions of the picture, where the N regions contain the information of the main objects in the picture. Based on the region-level image features, we use them to initialize the inputs Q_E, K_E, V_E of the multi-channel bilinear pooling attention module: Q_E is set to the average pooling matrix of the region-level features, and K_E and V_E are set to the region-level features; all are then input into the multi-channel bilinear pooling attention module for computation.
A regular image feature matrix of dimension N x 1024 is obtained through low-rank bilinear pooling, reflecting the region-level image features. The attention capsule network takes each dimension of the region-level features as an activity vector of a bottom-level capsule; through dynamic routing calculation, the spatial relation information between the salient regions and the whole image is retained in the transformation matrices of the dynamic routing, so the region-level features are aggregated into a global-level image feature. FIG. 2 illustrates the effect of the capsule network according to an embodiment of the present invention: for a vector, different transformation matrices realize different pose transformations, including translation, rotation, scaling, and the like. Similarly, each dimension of the region features represents a salient region, and during the capsule network calculation the relative position relations among the regions are kept in the continuously updated transformation matrices, from which the unique global image feature is updated.
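The pose transformations attributed to FIG. 2 (translation, rotation, scaling realized by different transformation matrices) can be made concrete with 2-D homogeneous-coordinate matrices. The specific matrices below are illustrative examples, not learned capsule weights:

```python
import numpy as np

# Homogeneous 2-D transforms of the kind a capsule transformation matrix can encode.
theta = np.pi / 2
rotate = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])          # rotate 90 degrees about the origin
translate = np.array([[1.0, 0.0,  2.0],
                      [0.0, 1.0, -1.0],
                      [0.0, 0.0,  1.0]])      # shift by (+2, -1)
scale = np.diag([3.0, 0.5, 1.0])              # stretch x by 3, shrink y by 2

pose = np.array([1.0, 0.0, 1.0])              # point (1, 0) in homogeneous form
print(np.round(rotate @ pose, 6))             # [0. 1. 1.]
print(translate @ pose)                       # [ 3. -1.  1.]
print(scale @ pose)                           # [3. 0. 1.]
```

Because each transform is a single matrix multiplication, a learned W_i^f can encode such relative-position relationships between a region capsule and the global capsule.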
The decoding stage, step 103, uses the image-level visual representation generated in step 102, together with the LSTM hidden state and the word vector of the word generated at the previous time step, to update the hidden state and obtain the output vector that generates the word at the current time step. Fusing the previously generated word vector into the input directs the decoder to generate sentences that better conform to human language conventions. The decoding stage still needs the bilinear pooling attention module, whose inputs are the hidden state and the region-level features: the hidden state reflects the direction of attention transfer, and the result of the bilinear pooling calculation lets the hidden state highlight the attended content by fusing the region-level features, promoting the generation of the next word. Over multiple time steps of the LSTM model, the decoder finally generates the description sentence corresponding to the picture.
Fig. 3 is an overall framework diagram of the algorithm, reflecting the whole process from a picture to its generated description. After an image is input, the region-level image features are first processed by the multi-channel bilinear pooling attention block: bilinear pooling and the squeeze-and-excitation operation turn the region-level features into multi-channel visual features. The multi-channel features are then input into the attention capsule network, each dimension serving as a low-level capsule, and dynamic routing calculation aggregates the region-level features into a global-level image feature. Finally, during decoding, the LSTM hidden vector, the image feature, and the word vector of the word generated at the previous time step are used as the input of the next time step, and bilinear pooling updates the features and the hidden state to generate the corresponding word. Over multiple LSTM time steps, the generated words compose the corresponding description. In summary: the invention provides an improved image description generation method using a capsule network. Multi-channel low-rank bilinear pooling captures second-order interactions of image features, and the capsule network extracts the relative position relations among regional features, guiding the decoder to notice the positional and spatial relations between entities in a sentence and to generate words reflecting them, yielding higher-quality image descriptions.
The foregoing shows and describes the general principles and broad features of the present invention and its advantages. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which are presented only to illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (2)
1. A novel image description generation algorithm using a capsule network, characterized by the steps of:
(1) processing region-level image features using a bilinear pooling attention module with multiple channels;
Take the region-level image feature matrix F and the embedded feature vectors Q_E, K_E, V_E. K_E and V_E are both initialized to F, and Q_E is initialized to the average pooling of all region-level image features:

Q_E = (1/N) ∑_{i=1}^{N} f_i

where f_i is the i-th dimension of F, N is the number of region-level image features, and Q_E, K_E, V_E are the query vector, the relevance vector, and the queried vector of the attention mechanism;
First, compute the low-rank bilinear pooling of Q_E with the i-th dimension k_i of K_E to obtain an intermediate representation B_i^k, and the bilinear pooling of Q_E with the i-th dimension v_i of V_E to obtain B_i^v:

B_i^k = σ(W_k k_i) ⊙ σ(W_q Q_E)
B_i^v = σ(W_v v_i) ⊙ σ(W'_q Q_E)

where W_k, W_q, W_v, W'_q are the embedding matrices of k_i, Q_E, v_i, Q_E respectively, ⊙ is element-wise multiplication between matrices, and σ is a nonlinear activation function;
A squeeze-and-excitation operation is then applied to the intermediate representation: global average pooling squeezes it into a channel descriptor, and the excitation step captures its channel dependence α_c;
(2) extracting an image-level visual representation using a capsule network;
Input the multi-channel visual representation into the capsule network and perform 2-4 dynamic routing operations, updating the capsule network parameters after each operation, to obtain the final image-level visual representation v^f;
The capsule network operation formula is as follows:

v^f = squash( ∑_i c_i^f W_i^f μ_i )

where v^f is the image-level visual representation output by the capsule network, W_i^f is the transformation matrix of μ_i, and c_i^f is the coupling coefficient corresponding to μ_i in the capsule network;
The coupling coefficient updating formula of the capsule network is as follows:

c_i^f = exp(b_i) / ∑_j exp(b_j),  b_i ← b_i + (W_i^f μ_i) · v^f

where b_i, b_j are dimensions i, j of the routing matrix in the capsule network, and b_i updates itself by accumulating the product of the transformed μ_i and the output v^f;
(3) decoding the image-level visual representation v^f using the LSTM and bilinear pooling modules to obtain the image description;
the decoder comprises a layer of LSTMs in each of whichGenerating words through an LSTM layer and a bilinear pooling module at the time step, and circulating T time steps to finally obtain a description sentence with the length of T, wherein the longest length of T is 17; at time t, the average of the region level image feature matrix is pooledAnd image-level visual representationBy a joint representation ofContext vector c calculated at time t-1t-1And a word vector s generated at time t-1t-1Splicing is xtInputting LSTM to obtain a hidden layer vector htAnd output to bilinear pooling module and GLU module to obtain context vector ctGenerating words s after softmax operationt;
The LSTM input x_t is calculated as follows:

x_t = [ W_F (f̄ + v^f); c_{t-1}; W_x s_{t-1} ]

where f̄ is the average-pooled region feature, v^f the image-level visual representation, and W_F, W_x are embedding matrices;
The hidden vector h_t is calculated as follows:

h_t = LSTM(x_t, h_{t-1})

where h_{t-1} is the hidden state matrix of the LSTM at time t-1;
The context vector c_t is calculated as follows:

c_t = GLU(F_X-Linear(K_D, V_D, Q_D))

where F_X-Linear is the calculation function of the bilinear pooling module; K_D, V_D, Q_D are the relevance vector, queried vector, and query vector in the bilinear pooling module; K_D is initialized to the LSTM hidden state h_t, and V_D, Q_D are initialized to the visual joint representation;
The formula for generating words is as follows:

s_t = softmax(W_c c_t)

where s_t is the word generated at time t and W_c is the embedding matrix of c_t.
2. The novel image description generation algorithm using capsule network according to claim 1, characterized in that the nonlinear activation function is a CELU activation function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111572920.7A CN114386569B (en) | 2021-12-21 | 2021-12-21 | Novel image description generation method using capsule network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111572920.7A CN114386569B (en) | 2021-12-21 | 2021-12-21 | Novel image description generation method using capsule network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114386569A true CN114386569A (en) | 2022-04-22 |
CN114386569B CN114386569B (en) | 2024-08-23 |
Family
ID=81197925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111572920.7A Active CN114386569B (en) | 2021-12-21 | 2021-12-21 | Novel image description generation method using capsule network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114386569B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116229162A (en) * | 2023-02-20 | 2023-06-06 | 北京邮电大学 | Semi-autoregressive image description method based on capsule network |
- 2021-12-21: application CN202111572920.7A — patent CN114386569B (Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210042579A1 (en) * | 2018-11-30 | 2021-02-11 | Tencent Technology (Shenzhen) Company Limited | Image description information generation method and apparatus, and electronic device |
CN109711463A (en) * | 2018-12-25 | 2019-05-03 | 广东顺德西安交通大学研究院 | Important object detection method based on attention |
US20210142081A1 (en) * | 2019-11-11 | 2021-05-13 | Coretronic Corporation | Image recognition method and device |
CN113535950A (en) * | 2021-06-15 | 2021-10-22 | 杭州电子科技大学 | Small sample intention recognition method based on knowledge graph and capsule network |
CN113569932A (en) * | 2021-07-18 | 2021-10-29 | 湖北工业大学 | Image description generation method based on text hierarchical structure |
Non-Patent Citations (1)
Title |
---|
ZHANG Yatong et al., "Multi-channel relation extraction model combining attention and capsule networks", Journal of Chinese Computer Systems (小型微型计算机系统), 13 April 2021 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116229162A (en) * | 2023-02-20 | 2023-06-06 | 北京邮电大学 | Semi-autoregressive image description method based on capsule network |
CN116229162B (en) * | 2023-02-20 | 2024-07-30 | 北京邮电大学 | Semi-autoregressive image description method based on capsule network |
Also Published As
Publication number | Publication date |
---|---|
CN114386569B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108388900B (en) | Video description method based on combination of multi-feature fusion and space-time attention mechanism | |
US11657230B2 (en) | Referring image segmentation | |
CN109711463B (en) | Attention-based important object detection method | |
CN111079532B (en) | Video content description method based on text self-encoder | |
CN110737801B (en) | Content classification method, apparatus, computer device, and storage medium | |
CN108829677B (en) | Multi-modal attention-based automatic image title generation method | |
Liu et al. | A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection | |
JP2022509299A (en) | How to generate video captions, appliances, devices and computer programs | |
US11868738B2 (en) | Method and apparatus for generating natural language description information | |
CN112329525A (en) | Gesture recognition method and device based on space-time diagram convolutional neural network | |
CN111241963B (en) | First person view video interactive behavior identification method based on interactive modeling | |
Pu et al. | Adaptive feature abstraction for translating video to text | |
CN114283352A (en) | Video semantic segmentation device, training method and video semantic segmentation method | |
Khurram et al. | Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics | |
CN114896434A (en) | Hash code generation method and device based on center similarity learning | |
CN114386569A (en) | Novel image description generation algorithm using capsule network | |
CN113420179B (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN113763232B (en) | Image processing method, device, equipment and computer readable storage medium | |
CN111445545B (en) | Text transfer mapping method and device, storage medium and electronic equipment | |
CN113657200A (en) | Video behavior action identification method and system based on mask R-CNN | |
CN117635275A (en) | Intelligent electronic commerce operation commodity management platform and method based on big data | |
KR102198480B1 (en) | Video summarization apparatus and method via recursive graph modeling | |
CN110826397B (en) | Video description method based on high-order low-rank multi-modal attention mechanism | |
CN117173715A (en) | Attention visual question-answering method and device, electronic equipment and storage medium | |
Arif et al. | Video representation by dense trajectories motion map applied to human activity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||