CN114782848A - Picture subtitle generating method applying characteristic pyramid - Google Patents

Picture subtitle generating method applying characteristic pyramid

Info

Publication number
CN114782848A
CN114782848A (application CN202210233662.8A)
Authority
CN
China
Prior art keywords
picture
layer
dimension
batch
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210233662.8A
Other languages
Chinese (zh)
Other versions
CN114782848B (en)
Inventor
徐萍
毕东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Yayi Network Technology Co ltd
Original Assignee
Shenyang Yayi Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co ltd filed Critical Shenyang Yayi Network Technology Co ltd
Priority to CN202210233662.8A priority Critical patent/CN114782848B/en
Publication of CN114782848A publication Critical patent/CN114782848A/en
Application granted granted Critical
Publication of CN114782848B publication Critical patent/CN114782848B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for generating a picture subtitle by applying a feature pyramid, which comprises the following steps: inputting the preprocessed picture into a feature pyramid module and extracting feature maps as picture feature information, which together with the preprocessed picture are called picture features of three different scales (low dimension, high dimension and original dimension); sending the original-dimension picture features into an embedding layer to be converted into a vector representation; sending the picture features of the three different scales into the first layer of an encoder and carrying out dimension scaling; sending the hidden layer information of consistent dimension to the stacked higher layers of the encoder to obtain three encoder features, splicing them to obtain fused picture features, and sending the fused picture features to the decoder of the model for decoding; and performing gradient updating through a cross entropy loss function and optimizing the weights of the model to obtain the picture subtitle generating method. The invention enhances the semantic expression capability of the picture from different angles and different visual-field distances, and effectively reduces the calculation cost of the self-attention mechanism and the feedforward neural network in the encoder.

Description

Picture subtitle generating method applying characteristic pyramid
Technical Field
The invention relates to image and language processing technology, and in particular to a picture subtitle generating method based on a feature pyramid.
Background
Image captioning (Image Caption) can be regarded as a global object detection task: a sentence describing the content of a picture is generated from the whole picture. Early image caption generation methods were based on traditional machine learning; they extracted image features with hand-crafted image processing operators, classified them with a support vector machine or similar to obtain the targets in the image, and then used the obtained targets and their attributes as the basis for generating sentences, for example using a CRF or a set of established rules to produce descriptions of images. This practice is not ideal in real applications and depends heavily on 1) the extraction of image features and 2) the rules required when generating sentences.
Deep learning has driven the rapid development of computer vision, and image coding and feature extraction have greatly benefited from the development of CNNs. With the emergence of deep CNN encoders such as VGG (the Visual Geometry Group network), the accuracy of tasks such as image recognition improved rapidly. Owing to the strong image feature extraction capability of CNNs, it became mainstream practice to use a deep CNN as the image feature encoder in the image caption task. Google proposed the Neural Image Caption model in 2014 as the pioneering work of this approach, and subsequent models with a large impact on image caption development, such as NeuralTalk, almost all follow this basic framework.
As the Transformer model has become more popular and more widely advocated in the field of natural language processing, much work in the image field now tries to use Transformers to extract more powerful image features. The Vision Transformer, developed from the Transformer, has achieved good results on the major image tasks. As shown in fig. 1, the Vision Transformer still adopts a codec (encoder-decoder) structure and uses an attention mechanism to encode and decode the picture features and the position information of the sub-pictures. In the attention calculation, a multi-head splitting scheme is adopted so that different heads attend to information in different semantic subspaces of the picture. The attention mechanisms include self-attention and codec attention. They differ in that in self-attention the query, key and value vectors are all intermediate vectors of the same layer, whereas in codec attention the query vector is an intermediate vector at the decoding end and the key and value vectors are the encoding vectors output by the encoding end.
As shown in fig. 2, the feature pyramid model was originally proposed in the image domain. Image tasks generally employ convolutional neural networks, which naturally take the shape of a pyramid owing to the pooling layers. In the target detection task, targets come in different sizes, and the granularity of the features extracted at different layers of the convolutional network also differs, so the feature maps of different layers can be regarded as picture feature information at different scales.
However, because of the feature extraction in the feature pyramid, the dimensions of the image features change and become inconsistent, so they cannot be fused; as a result, the feature pyramid structure cannot be used directly in practical applications to enhance the image information and obtain a stronger image representation.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a picture subtitle generating method applying a feature pyramid. Before the picture is sent to a Vision Transformer model for coding and decoding picture features, the feature pyramid model is beneficial to more fully extracting the picture features.
In order to solve the technical problems, the invention adopts the technical scheme that:
a method for generating a picture subtitle by applying a feature pyramid comprises the following steps:
1) inputting the preprocessed picture into a feature pyramid module, and extracting features of the picture through the multilayer convolutional neural network in the feature pyramid module; the feature maps of a low-level convolutional layer and of the top-level convolutional layer are extracted as picture feature information at two scales, low-dimensional and high-dimensional, and together with the preprocessed picture they are called picture features of three different scales: low dimension, high dimension and original dimension;
2) sending the original dimensional picture characteristics into an embedding layer to be converted into vector representation;
3) sending the three picture characteristics with different dimensions into a first layer of an encoder, and carrying out dimension scaling, namely scaling the picture characteristics with different dimensions into hidden layer information with the same dimension through a self-attention mechanism and a feedforward neural network;
4) sending hidden layer information with consistent dimension size to a stacked high layer in an encoder to obtain three encoder characteristics, and obtaining fused picture characteristics through splicing operation;
5) and sending the fused picture characteristics into a decoder of the model for decoding, decoding the picture characteristics into picture subtitles by the decoder through stacked decoder layers, performing gradient updating through a cross entropy loss function, and optimizing the weight of the model to obtain the picture subtitle generating method.
In the step 1), preprocessing picture data, inputting the picture into a feature pyramid module, and extracting features through a multilayer convolutional neural network, wherein the convolutional neural network is calculated in the following way:
out(x, i) = b_i + Σ_{k=1..C_in} weight(i, k) ⋆ input(x, k)
weight(i, j) = w[:, :, i, j]
input(x, k) = x[:, :, k]
where x ∈ R^(H×W×C) is the tensor representation of the picture in the computer, H is the height of the picture, W its width and C its number of channels; ⋆ denotes the 2D cross-correlation operation; w ∈ R^(K_h×K_w×C_out×C_in) is the convolution kernel, where C_out is the number of channels of the output features, C_in the number of channels of the input features, and K_h×K_w the kernel size; w[:, :, i, j] takes the i-th and j-th slices of w along its third and fourth dimensions, x[:, :, k] takes the k-th slice of x along its third dimension, and b denotes a bias constant; weight(i, j) represents the j-th convolution kernel of channel i in the convolutional neural network, and input(x, k) represents the tensor of the k-th channel of the input x;
taking the output of the first layer of convolutional neural network as a low-dimensional picture characteristic, and taking the output of the last layer of convolutional neural network as a high-dimensional picture characteristic; the original picture is used as the original dimension picture characteristic.
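For illustration only, the following is a minimal PyTorch sketch of such a feature pyramid module; the number of convolutional layers, channel widths, kernel sizes and strides are assumptions for demonstration, not values fixed by the invention. The output of the first convolutional layer serves as the low-dimensional picture feature, the output of the last layer as the high-dimensional picture feature, and the input picture itself as the original-dimension picture feature (channels-first shapes):

import torch
import torch.nn as nn

class FeaturePyramid(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        # first convolutional layer: its output is taken as the low-dimensional feature
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU())
        # last convolutional layer: its output is taken as the high-dimensional feature
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        low = self.conv1(x)     # e.g. (16, 16, 192, 192) for a (16, 3, 384, 384) input
        mid = self.conv2(low)   # channel count grows and spatial size shrinks per layer
        high = self.conv3(mid)  # e.g. (16, 64, 48, 48)
        return x, low, high     # original, low-dimensional and high-dimensional features

original, low_dim, high_dim = FeaturePyramid()(torch.randn(16, 3, 384, 384))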
Step 2) sending the original dimension picture characteristics to an embedding layer to be converted into vector representation, namely, adjusting the height and width of the original dimension picture characteristics to a specified size, dividing the original dimension picture characteristics into sub-pictures with fixed sizes, wherein each sub-picture is called a patch, and then sending the sub-pictures to the embedding layer to obtain the code of each patch, namely, a picture embedding vector, and specifically, the method comprises the following steps:
201) the picture (batch, c, h, w) is divided into sub-pictures with resolution p1×p2: each original-dimension picture is first cut into (h/p1)*(w/p2) small blocks, i.e. from (batch, c, p1*(h/p1), p2*(w/p2)) to (batch, c, (h/p1)*(w/p2), p1*p2), and then converted into (batch, (h/p1)*(w/p2), p1*p2*c), which is equivalent to dividing it into (h/p1)*(w/p2) patches, each of dimension p1*p2*c; this process is implemented by the following calculation:
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p1, p2=p2)
where the rearrange function is an operator of the einops library, p1 and p2 are the patch size, c is the number of channels, b is the batch size, and h and w are the height and width of the image respectively.
202) After the sub-pictures are divided, the embedding vector of the original-dimension picture feature is obtained, and its dimension is adjusted to the required size through one fully connected layer.
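A minimal sketch of steps 201) and 202), assuming an illustrative patch size of 16×16 and a target embedding dimension of 512 (neither value is prescribed by the invention); the rearrange call and the single fully connected layer correspond to the operations described above:

import torch
import torch.nn as nn
from einops import rearrange

batch, c, h, w = 16, 3, 384, 384              # illustrative picture tensor shape
p1 = p2 = 16                                  # assumed patch size
d_model = 512                                 # assumed embedding dimension

img = torch.randn(batch, c, h, w)
# (batch, c, h, w) -> (batch, (h/p1)*(w/p2), p1*p2*c): one row per patch
patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p1, p2=p2)
# patches.shape == (16, 576, 768)

embedding = nn.Linear(p1 * p2 * c, d_model)   # one fully connected layer
x = embedding(patches)                        # picture embedding vectors, (16, 576, 512)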
In step 3), the picture features with different dimensions are sent into the first layer of the encoder, which is composed of three coding layers of different dimensions corresponding respectively to the three picture features; each coding layer is composed of a self-attention mechanism and a feedforward neural network, and the multi-head self-attention mechanism is calculated as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where Q, K, V are the input vectors of the model, head_i is the vector of the i-th head, the W matrices are model parameters, Attention(·) is the attention mechanism function, and Concat(·) is the vector concatenation function;
the feedforward neural network layer FFN is calculated as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where x is the hidden layer vector and W_1, W_2, b_1, b_2 are model parameters obtained automatically through training;
301) (batch, l_0, d_0) is the original picture feature, (batch, h_1, w_1, d_1) the low-dimensional picture feature and (batch, h_2, w_2, d_2) the high-dimensional picture feature; the low-dimensional and high-dimensional picture features are reshaped to (batch, h_1*w_1, d_1) and (batch, h_2*w_2, d_2); sending the three picture features into the multi-head self-attention mechanism yields picture features (batch, l_0, d), (batch, l_1, d) and (batch, l_2, d) with a consistent last dimension d;
302) the three picture features are then sent into the feedforward neural network, which again yields hidden layer information with the same third dimension.
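The dimension scaling in the first encoder layer can be sketched as follows; this is an illustrative implementation under assumptions (hidden size 512, 8 heads, feed-forward size 2048, small input shapes), in which the Q/K/V projections of the multi-head self-attention map each scale-specific input dimension to the shared hidden size and the feedforward neural network keeps that size:

import torch
import torch.nn as nn

class ScalingEncoderLayer(nn.Module):
    # One coding layer of the first encoder layer: the Q/K/V projections map the
    # scale-specific dimension d_in to the shared hidden size d_model.
    def __init__(self, d_in, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.h, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_in, d_model)
        self.w_k = nn.Linear(d_in, d_model)
        self.w_v = nn.Linear(d_in, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(              # FFN(x) = max(0, x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, length, d_in)
        b, l, _ = x.shape
        split = lambda t: t.view(b, l, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(b, l, -1)
        return self.ffn(self.w_o(out))         # (batch, length, d_model)

# Original, low-dimensional and high-dimensional picture features with different
# lengths and last dimensions (small illustrative sizes, not real feature maps).
feats = [torch.randn(2, 576, 768), torch.randn(2, 144, 16), torch.randn(2, 36, 64)]
branches = [ScalingEncoderLayer(d_in=f.shape[-1]) for f in feats]
hidden = [layer(f) for layer, f in zip(branches, feats)]
# every element of hidden now ends with dimension 512 and can be fused later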
The invention has the following beneficial effects and advantages:
1. The invention uses a feature pyramid to extract richer picture feature information at different scales, so that the encoder captures picture semantic information of different scales;
2. By arranging multi-head self-attention with dimension normalization in the first layer, the encoder effectively converts the picture features of different scales to a consistent dimensionality, which facilitates feature fusion;
3. The invention adds only one feature pyramid structure and two additional sub-layer structures in the first layer of the encoder, and improves the quality of the generated picture captions while hardly increasing the model parameters.
Drawings
FIG. 1 is a diagram of a Vision Transformer prototype model according to the present invention;
FIG. 2 is a diagram of feature pyramid extracted picture features in the present invention;
FIG. 3 is a diagram of a feature pyramid based Vision Transformer refinement model in accordance with the present invention.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention provides a method for generating a picture subtitle by applying a feature pyramid. The pyramid structure changes the dimensions of the features, so that the features cannot be fused directly. On the basis of Vision Transformer, the invention scales the picture features to a consistent size by arranging extra coding layers in the first encoder layer, so as to facilitate feature fusion.
As shown in fig. 3, the present invention provides a method for generating a subtitle using a feature pyramid, including the following steps:
1) inputting the preprocessed picture into the feature pyramid module and extracting features of the picture through a multilayer convolutional neural network; the feature maps of a low-level convolutional layer and of the top-level convolutional layer are extracted as picture feature information at two scales, and together with the preprocessed picture they are called picture features of three different scales: low dimension, high dimension and original dimension;
2) converting the original-level picture features into vector representation through an embedding layer
3) Sending the image characteristics of three different scales into a first layer of an encoder, and carrying out dimension scaling, namely scaling into hidden layer information with the same dimension size through a self-attention mechanism and a feedforward neural network;
4) sending hidden layer information with consistent dimension size to a stacked high layer in an encoder to obtain three encoder characteristics, and obtaining fused picture characteristics by splicing the three encoder characteristics
5) And sending the fused picture characteristics into a decoder of the model for decoding, decoding the picture characteristics into picture subtitles by the decoder through stacked decoder layers, performing gradient updating through a cross entropy loss function, and optimizing the weight of the model.
In the step 1), preprocessing picture data, inputting pictures into a feature pyramid module, and extracting features through a multilayer convolutional neural network. The calculation mode of the convolutional neural network is as follows:
out(x, i) = b_i + Σ_{k=1..C_in} weight(i, k) ⋆ input(x, k)
weight(i, j) = w[:, :, i, j]
input(x, k) = x[:, :, k]
where x ∈ R^(H×W×C) is the tensor representation of the picture in the computer, H is the height of the picture, W its width and C its number of channels; ⋆ denotes the 2D cross-correlation operation; w ∈ R^(K_h×K_w×C_out×C_in) is the convolution kernel, where C_out is the number of channels of the output features, C_in the number of channels of the input features, and K_h×K_w the kernel size; w[:, :, i, j] takes the i-th and j-th slices of w along its third and fourth dimensions, x[:, :, k] takes the k-th slice of x along its third dimension, and b denotes a bias constant; weight(i, j) represents the j-th convolution kernel of channel i in the convolutional neural network, and input(x, k) represents the tensor of the k-th channel of the input x.
Taking an input (batch, h, w, C_in) = (16, 384, 384, 3) as an example, the pyramid structure contains multiple layers of convolutional neural networks. Each convolution calculation produces a new picture feature map (batch, h', w', C_out), e.g. (16, 192, 192, 16); C_out of each convolutional layer becomes larger, while h' and w' become smaller as the number of layers increases. The output of the first convolutional layer is taken as the low-dimensional picture feature, and the output of the last convolutional layer as the high-dimensional picture feature. The original picture is referred to as the original-level picture feature.
In step 2), the height and width of the original-level picture features are adjusted to a specified size and then divided into sub-pictures with fixed sizes, each sub-picture is called a patch, and then the sub-pictures are sent to an embedding layer to obtain a code for each patch, namely a picture embedding vector, specifically:
301) Taking an input (batch, c, h, w) with a patch resolution of p×p as an example, the division into sub-pictures proceeds as follows: each original-level picture is first cut into (h/p)*(w/p) small blocks, i.e. from (batch, c, p*(h/p), p*(w/p)) to (batch, c, (h/p)*(w/p), p*p), and then converted into (batch, (h/p)*(w/p), p*p*c), which is equivalent to dividing it into (h/p)*(w/p) patches, each of dimension p*p*c; this process can be implemented by the following calculation:
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
where the rearrange function is an operator of the einops library, p1 and p2 are the patch size, c is the number of channels, b is the batch size, and h and w are the height and width of the image respectively.
302) And after the sub-pictures are divided, obtaining an embedded vector of the original-level picture characteristics, and adjusting the dimensionality of the embedded vector, namely adjusting the dimensionality to a required size through a full-connection layer.
In step 3), the picture features with different dimensions are sent to the first layer of the encoder, and the first layer of the encoder is composed of three coding layers with different dimensions and respectively corresponds to the three picture features. Wherein the coding layer is composed of a self-attention mechanism and a feedforward neural network. The calculation method of the multi-head self-attention mechanism is as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where Q, K, V are the input vectors of the model, head_i is the vector of the i-th head, the W matrices are model parameters, Attention(·) is the attention mechanism function, and Concat(·) is the vector concatenation function;
the feedforward neural network layer FFN is calculated as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where x is the hidden layer vector and W_1, W_2, b_1, b_2 are model parameters obtained automatically through training;
401) Take the original picture features (batch, l_0, d_0), the low-dimensional picture features (batch, h_1, w_1, d_1) and the high-dimensional picture features (batch, h_2, w_2, d_2) as an example. The low-dimensional and high-dimensional picture features are reshaped to (batch, h_1*w_1, d_1) and (batch, h_2*w_2, d_2). Sending the three picture features into the multi-head self-attention mechanism yields picture features (batch, l_0, d), (batch, l_1, d) and (batch, l_2, d) with consistent dimensions.
402) The three picture features are then sent into the feedforward neural network, which again yields hidden layer information with the same third dimension.
In step 4), the hidden layer information with consistent dimensions is sent to the stacked higher layers of the encoder to obtain three encoder features. Finally, the three encoder features are spliced to obtain a fused picture feature of greater length.
And 5) sending the fused picture characteristics into a decoder, and decoding the picture characteristics into picture subtitles through stacked decoder layers, wherein the decoder layers are composed of a multi-head self-attention mechanism and a feedforward neural network. And the whole neural network carries out gradient updating through a cross entropy loss function, and the weight of the model is optimized.
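Steps 4) and 5) can be sketched as follows; the use of torch.nn.TransformerDecoder as the stacked decoder layers, the vocabulary size and all tensor shapes are assumptions for illustration, not the exact configuration of the invention:

import torch
import torch.nn as nn

d_model, vocab = 512, 10000                      # assumed hidden size and vocabulary size
enc_orig = torch.randn(2, 576, d_model)          # encoder features, original scale
enc_low = torch.randn(2, 144, d_model)           # encoder features, low-dimensional scale
enc_high = torch.randn(2, 36, d_model)           # encoder features, high-dimensional scale

# step 4: splice the three encoder features along the length axis
memory = torch.cat([enc_orig, enc_low, enc_high], dim=1)    # fused picture features

# step 5: stacked decoder layers (multi-head attention + feedforward network)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
to_vocab = nn.Linear(d_model, vocab)

caption_emb = torch.randn(2, 20, d_model)                   # embedded caption prefix
target_ids = torch.randint(0, vocab, (2, 20))               # gold caption token ids

logits = to_vocab(decoder(caption_emb, memory))             # (2, 20, vocab)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab), target_ids.reshape(-1))
loss.backward()                                             # gradient update of the model weights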
In order to verify the effectiveness of the method, the picture subtitle generating method applying the feature pyramid was applied to the picture caption generation task of MSCOCO2017. The MSCOCO2017 data set contains about 120,000 pictures, each with 5 corresponding caption annotations. As can be seen from Table 1, compared with the original Vision Transformer model, the BLEU value of the captions generated by the model is clearly improved, which shows that the method provided by the invention can effectively improve the quality of picture caption generation.
Table 1
Model                                   MSCOCO2017 (BLEU)
Original Vision Transformer             37.9
Feature pyramid Vision Transformer      39.6
According to the invention, before the picture features are sent to the encoder, information at different scales is extracted through the feature pyramid, and the semantic expression capability of the picture is enhanced from different angles and different visual-field distances, so that the information contained in the picture is extracted more effectively. Meanwhile, the invention adopts an encoder that shares parameters, which effectively reduces the calculation cost of the self-attention mechanism and the feedforward neural network in the encoder.

Claims (4)

1. A method for generating a picture subtitle by applying a feature pyramid, characterized by comprising the following steps:
1) inputting the preprocessed picture into a feature pyramid module, and extracting features of the picture through the multilayer convolutional neural network in the feature pyramid module; the feature maps of a low-level convolutional layer and of the top-level convolutional layer are extracted as picture feature information at two scales, low-dimensional and high-dimensional, and together with the preprocessed picture they are called picture features of three different scales: low dimension, high dimension and original dimension;
2) sending the original dimensional picture characteristics into an embedding layer to be converted into vector representation;
3) sending the three picture characteristics with different dimensions into a first layer of an encoder, and carrying out dimension scaling, namely scaling the picture characteristics with different dimensions into hidden layer information with the same dimension through a self-attention mechanism and a feedforward neural network;
4) hidden layer information with consistent dimensions is sent to a high layer stacked in an encoder to obtain three encoder characteristics, and fused picture characteristics are obtained through splicing operation;
5) and sending the fused picture characteristics into a decoder of the model for decoding, decoding the picture characteristics into picture subtitles by the decoder through stacked decoder layers, performing gradient updating through a cross entropy loss function, and optimizing the weight of the model to obtain the picture subtitle generating method.
2. The method for generating a picture subtitle using a feature pyramid as claimed in claim 1, wherein: in the step 1), preprocessing image data, inputting the image into a characteristic pyramid module, and performing characteristic extraction through a multilayer convolutional neural network, wherein the convolutional neural network is calculated in the following way:
out(x, i) = b_i + Σ_{k=1..C_in} weight(i, k) ⋆ input(x, k)
weight(i, j) = w[:, :, i, j]
input(x, k) = x[:, :, k]
where x ∈ R^(H×W×C) is the tensor representation of the picture in the computer, H is the height of the picture, W its width and C its number of channels; ⋆ denotes the 2D cross-correlation operation; w ∈ R^(K_h×K_w×C_out×C_in) is the convolution kernel, where C_out is the number of channels of the output features, C_in the number of channels of the input features, and K_h×K_w the kernel size; w[:, :, i, j] takes the i-th and j-th slices of w along its third and fourth dimensions, x[:, :, k] takes the k-th slice of x along its third dimension, and b denotes a bias constant; weight(i, j) represents the j-th convolution kernel of channel i in the convolutional neural network, and input(x, k) represents the tensor of the k-th channel of the input x,
taking the output of the first layer of convolutional neural network as a low-dimensional picture characteristic, and taking the output of the last layer of convolutional neural network as a high-dimensional picture characteristic; the original picture is used as the original dimension picture characteristic.
3. The method for generating a picture subtitle using a feature pyramid as claimed in claim 1, wherein: step 2) sending the original dimension picture characteristics to an embedding layer to be converted into vector representation, namely, adjusting the height and width of the original dimension picture characteristics to a specified size, dividing the original dimension picture characteristics into sub-pictures with fixed sizes, wherein each sub-picture is called a patch, and then sending the sub-pictures to the embedding layer to obtain the code of each patch, namely, a picture embedding vector, and specifically, the method comprises the following steps:
201) the picture (batch, c, h, w) is divided into sub-pictures with resolution p1×p2: each original-dimension picture is first cut into (h/p1)*(w/p2) small blocks, i.e. from (batch, c, p1*(h/p1), p2*(w/p2)) to (batch, c, (h/p1)*(w/p2), p1*p2), and then converted into (batch, (h/p1)*(w/p2), p1*p2*c), which is equivalent to dividing it into (h/p1)*(w/p2) patches, each of dimension p1*p2*c; this process is implemented by the following calculation:
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p1, p2=p2)
where the rearrange function is an operator of the einops library, p1 and p2 are the patch size, c is the number of channels, b is the batch size, and h and w are the height and width of the image respectively;
202) after the sub-pictures are divided, the embedding vector of the original-dimension picture feature is obtained, and its dimension is adjusted to the required size through one fully connected layer.
4. The method for generating a picture subtitle using a feature pyramid as claimed in claim 1, wherein: in step 3), the picture features with different dimensions are sent into the first layer of the encoder, which is composed of three coding layers of different dimensions corresponding respectively to the three picture features; each coding layer is composed of a self-attention mechanism and a feedforward neural network, and the multi-head self-attention mechanism is calculated as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where Q, K, V are the input vectors of the model, head_i is the vector of the i-th head, the W matrices are model parameters, Attention(·) is the attention mechanism function, and Concat(·) is the vector concatenation function;
the feedforward neural network layer FFN is calculated as follows:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where x is the hidden layer vector and W_1, W_2, b_1, b_2 are model parameters obtained automatically through training;
301) (batch, l_0, d_0) is the original picture feature, (batch, h_1, w_1, d_1) the low-dimensional picture feature and (batch, h_2, w_2, d_2) the high-dimensional picture feature; the low-dimensional and high-dimensional picture features are reshaped to (batch, h_1*w_1, d_1) and (batch, h_2*w_2, d_2); sending the three picture features into the multi-head self-attention mechanism yields picture features (batch, l_0, d), (batch, l_1, d) and (batch, l_2, d) with a consistent last dimension d;
302) the three picture features are then sent into the feedforward neural network, which again yields hidden layer information with the same third dimension.
CN202210233662.8A 2022-03-10 2022-03-10 Picture subtitle generation method applying feature pyramid Active CN114782848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210233662.8A CN114782848B (en) 2022-03-10 2022-03-10 Picture subtitle generation method applying feature pyramid


Publications (2)

Publication Number Publication Date
CN114782848A true CN114782848A (en) 2022-07-22
CN114782848B CN114782848B (en) 2024-03-26

Family

ID=82424138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210233662.8A Active CN114782848B (en) 2022-03-10 2022-03-10 Picture subtitle generation method applying feature pyramid

Country Status (1)

Country Link
CN (1) CN114782848B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
US20210089807A1 (en) * 2019-09-25 2021-03-25 Samsung Electronics Co., Ltd. System and method for boundary aware semantic segmentation
CN113159034A (en) * 2021-04-23 2021-07-23 杭州电子科技大学 Method and system for automatically generating subtitles by using short video
CN113378973A (en) * 2021-06-29 2021-09-10 沈阳雅译网络技术有限公司 Image classification method based on self-attention mechanism
CN113657124A (en) * 2021-07-14 2021-11-16 内蒙古工业大学 Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
严娟; 方志军; 高永彬: "3D object detection combining mixed-domain attention and dilated convolution" (结合混合域注意力与空洞卷积的3维目标检测), Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) *
杜海骏; 刘学亮: "Image caption generation method incorporating constraint learning" (融合约束学习的图像字幕生成方法), Journal of Image and Graphics (中国图象图形学报), no. 02, 16 February 2020 (2020-02-16) *

Also Published As

Publication number Publication date
CN114782848B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN113378973B (en) Image classification method based on self-attention mechanism
CN113934890B (en) Method and system for automatically generating scene video by characters
CN114170174B (en) CLANet steel rail surface defect detection system and method based on RGB-D image
CN113159034A (en) Method and system for automatically generating subtitles by using short video
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN111861945A (en) Text-guided image restoration method and system
Kang et al. Ddcolor: Towards photo-realistic image colorization via dual decoders
Li et al. Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis
CN113344036A (en) Image description method of multi-mode Transformer based on dynamic word embedding
Ma et al. Latte: Latent diffusion transformer for video generation
Sarhan et al. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks
CN115601562A (en) Fancy carp detection and identification method using multi-scale feature extraction
CN114780775A (en) Image description text generation method based on content selection and guide mechanism
Fang et al. Sketch assisted face image coding for human and machine vision: a joint training approach
CN110633706A (en) Semantic segmentation method based on pyramid network
CN114782848B (en) Picture subtitle generation method applying feature pyramid
CN116862949A (en) Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN116823600A (en) Scene text image reconstruction method integrating semantic priori and weighting loss
CN116244464A (en) Hand-drawing image real-time retrieval method based on multi-mode data fusion
CN113869154B (en) Video actor segmentation method according to language description
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN112446372B (en) Text detection method based on channel grouping attention mechanism
Bae et al. IPSILON: incremental parsing for semantic indexing of latent concepts
Wang et al. Self-prior guided pixel adversarial networks for blind image inpainting
CN117710986B (en) Method and system for identifying interactive enhanced image text based on mask

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant