CN113535999B - Diversified image description sentence generation technology based on deep learning - Google Patents

Diversified image description sentence generation technology based on deep learning

Info

Publication number
CN113535999B
CN113535999B CN202110758735.0A CN202110758735A
Authority
CN
China
Prior art keywords
image
style
model
image description
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110758735.0A
Other languages
Chinese (zh)
Other versions
CN113535999A (en)
Inventor
任磊
孟子豪
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110758735.0A priority Critical patent/CN113535999B/en
Publication of CN113535999A publication Critical patent/CN113535999A/en
Application granted granted Critical
Publication of CN113535999B publication Critical patent/CN113535999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a diversified image description sentence generation technology based on deep learning, belonging to the technical field of image description sentence generation. The technology addresses the problems of conventional image description sentence generation, in which the generated sentences are monotonous and image details are broadly ignored, and is suitable for generating diversified description sentences for an image.

Description

Diversified image description sentence generation technology based on deep learning
Technical field:
The invention relates to a diversified image description sentence generation technology based on deep learning and belongs to the technical field of image description sentence generation.
Background art:
Currently, image description sentence generation techniques based on deep learning are generally implemented by constructing an "encoder-decoder" model. The encoder converts the digital matrix of an image into a high-dimensional feature code rich in semantic information and is usually realized with a residual model based on a convolutional neural network. The decoder decodes these high-dimensional features and feeds the semantic information into a text generation model to obtain descriptive sentences; decoders for text generation generally adopt one of two structures, based either on a long short-term memory network or on the self-attention mechanism of a Transformer. The model loss is typically computed against multiple reference sentences using cross entropy as the loss function. However, a model trained in this way tends to generate overly broad sentence descriptions at inference time: for two similar scenes whose images differ in detail, the model tends to produce partially summarizing, broad sentences, so the details in the images are ignored and the requirement of generating diverse description sentences cannot be met.
Accordingly, based on the above-mentioned problems, the present invention has devised a diversified image description sentence generation technique based on deep learning.
Summary of the invention:
The invention aims to provide a diversified image description sentence generation technology based on deep learning, which addresses the problems of conventional image description sentence generation techniques, namely that the generated sentences are monotonous and image details are broadly ignored, and which is suitable for generating diversified description sentences for an image.
Convolutional neural networks are widely used in various computer vision tasks. A deep convolutional neural network performs feature extraction on an image and is typically built from convolution layers (Convolution Layer) and pooling layers (Pooling Layer).
Pictures are typically stored in a computer as digital matrices in which each element represents the content information of the corresponding position of the picture. The number of matrices and the relationship between values at the same position in different matrices depend on the color type of the picture. For such a digital matrix, the convolutional neural network sets several convolution kernels (filters) in each convolution layer to detect specific features, mapping the original input picture to a high-dimensional feature space and forming an output matrix.
The output matrix of the convolution layer is passed to the pooling layer for a "subsampling" operation, which reduces the matrix size. The pooling layer divides the convolution layer's output matrix into regions, applies a nonlinear pooling function to extract the relative positions of different features in the convolution output matrix, and splices the outputs together to form the pooling layer output.
In general, the convolution layer and the pooling layer are stacked and reused, continually shrinking the originally large numerical matrix of the input picture and thereby extracting the specific features. However, as the network grows deeper, problems such as vanishing and exploding gradients limit the training performance of the network. To address this, the deep residual network was proposed. The deep residual network also stacks multiple convolution-pooling layers, but a skip connection (Skip Connection) structure is arranged at each subunit to alleviate the vanishing-gradient problem caused by increased network depth.
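The skip-connection idea can be illustrated with a minimal residual block sketch in Python (PyTorch); the class name, channel count and layer sizes below are illustrative assumptions, not the exact ResNet-101 sub-units used by the invention:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal convolution + skip-connection unit (illustrative only)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: add the input back so gradients can bypass the convolutions.
        return self.relu(out + x)

# Example: a 64-channel feature map passes through the block with its shape unchanged.
x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```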
In the invention, a ResNet-101 model, a standard configuration of the deep residual network, is used to extract the features of the image. As shown in fig. 1, the image file is preprocessed and then passed through the deep convolutional neural network to form the high-dimensional semantic feature code of the image.
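As a hedged sketch of this feature-extraction step, a pretrained ResNet-101 from torchvision can be truncated before its classification head so that it outputs a grid of high-dimensional features; the resize to 224x224 and the ImageNet mean/std constants are assumptions, since the patent only specifies [0, 1] scaling and standardization, and the input file name is hypothetical:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Preprocessing with standard ImageNet statistics (assumed values).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # maps pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
# Drop the average-pooling and fully-connected layers to keep the spatial feature map.
encoder_cnn = nn.Sequential(*list(resnet.children())[:-2]).eval()

image = Image.open("example.jpg").convert("RGB")   # hypothetical input file
with torch.no_grad():
    features = encoder_cnn(preprocess(image).unsqueeze(0))
print(features.shape)  # e.g. torch.Size([1, 2048, 7, 7]) -> 49 regions of 2048-dim features
```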
The image description sentence generation model based on the self-attention mechanism likewise consists of an encoder and a decoder, and its overall structure is shown in fig. 4. The model is trained by supervised learning and can be divided into two parts according to their inputs: the encoder, which takes as input the high-dimensional semantic features of the image extracted by the deep convolutional neural network, and the decoder, which takes as input the embedded sequence of a manually written image description sentence.
On the one hand, the high-dimensional semantic features of the image are directly input into the first coding block in the encoder, and the structure of the coding block is shown in fig. 2. The output of each coding block is the input of the next layer until the sixth coding block gives the output.
On the other hand, the manually written image description sentence is input to the decoder after word embedding and position embedding. The detailed design and key point of the invention are as follows: compared with a traditional decoder, the designed decoder not only adds an encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) sub-layer with respect to the encoder, but also adds a style parameter matrix for image description sentence inputs of different styles.
The technical scheme adopted by the invention is as follows: a diversified image description sentence generation technology based on deep learning, characterized in that it comprises a decoding block consisting of multi-style decoder multi-head self-attention, multi-style encoder-decoder multi-head attention, and a feed-forward neural network. Compared with the encoder, the decoding block adds a multi-style encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) sub-layer and a style parameter matrix for image description sentence inputs of different styles. As shown in fig. 3, the input sequence of each decoding block first enters the decoder multi-head self-attention, where the self-attention mechanism and the "addition and normalization layer" are computed. The resulting output is then fed, together with the output of the top-layer encoder, into the newly added multi-style encoder-decoder multi-head attention;
MultiHead(D, E, E) = Concat(head_1, ..., head_h)  (1)
head_i = Attention(D·W_i^D, E·W_i^E, E·W_i^E)  (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V·W_S  (3)
where E is the output of the top encoder and D is the output of the first "addition and normalization layer" of the decoding block. W_S is a style parameter matrix; as shown in fig. 3, it may be the realistic style matrix W_T, the romantic style matrix W_R, or the humorous style matrix W_H. In each decoding block, three style parameter matrices (realistic, romantic and humorous) are set. The three matrices are randomly initialized, and the overall model is trained with a multi-style image description dataset. Image descriptions of different style types are trained with their respective style parameter matrices, while the other parameters model the general factual descriptions in the text data and are shared across styles. To preserve the distinctiveness of the model's multi-style outputs, the Euclidean distances between the style parameter matrices are maximized during training, so that the image descriptions produced for different styles remain distinct.
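One way to read the distance-maximization requirement is as an auxiliary penalty that pushes the three style parameter matrices apart during training. The Python sketch below is an assumption about how such a regularizer could be written, not the patent's exact objective; the matrix size and the weighting of the term are illustrative:

```python
import itertools
import torch

def style_separation_penalty(style_matrices: list[torch.Tensor]) -> torch.Tensor:
    """Negative mean pairwise Euclidean distance between style matrices.

    Minimizing this penalty maximizes the distances, keeping the realistic,
    romantic and humorous matrices distinct from one another.
    """
    distances = [torch.norm(a - b) for a, b in itertools.combinations(style_matrices, 2)]
    return -torch.stack(distances).mean()

# Illustrative: three randomly initialized style matrices W_T, W_R, W_H.
d_model = 512
W_T, W_R, W_H = (torch.randn(d_model, d_model, requires_grad=True) for _ in range(3))
# In training, this term would be added (with some weight) to the captioning loss.
penalty = style_separation_penalty([W_T, W_R, W_H])
penalty.backward()
print(penalty.item())
```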
The formulas of fig. 2 are as follows:
In each encoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices of the i-th head. LayerNormalization denotes layer normalization. Y_1 is the output of the first addition and normalization layer, Y_2 is the output of the feed-forward neural network, and Y_3 is the output of the second addition and normalization layer. W_1, b_1, W_2 and b_2 are model parameters.
MultiHead(X, X, X) = Concat(head_1, ..., head_h)·W^O  (1)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)  (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V  (3)
Y_1 = LayerNormalization(MultiHead(X, X, X) + X)  (4)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2  (5)
Y_3 = LayerNormalization(Y_2 + Y_1)  (6)
Encoding block first part: equations 1, 2 and 3 are the encoder multi-head self-attention.
Encoding block second part: equation 4 is the first addition and normalization.
Encoding block third part: equation 5 is the feed-forward neural network.
Encoding block fourth part: equation 6 is the second addition and normalization.
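As a minimal sketch, one such encoding block can be written in PyTorch as follows; the library's built-in multi-head attention stands in for the explicit W_i^Q, W_i^K, W_i^V and W^O matrices of equations 1 to 3, and the model width, head count and feed-forward size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoder block: self-attention (eqs. 1-3), add & norm (eq. 4),
    feed-forward network (eq. 5), add & norm (eq. 6)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)          # eqs. 1-3
        y1 = self.norm1(attn_out + x)                  # eq. 4
        y2 = self.ffn(y1)                              # eq. 5
        return self.norm2(y2 + y1)                     # eq. 6

# 49 image regions, each encoded as a 512-dimensional feature vector.
features = torch.randn(1, 49, 512)
print(EncodingBlock()(features).shape)  # torch.Size([1, 49, 512])
```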
The formulas of fig. 3 are as follows:
In each decoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices of the i-th head. E is the output of the sixth-layer encoding block. W_S is a style parameter matrix, one of the true style matrix W_T, the romantic style matrix W_R and the humorous style matrix W_H. D is the output of the first addition and normalization layer, Y_1 is the output of the second addition and normalization layer, Y_2 is the output of the feed-forward neural network, and Y_3 is the output of the third addition and normalization layer. W_1, b_1, W_2 and b_2 are model parameters.
MultiHead(X, X, X) = Concat(head_1, ..., head_h)·W^O  (7)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)  (8)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V·W_S  (9)
D = LayerNormalization(MultiHead(X, X, X) + X)  (10)
MultiHead(D, E, E) = Concat(head_1, ..., head_h)·W^O  (11)
head_i = Attention(D·W_i^Q, E·W_i^K, E·W_i^V)  (12)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V·W_S  (13)
Y_1 = LayerNormalization(MultiHead(D, E, E) + D)  (14)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2  (15)
Y_3 = LayerNormalization(Y_2 + Y_1)  (16)
Decoding block first part: equations 7, 8 and 9 are the multi-style decoder multi-head self-attention.
Decoding block second part: equation 10 is the first addition and normalization.
Decoding block third part: equations 11, 12 and 13 are the multi-style encoder-decoder multi-head attention.
Decoding block fourth part: equation 14 is the second addition and normalization.
Decoding block fifth part: equation 15 is the feed-forward neural network.
Decoding block sixth part: equation 16 is the third addition and normalization.
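The decoding block can be sketched in the same way. In the sketch below, the style parameter matrix W_S is applied as an extra per-style linear map after each attention sub-layer, which is one plausible reading of equations 7 to 16 rather than the patent's exact placement of W_S; dimensions and style names are illustrative assumptions, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

STYLES = ("realistic", "romantic", "humorous")

class MultiStyleDecodingBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One style parameter matrix per style (W_T, W_R, W_H); all other layers are shared.
        self.style_matrices = nn.ParameterDict({
            s: nn.Parameter(torch.eye(d_model) + 0.01 * torch.randn(d_model, d_model))
            for s in STYLES
        })
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor, style: str) -> torch.Tensor:
        w_s = self.style_matrices[style]
        sa, _ = self.self_attn(x, x, x)                # eqs. 7-9: multi-style self-attention
        d = self.norm1(sa @ w_s + x)                   # eq. 10: first add & norm
        ca, _ = self.cross_attn(d, enc_out, enc_out)   # eqs. 11-13: encoder-decoder attention
        y1 = self.norm2(ca @ w_s + d)                  # eq. 14: second add & norm
        y2 = self.ffn(y1)                              # eq. 15: feed-forward network
        return self.norm3(y2 + y1)                     # eq. 16: third add & norm

tokens = torch.randn(1, 12, 512)      # embedded caption prefix
enc_out = torch.randn(1, 49, 512)     # output of the top encoding block
print(MultiStyleDecodingBlock()(tokens, enc_out, "romantic").shape)
```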
The invention has the beneficial effects that: whereas conventional deep-learning-based image description generation tends to produce broadly generalized descriptions, the present technology generates three styles of sentence descriptions for an image: realistic, romantic and humorous.
At the model level: compared with traditional image description, the realistic, romantic and humorous sentence descriptions are generated by a single unified model, and the three styles of descriptions share part of the model's network parameters. This prevents the model from overfitting to a single style of description, makes the learned model more general, and effectively improves the utilization of the model's network parameters.
At the description generation effect level: the generated description sentences follow natural-language grammar and are highly readable, the generated image descriptions accurately capture the details of the image, and the image is described in multiple styles, giving the descriptions richness.
Description of the drawings:
in order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a general flow chart of image description statement generation.
Fig. 2 is an internal structural diagram of a single encoder block of fig. 4.
Fig. 3 is an internal structural diagram of the multi-style decoding block of fig. 4.
Fig. 4 is a diagram of a model structure generated based on image description statements of a self-attention mechanism.
In the drawings:
the specific embodiment is as follows:
In order to make the technical problems to be solved, the technical solutions and the beneficial effects clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1:
FIG. 1 is a general flow chart of image description statement generation. The specific steps are as follows:
1) A real-world image file is acquired.
2) The image file is first matrixed. In the matrix, each element represents the content information of the corresponding position of the picture; the number of matrices and the relationship between values at the same position in different matrices depend on the color type of the picture.
3) To accelerate the convergence of the image description generation model, the matrixed image file is mapped to [0, 1] and standardized (see the preprocessing sketch after this list).
4) The standardized image matrix is input into the deep convolutional neural network.
5) High-dimensional semantic features of the image are obtained through the multi-level feature extraction of the deep convolutional neural network.
6) The high-dimensional semantic features of the image are input into the encoder, and more abstract depth image semantic features are obtained through multi-level encoding.
7) The depth image semantic features are input to the multi-style decoder; the Euclidean distances between the designed multi-style parameter matrices are maximized to ensure the distinctiveness of the model's multi-style outputs, while the remaining parameters are shared so as to model the general factual descriptions in the text data.
8) The effect of generating multi-style image descriptions is achieved by switching the parameter matrix of the multi-style decoder.
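Steps 2) to 4) can be illustrated with a small preprocessing sketch in Python; the per-channel standardization shown here is one plausible reading of the description, and the file name is hypothetical:

```python
import numpy as np
from PIL import Image

def preprocess_image(path: str) -> np.ndarray:
    """Matrixize an image file, map it to [0, 1] and standardize per channel."""
    matrix = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)  # step 2: H x W x 3 matrix
    matrix /= 255.0                                                          # step 3: map to [0, 1]
    mean = matrix.mean(axis=(0, 1), keepdims=True)
    std = matrix.std(axis=(0, 1), keepdims=True) + 1e-6
    return (matrix - mean) / std                                             # step 3: standardization

# matrix = preprocess_image("street_scene.jpg")  # hypothetical file from step 1
```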
Fig. 1 is divided into 8 parts in total. The encoder of the sixth part in fig. 1 consists of the encoding blocks 1-6 in fig. 4, and the multi-style decoder of the seventh part in fig. 1 consists of the multi-style decoding blocks 1-6 in fig. 4.
The internal structure of each encoding block, shown in fig. 2, consists of four parts: encoder multi-head self-attention, addition and normalization, feed-forward neural network, and addition and normalization.
For the internal structure of each decoding block, the method is composed of six parts including multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder decoder multi-head self-attention, addition and normalization, feedforward neural network and addition and normalization, as shown in fig. 3. Wherein the multi-headed self-attention portion of the decoded block contains parameter matrices of three different styles.
Typically, 6 or 12 coding blocks can be taken, and stacked deep coding blocks and decoding blocks facilitate deep semantic features of the model extraction of richer images and text.
Fig. 4: fig. 4 is an internal structural view of the 5th to 7th parts of fig. 1.
It comprises the following: 1) The high-dimensional semantic features of the image pass through encoding blocks 1 to 6 to obtain more abstract depth image semantic features. These depth semantic features are input to the "multi-style encoder-decoder multi-head attention" portion of the multi-style decoding blocks. 2) In the training of fig. 4, since multi-style image description sentences cannot be input into the model directly, they are first converted into dense feature vectors through word embedding. Because the model cannot directly learn the order of the input multi-style image description sentence, a position embedding module is added. The word embedding and the position embedding of the multi-style image description sentence are added together and then input to the multi-style decoder. 3) Finally, the predicted multi-style image description sentence is obtained through the stacked multi-style decoding blocks, a linear layer mapping, and a softmax layer.
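The caption-side input path and the output head described above can be sketched as follows; the vocabulary size, maximum sequence length and the use of a learned position embedding are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 10000, 50, 512

word_embed = nn.Embedding(vocab_size, d_model)       # dense vectors for caption tokens
pos_embed = nn.Embedding(max_len, d_model)           # learned position embedding
output_head = nn.Linear(d_model, vocab_size)         # maps decoder output back to the vocabulary

token_ids = torch.randint(0, vocab_size, (1, 12))    # a manually written, tokenized caption
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
decoder_input = word_embed(token_ids) + pos_embed(positions)   # word + position embedding, added

# ... stacked multi-style decoding blocks would transform decoder_input here ...
logits = output_head(decoder_input)
probs = torch.softmax(logits, dim=-1)                # softmax layer: next-word distribution
print(probs.shape)                                   # torch.Size([1, 12, 10000])
```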
The foregoing description is merely illustrative of specific embodiments of this invention, and is not intended to limit the spirit of the invention, since modifications and variations of the specific embodiments described above will become apparent to those skilled in the art in light of the disclosure herein, without departing from the spirit and scope of the invention.

Claims (2)

1. A diversified image description sentence generation method based on deep learning is characterized in that: 1) Acquiring an image file of a real world; 2) For an image file, firstly, matrixing is carried out, and each element in the matrix represents content information of a corresponding position of a picture; the number of the matrixes and the relation of the numerical values of the same positions of different matrixes depend on the color type of the picture; 3) In order to accelerate the convergence speed of the image description generation model, mapping data between [0-1] and standardization are carried out on the matrixed image description file; 4) The standardized image matrix is input into a deep convolutional neural network; 5) The high-dimensional semantic features of the image are obtained through multi-level feature extraction of the deep convolutional neural network; 6) Inputting the high-dimensional semantic features of the image into an encoder, and obtaining more abstract depth image semantic features through multi-level encoding; 7) The depth image semantic features are input to a multi-style decoder, the Euclidean distance between parameter matrixes is maximized through the designed multi-style parameter matrixes to ensure the difference of multi-style output of the model, and other parameters except the parameter matrixes are shared so as to model general fact description in text data; 8) The effect of generating the multi-style image description is achieved by changing the parameter matrix of the multi-style decoder.
2. The deep learning-based diversified image description sentence generation method as claimed in claim 1, characterized by comprising the steps of: 6 or 12 coding blocks can be taken, and the stacked deep coding blocks and decoding blocks are beneficial to the model to extract deep semantic features of richer images and texts; the decoding block comprises six parts of multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder decoder multi-head self-attention, addition and normalization, feedforward neural network and addition and normalization; wherein the multi-headed self-attention portion of the decoded block contains parameter matrices of three different styles.
CN202110758735.0A 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning Active CN113535999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Publications (2)

Publication Number Publication Date
CN113535999A CN113535999A (en) 2021-10-22
CN113535999B true CN113535999B (en) 2023-05-26

Family

ID=78126779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110758735.0A Active CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Country Status (1)

Country Link
CN (1) CN113535999B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988274B (en) * 2021-11-11 2023-05-12 电子科技大学 Text intelligent generation method based on deep learning
CN114511860B (en) * 2022-04-19 2022-06-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10360466B2 (en) * 2016-12-27 2019-07-23 Facebook, Inc. Systems and methods for image description generation
US11250252B2 (en) * 2019-12-03 2022-02-15 Adobe Inc. Simulated handwriting image generator

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image description model based on adaptive correction of attention features; Wei Renyu; Meng Zuqiang; Computer Applications (S1); full text *

Also Published As

Publication number Publication date
CN113535999A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN110209801B (en) Text abstract automatic generation method based on self-attention network
CN113535999B (en) Diversified image description sentence generation technology based on deep learning
CN109522403B (en) Abstract text generation method based on fusion coding
CN107480206A (en) A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN112084841B (en) Cross-mode image multi-style subtitle generating method and system
CN112233012B (en) Face generation system and method
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN112070114A (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN110175248A (en) A kind of Research on face image retrieval and device encoded based on deep learning and Hash
CN114549850A (en) Multi-modal image aesthetic quality evaluation method for solving modal loss problem
CN111985470A (en) Ship board correction and identification method in natural scene
CN115082693A (en) Multi-granularity multi-mode fused artwork image description generation method
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
CN113140023A (en) Text-to-image generation method and system based on space attention
CN112380843B (en) Random disturbance network-based open answer generation method
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN111522923B (en) Multi-round task type dialogue state tracking method
KR102562386B1 (en) Learning method for image synthesis system
CN117093692A (en) Multi-granularity image-text matching method and system based on depth fusion
CN115860054B (en) Sparse codebook multiple access coding and decoding system based on generation countermeasure network
CN115331073A (en) Image self-supervision learning method based on TransUnnet architecture
CN114116960A (en) Federated learning-based joint extraction model construction method and device
CN113377908B (en) Method for extracting aspect-level emotion triple based on learnable multi-word pair scorer
Grassucci et al. Enhancing Semantic Communication with Deep Generative Models--An ICASSP Special Session Overview

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant