CN113535999A - Diversified image description sentence generation technology based on deep learning - Google Patents

Diversified image description sentence generation technology based on deep learning

Info

Publication number
CN113535999A
CN113535999A CN202110758735.0A CN202110758735A CN113535999A
Authority
CN
China
Prior art keywords
image
model
decoder
image description
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110758735.0A
Other languages
Chinese (zh)
Other versions
CN113535999B (en)
Inventor
任磊
孟子豪
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110758735.0A priority Critical patent/CN113535999B/en
Publication of CN113535999A publication Critical patent/CN113535999A/en
Application granted granted Critical
Publication of CN113535999B publication Critical patent/CN113535999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a diversified image description sentence generation technology based on deep learning, belonging to the technical field of image description sentence generation. The technology addresses two shortcomings of traditional image description sentence generation, namely that the generated sentences are monotonous and that image details are widely ignored, and is suitable for generating diversified descriptive sentences from an image.

Description

Diversified image description sentence generation technology based on deep learning
Technical field:
The invention relates to a diversified image description sentence generation technology based on deep learning, and belongs to the technical field of image description sentence generation.
Background art:
Currently, image description sentence generation based on deep learning is generally realized by constructing an "encoder-decoder" model. The encoder converts the numerical matrix of an image into a high-dimensional feature code rich in semantic information, and is typically implemented with a residual model based on a convolutional neural network. The decoder decodes these high-dimensional features and feeds their semantic information into a text generation model to obtain descriptive sentences; decoders commonly adopt one of two structures for text generation, a long short-term memory (LSTM) network or a Transformer-based self-attention mechanism. The model loss is usually computed against multiple reference sentences, with cross entropy as the loss function. However, a model trained this way tends to produce broad, generic sentences at inference time: for two images with similar scenes but different details, it generates the same general description, which ignores the details in the images and cannot satisfy the requirement of generating diversified description sentences.
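For concreteness, the conventional training objective described above can be sketched in PyTorch as follows; the tensor names, shapes and vocabulary size are illustrative assumptions and are not taken from this disclosure.

import torch
import torch.nn.functional as F

# Illustrative shapes only: a batch of 2 captions, 7 tokens each, a vocabulary of 1000 words.
logits = torch.randn(2, 7, 1000, requires_grad=True)   # decoder outputs, one distribution per token position
reference = torch.randint(0, 1000, (2, 7))              # token ids of a human-written reference caption

# Teacher-forced objective: cross entropy averaged over all token positions.
loss = F.cross_entropy(logits.reshape(-1, 1000), reference.reshape(-1))
loss.backward()   # in a real system this gradient would update the encoder-decoder parameters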
In view of the above problems, the present invention provides a diversified image description sentence generation technique based on deep learning.
Summary of the invention:
The invention aims to provide a diversified image description sentence generation technology based on deep learning that solves the problems of traditional techniques, namely that the generated sentences are monotonous and image details are widely ignored, and that is suitable for generating diversified descriptive sentences from an image.
Convolutional neural networks are widely used in computer vision tasks. A deep convolutional neural network extracts image features and is generally built from convolution layers (Convolution Layer) and pooling layers (Pooling Layer).
Pictures are usually stored in a computer as numerical matrices, in which each element represents the content information at the corresponding position of the picture. The number of matrices, and the relationship between the values at the same position in different matrices, depend on the color type of the picture. For this numerical matrix, each convolution layer of the convolutional neural network applies a set of convolution kernels (filters) that detect specific features and map the original input picture into a high-dimensional feature space to form an output matrix.
The output matrix of the convolution layer is handed to the pooling layer for a "downsampling" operation that reduces the matrix size. The pooling layer divides the convolution output matrix into regions, applies a nonlinear pooling function to extract the relative positions of the different features in the convolution output, and concatenates the results to form the pooling layer output.
Generally, convolution and pooling layers are stacked and reused, continuously shrinking the originally large numerical matrix of the input picture so as to extract specific features. However, as the network depth increases, problems such as vanishing and exploding gradients limit training performance. To address this, the deep residual network was proposed. A deep residual network also stacks multiple convolution-pooling layers, but each sub-unit contains a skip connection (Skip Connection) structure to relieve the vanishing-gradient problem caused by the increased network depth.
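A minimal sketch of such a residual sub-unit with a skip connection, assuming PyTorch and illustrative channel and image sizes (not taken from this disclosure), is:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual sub-unit: two convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + x)   # the skip connection lets gradients bypass the convolutions

x = torch.randn(1, 64, 56, 56)        # illustrative feature map
print(ResidualBlock(64)(x).shape)     # torch.Size([1, 64, 56, 56])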
In the invention, a ResNet-101 model, a standard configuration of the residual network, is used to extract image features. As shown in FIG. 1, the image file is preprocessed and then passed through the deep convolutional neural network to form the high-dimensional semantic feature code of the image.
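A hypothetical sketch of using a ResNet-101 backbone as the feature extractor, with the classification head removed so that the spatial feature map is kept, might look as follows (PyTorch/torchvision; the image size and the exact layers kept are assumptions, not the patented implementation):

import torch
import torchvision

# Build a ResNet-101 backbone and drop its final average-pooling and classification layers,
# keeping the spatial feature map that serves as the high-dimensional semantic feature code.
resnet = torchvision.models.resnet101()        # pretrained ImageNet weights would normally be loaded here
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

image = torch.randn(1, 3, 224, 224)            # a preprocessed image matrix (illustrative size)
with torch.no_grad():
    features = backbone(image)
print(features.shape)                          # torch.Size([1, 2048, 7, 7])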
The image description sentence generation model based on the self-attention mechanism likewise follows the encoder-decoder design; the overall structure is shown in FIG. 4. The whole model is trained with supervised learning and can be divided into two parts according to their inputs: an encoder that takes as input the high-dimensional semantic features of the image extracted by the deep convolutional neural network, and a decoder that takes as input the embedded sequence of manually written image description sentences.
On the one hand, the high-dimensional semantic features of the image are fed directly into the first encoding block of the encoder; the structure of an encoding block is shown in fig. 2. The output of each encoding block is the input of the next, and the final output is produced by the sixth encoding block.
On the other hand, a manually written image description sentence is fed into the decoder after word embedding and position embedding. The detailed design and key points of the invention are as follows: compared with a conventional decoder, the decoder designed here not only adds an encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) relative to the encoder, but also adds a style parameter matrix for the input of image description sentences of different styles.
The technical scheme adopted by the invention is as follows. A diversified image description sentence generation technology based on deep learning is characterized by a decoding block comprising a multi-style decoder multi-head attention, a multi-style encoder-decoder multi-head attention and a feedforward neural network; compared with the encoder, the decoding block adds an encoder-decoder multi-head attention (Encoder-Decoder Multi-Head Attention) and a style parameter matrix for the input of image description sentences of different styles. The input sequence of each decoding-block layer first enters the decoding-block multi-head attention shown in fig. 3 and performs the self-attention mechanism and the computation of the "add and normalize" layer. The result is then fed, together with the output of the top-level encoder, into the newly added multi-style encoder-decoder multi-head attention:
Multi_Head(D, E, E) = concat(head_1, ..., head_h)    (1)
head_i = Attention(D·W_i^D, E·W_i^E, E·W_i^E)    (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (3)
Here E is the output of the top-level encoder, and D is the output of the addition and normalization layer applied to the self-attention of this decoding unit. W^S is a style parameter matrix, which may be W^T, the factual style matrix, W^R, the romantic style matrix, or W^H, the humorous style matrix, as shown in fig. 3. In each decoding block, three style parameter matrices (factual, romantic and humorous) can be set. The three matrices are initialized randomly, and the overall model is trained on a multi-style image description dataset. Image descriptions of different style types are trained with different style parameter matrices, while the remaining parameters, which are designed to model the general factual content of the text data, are shared across style types. During training, in order to preserve the difference between the model's multi-style outputs, the Euclidean distances between the style parameter matrices must be maximized, so that the image description outputs remain distinct and unique.
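The exact form of the distance-maximizing term is not given above; one possible formulation, stated here purely as an assumption, is to add the negative sum of pairwise Euclidean distances between the three style matrices to the training loss, as in the following PyTorch sketch (matrix shape is illustrative):

import torch
import torch.nn as nn

def style_diversity_penalty(w_t, w_r, w_h):
    """Negative sum of pairwise Euclidean distances between the three style matrices.
    Adding this term to the training loss pushes the matrices apart, i.e. it maximizes
    the distances, so the factual, romantic and humorous outputs stay distinct."""
    pairs = [(w_t, w_r), (w_t, w_h), (w_r, w_h)]
    return -sum(torch.dist(a, b) for a, b in pairs)

# Three randomly initialised style parameter matrices (the shape is an assumption).
w_true, w_romantic, w_humorous = (nn.Parameter(torch.randn(512, 512)) for _ in range(3))
penalty = style_diversity_penalty(w_true, w_romantic, w_humorous)   # added to the overall loss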
The formulas for fig. 2 are as follows:
In each layer of the encoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices. LayerNormalization denotes layer normalization. Y_1 is the output of the first addition and normalization layer, Y_2 is the output of the feedforward neural network, and Y_3 is the output of the second addition and normalization layer. W_1 and b_1 are model parameters.
MultiHead(X) = concat(head_1, ..., head_h)·W^O    (1)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)    (2)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (3)
Y_1 = LayerNormalization(X + MultiHead(X))    (4)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2    (5)
Y_3 = LayerNormalization(Y_2 + Y_1)    (6)
Encoding block, first part: equations 1, 2 and 3 are the encoder multi-head self-attention.
Encoding block, second part: equation 4 is the first addition and normalization.
Encoding block, third part: equation 5 is the feedforward neural network.
Encoding block, fourth part: equation 6 is the second addition and normalization.
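A compact sketch of one encoding block implementing equations (1)-(6) is given below; it assumes PyTorch and illustrative dimensions (d_model = 512, 8 heads, d_ff = 2048) and is an illustration rather than the exact implementation of the disclosure.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoding block: multi-head self-attention, add & normalize,
    feedforward network, add & normalize (equations (1)-(6))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # equations (1)-(3)
        y1 = self.norm1(x + attn_out)           # equation (4)
        y2 = self.ff(y1)                        # equation (5)
        return self.norm2(y1 + y2)              # equation (6)

features = torch.randn(1, 49, 512)      # e.g. a 7x7 image feature map flattened to 49 positions
print(EncoderBlock()(features).shape)   # torch.Size([1, 49, 512])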
The formulas for FIG. 3 are as follows:
In each layer of the decoding block, let the input of the layer be X. W^O is the multi-head attention fusion matrix, and W_i^Q, W_i^K, W_i^V are the multi-head attention weight matrices. Let E be the output of the sixth encoding block. W^S is a style parameter matrix, which may be W^T, the factual style matrix, W^R, the romantic style matrix, or W^H, the humorous style matrix. D is the output of the first addition and normalization layer, Y_1 is the output of the second addition and normalization layer, Y_2 is the output of the feedforward neural network, and Y_3 is the output of the third addition and normalization layer. W_1 and b_1 are model parameters.
MultiHead(X) = concat(head_1, ..., head_h)·W^O    (7)
head_i = Attention(X·W_i^Q, X·W_i^K, X·W_i^V)    (8)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (9)
D = LayerNormalization(X + MultiHead(X))    (10)
MultiHead(D, E, E) = concat(head_1, ..., head_h)    (11)
head_i = Attention(D·W_i^Q, E·W_i^K, E·W_i^V)    (12)
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (13)
Y_1 = LayerNormalization(D + MultiHead(D, E, E))    (14)
Y_2 = max(0, Y_1·W_1 + b_1)·W_2 + b_2    (15)
Y_3 = LayerNormalization(Y_2 + Y_1)    (16)
Decoding block, first part: equations 7, 8 and 9 are the multi-style decoder multi-head self-attention.
Decoding block, second part: equation 10 is the first addition and normalization.
Decoding block, third part: equations 11, 12 and 13 are the multi-style encoder-decoder multi-head attention.
Decoding block, fourth part: equation 14 is the second addition and normalization.
Decoding block, fifth part: equation 15 is the feedforward neural network.
Decoding block, sixth part: equation 16 is the third addition and normalization.
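The following sketch assembles one multi-style decoding block along the lines of equations (7)-(16). It assumes PyTorch with illustrative dimensions; in particular, the point at which the style parameter matrix W^S is applied (here, after the encoder-decoder attention) and the omission of a causal mask are assumptions, since the disclosure specifies the role of W^S but not a closed formula.

import torch
import torch.nn as nn

class MultiStyleDecoderBlock(nn.Module):
    """One multi-style decoding block along the lines of equations (7)-(16)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One randomly initialised style parameter matrix per style, as described above.
        self.style = nn.ParameterDict({
            name: nn.Parameter(torch.randn(d_model, d_model) * 0.02)
            for name in ("factual", "romantic", "humorous")
        })
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, style="factual"):
        # A causal mask over future tokens is omitted here for brevity.
        sa, _ = self.self_attn(x, x, x)                  # equations (7)-(9)
        d = self.norm1(x + sa)                           # equation (10)
        ca, _ = self.cross_attn(d, enc_out, enc_out)     # equations (11)-(13)
        ca = ca @ self.style[style]                      # style parameter matrix W^S (assumed placement)
        y1 = self.norm2(d + ca)                          # equation (14)
        y2 = self.ff(y1)                                 # equation (15)
        return self.norm3(y1 + y2)                       # equation (16)

tokens = torch.randn(1, 12, 512)   # embedded caption tokens
enc = torch.randn(1, 49, 512)      # output of the top encoding block
print(MultiStyleDecoderBlock()(tokens, enc, style="romantic").shape)   # torch.Size([1, 12, 512])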
The invention has the following beneficial effects. Whereas the traditional deep-learning-based image description generation technology produces broad and general descriptions, with the present invention the description of an image becomes three sentence descriptions: factual, romantic and humorous.
At the model level: compared with traditional image description, the generation of the three sentence descriptions (factual, romantic and humorous) is unified in one model. The three different sentence descriptions share the network parameters of the common part of the model, which prevents the model from over-fitting to the sentence description of a single style, gives the learned model generality, and effectively improves the utilization of the model's network parameters.
At the level of the generated descriptions: the generated description sentences follow natural-language grammar with good readability and grammatical correctness, and the generated image descriptions accurately capture the details of the image, describe the image in multiple styles, and are rich in description.
Description of the drawings:
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior-art description are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general flow diagram of image description sentence generation.
Fig. 2 is an internal structural diagram of a single encoder block in fig. 4.
Fig. 3 is an internal structural diagram of the multi-style decoding block of fig. 4.
Fig. 4 is a diagram of a model structure generated based on an image description statement of the self-attention mechanism.
In the drawings:
the specific implementation mode is as follows:
in order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1:
FIG. 1 is a general flow diagram of image description sentence generation. The specific steps are as follows:
1) An image file of the real world is acquired. 2) The image file is first matrixed; in the matrix, each element represents the content information at the corresponding position of the picture. The number of matrices and the relationship between values at the same position in different matrices depend on the color type of the picture. 3) To accelerate the convergence of the image description generation model, the matrixed image file is mapped to data in [0, 1] and standardized. 4) The standardized image matrix is input to the deep convolutional neural network. 5) The multi-level features of the deep convolutional neural network are extracted to obtain the high-dimensional semantic features of the image. 6) The high-dimensional semantic features of the image are input to the encoder, and more abstract depth image semantic features are obtained through multi-level encoding. 7) The depth image semantic features are input to the multi-style decoder; through the designed multi-style parameter matrices, the Euclidean distance between the parameter matrices is maximized to ensure the difference between the model's multi-style outputs, while the other parameters, which model the general factual descriptions in the text data, are shared. 8) Multi-style image descriptions are generated by switching the parameter matrix of the multi-style decoder.
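Steps 2) to 4) can be sketched, purely as an assumption of one possible preprocessing pipeline (PyTorch/torchvision; the file name, image size and ImageNet statistics are illustrative), as:

from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),                                    # matrixing + mapping of values into [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],          # standardisation (ImageNet statistics assumed)
                std=[0.229, 0.224, 0.225]),
])
image_matrix = preprocess(Image.open("example.jpg").convert("RGB"))   # "example.jpg" is a placeholder path
print(image_matrix.shape)   # torch.Size([3, 224, 224])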
Fig. 1 is divided into 8 parts in total. The encoder in the sixth part of fig. 1 consists of the encoding blocks 1-6 of fig. 4, while the multi-style decoder in the seventh part of fig. 1 consists of the multi-style decoding blocks 1-6 of fig. 4.
The internal structure of each encoding block consists of the four parts shown in figure 2: encoder multi-head self-attention, addition and normalization, feedforward neural network, and addition and normalization.
The internal structure of each decoding block consists of the six parts shown in figure 3: multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder-decoder multi-head attention, addition and normalization, feedforward neural network, and addition and normalization. Among these, the multi-head attention of the decoding block contains three parameter matrices of different styles.
Generally speaking, 6 or 12 encoding blocks can be used; stacking deep encoding and decoding blocks helps the model extract richer deep semantic features of images and texts.
FIG. 2 is an internal structural view of parts 5 to 7 in fig. 1.
The method comprises the following steps: 1) The high-dimensional semantic features of the image are encoded by encoding blocks 1-6 to obtain more abstract depth image semantic features. These depth semantic features are the input of the "multi-style encoder-decoder multi-head attention" part of each multi-style decoding block. 2) During the training of fig. 4, the multi-style image description sentences cannot be input into the model directly, so each is first converted into dense feature vectors by word embedding. Because the model cannot directly learn the order of the input multi-style image description sentences, a position embedding module is added. The word embedding and the position embedding of the multi-style image description sentence are summed and input to the multi-style decoder. 3) Finally, the predicted multi-style image description sentence is obtained through the stacked multi-style decoding blocks, a linear layer mapping and a softmax layer.
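Steps 2) and 3) can be illustrated with the following sketch of word embedding, position embedding and the final linear plus softmax mapping; the vocabulary size, sequence length, model width and the use of a learned position embedding are assumptions for illustration only.

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 10000, 50, 512
word_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)       # learned position embedding (one assumed variant)
output_head = nn.Linear(d_model, vocab_size)

token_ids = torch.randint(0, vocab_size, (1, 12))            # one caption as 12 token ids
positions = torch.arange(12).unsqueeze(0)                    # position indices 0..11
decoder_input = word_emb(token_ids) + pos_emb(positions)     # fed to the multi-style decoder stack

decoder_output = torch.randn(1, 12, d_model)                 # stand-in for the decoding blocks' output
probs = torch.softmax(output_head(decoder_output), dim=-1)   # linear layer + softmax ("soft maximum") layer
print(probs.shape)   # torch.Size([1, 12, 10000])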
The foregoing describes embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the claims appended hereto.

Claims (2)

1. A diversified image description sentence generation method based on deep learning, characterized in that: 1) an image file of the real world is acquired; 2) the image file is first matrixed, each element of the matrix representing the content information at the corresponding position of the picture, and the number of matrices and the relationship between values at the same position in different matrices being determined by the color type of the picture; 3) to accelerate the convergence of the image description generation model, the matrixed image file is mapped to data in [0, 1] and standardized; 4) the standardized image matrix is input to a deep convolutional neural network; 5) multi-level features of the deep convolutional neural network are extracted to obtain high-dimensional semantic features of the image; 6) the high-dimensional semantic features of the image are input to an encoder, and more abstract depth image semantic features are obtained through multi-level encoding; 7) the depth image semantic features are input to a multi-style decoder, where, through the designed multi-style parameter matrices, the Euclidean distance between the parameter matrices is maximized to ensure the difference between the model's multi-style outputs, while the other parameters are shared to model general factual descriptions in the text data; 8) multi-style image descriptions are generated by changing the parameter matrix of the multi-style decoder.
2. The method for generating diversified image description sentences based on deep learning according to claim 1, characterized in that: 6 or 12 encoding blocks can be selected, and stacked deep encoding and decoding blocks help the model extract richer deep semantic features of images and texts; the decoding block comprises six parts, namely multi-style decoder multi-head self-attention, addition and normalization, multi-style encoder-decoder multi-head attention, addition and normalization, feedforward neural network, and addition and normalization; and the multi-head attention of the decoding block contains three parameter matrices of different styles.
CN202110758735.0A 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning Active CN113535999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110758735.0A CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Publications (2)

Publication Number Publication Date
CN113535999A true CN113535999A (en) 2021-10-22
CN113535999B CN113535999B (en) 2023-05-26

Family

ID=78126779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110758735.0A Active CN113535999B (en) 2021-07-05 2021-07-05 Diversified image description sentence generation technology based on deep learning

Country Status (1)

Country Link
CN (1) CN113535999B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988274A (en) * 2021-11-11 2022-01-28 电子科技大学 Text intelligent generation method based on deep learning
CN114511860A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181832A1 (en) * 2016-12-27 2018-06-28 Facebook, Inc. Systems and methods for image description generation
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification
US20210166013A1 (en) * 2019-12-03 2021-06-03 Adobe Inc. Simulated handwriting image generator

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181832A1 (en) * 2016-12-27 2018-06-28 Facebook, Inc. Systems and methods for image description generation
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
CN110991284A (en) * 2019-11-22 2020-04-10 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification
US20210166013A1 (en) * 2019-12-03 2021-06-03 Adobe Inc. Simulated handwriting image generator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韦人予; 蒙祖强: "Image description model based on adaptive correction of attention features", 计算机应用 (Journal of Computer Applications) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988274A (en) * 2021-11-11 2022-01-28 电子科技大学 Text intelligent generation method based on deep learning
CN113988274B (en) * 2021-11-11 2023-05-12 电子科技大学 Text intelligent generation method based on deep learning
CN114511860A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN113535999B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN109522403B (en) Abstract text generation method based on fusion coding
CN113535999B (en) Diversified image description sentence generation technology based on deep learning
CN107480206A (en) A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN112100404A (en) Knowledge graph pre-training method based on structured context information
CN110196928B (en) Fully parallelized end-to-end multi-turn dialogue system with domain expansibility and method
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN115424059B (en) Remote sensing land utilization classification method based on pixel level contrast learning
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN112395842B (en) Long text story generation method and system for improving content consistency
CN111985470A (en) Ship board correction and identification method in natural scene
CN114420107A (en) Speech recognition method based on non-autoregressive model and related equipment
CN113140023A (en) Text-to-image generation method and system based on space attention
CN114969298A (en) Video question-answering method based on cross-modal heterogeneous graph neural network
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
CN112257464B (en) Machine translation decoding acceleration method based on small intelligent mobile equipment
CN111522923B (en) Multi-round task type dialogue state tracking method
CN112380843B (en) Random disturbance network-based open answer generation method
CN113377907B (en) End-to-end task type dialogue system based on memory mask self-attention network
CN114333069B (en) Object posture processing method, device, equipment and storage medium
CN114399646B (en) Image description method and device based on transform structure
CN115331073A (en) Image self-supervision learning method based on TransUnnet architecture
CN115273110A (en) Text recognition model deployment method, device, equipment and storage medium based on TensorRT
CN112100157B (en) Cross-platform multidimensional database architecture design method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant