CN114743056A - Image description generation model based on dynamic early exit, and model training method - Google Patents

Image description generation model based on dynamic early exit, and model training method

Info

Publication number
CN114743056A
CN114743056A (application CN202210439734.4A)
Authority
CN
China
Prior art keywords
layer
decoding
image
prediction
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210439734.4A
Other languages
Chinese (zh)
Inventor
王树徽
闫旭
黄庆明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210439734.4A priority Critical patent/CN114743056A/en
Publication of CN114743056A publication Critical patent/CN114743056A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image description generation model based on dynamic early exit, comprising: a visual encoder for extracting the visual features of an image, comprising a plurality of coding layers connected in series; and a text decoder for decoding the visual features output by the visual encoder and sequentially outputting words from a prediction vocabulary to form the natural language description text of the image, comprising a plurality of decoding layers connected in series, wherein each decoding layer is equipped with a dynamic early-exit decision module and an imitation learning network. Each dynamic early-exit decision module judges, during word prediction, whether the current prediction should exit early according to the word prediction probabilities over the prediction vocabulary, and outputs the word with the maximum probability when that maximum probability exceeds a confidence threshold. Each imitation learning network predicts by imitation, from its input, the hidden state vector that its corresponding decoding layer would output.

Description

Image description generation model based on dynamic early exit, and model training method
Technical Field
The invention relates to the field of multimedia data processing, in particular to image description generation in the multimedia field, and more particularly to an image description generation model based on dynamic early exit and a model training method.
Background
Image description generation technology produces a text description for a given picture. The description consists of several words: an image description generation model recognizes image features and outputs words one by one to form the text. This requires the model not only to recognize the objects contained in the picture, but also to describe the relationships between those objects in natural language. The technology can generate natural language descriptions of everyday natural-scene pictures, conveying visual information in an intuitive textual form, and neatly links computer vision with natural language processing.
Image description generation has wide application scenarios. On e-commerce websites, it can automatically generate title text for product pictures to satisfy users' retrieval needs. In early-childhood education, it can automatically generate description text for pictures and, combined with speech synthesis, support an autonomous "describe the picture" teaching system serving early education and child enlightenment. In the medical field, it can recognize case images and automatically generate medical diagnosis reports, saving radiologists the time spent writing them. In assistance for the blind, it converts visual pictures into text and, together with speech synthesis, widens the perception channels of the visually impaired. However, to be widely used in real production, an image description generation model must overcome the slow generation speed of the description text. Most existing image description generation models use an autoregressive generation strategy: each word of the text is output from left to right only after passing through all decoding layers in turn, which makes generation slow. To accelerate decoding, a non-autoregressive generation strategy was proposed in which all words of the description text are output in parallel. This improves generation speed, but because the words are predicted independently of one another, dependency relationships are not established and the consistency and accuracy of the generated text suffer. Later, an iteratively refined non-autoregressive strategy was proposed, which feeds the whole sentence generated non-autoregressively at the previous step back in as input and refines the text over multiple iterations. Researchers have also proposed a two-stage semi-autoregressive strategy to balance generation quality against generation speed; however, when a different generation speed is required, it cannot be adjusted in real time, and the model must be retrained with different hyper-parameters, which undoubtedly increases training cost. Therefore, accelerating the text generation process in the image description generation task and adjusting the acceleration ratio in real time are important and urgent problems.
Disclosure of Invention
Therefore, an object of the present invention is to overcome the above drawbacks of the prior art and to provide an image description generation model based on dynamic early exit and a model training method, so as to accelerate text generation.
According to a first aspect of the present invention, there is provided an image description generation model based on dynamic early exit, for outputting the natural language description text of an image from an input image. The image description generation model comprises: a visual encoder for extracting the visual features of the image, comprising a plurality of coding layers connected in series; and a text decoder for decoding the visual features output by the visual encoder and sequentially outputting words from a prediction vocabulary to form the natural language description text of the image, comprising a plurality of decoding layers connected in series, each decoding layer being equipped with a dynamic early-exit decision module and an imitation learning network. Each dynamic early-exit decision module judges, during word prediction, whether the current prediction should exit early according to the word prediction probabilities over the prediction vocabulary, and outputs the word with the maximum probability when the maximum of the prediction probabilities of the words in the prediction vocabulary exceeds a confidence threshold; its input is connected to the outputs of the corresponding decoding layer, of all decoding layers before it, and of the imitation learning networks corresponding to all decoding layers after it. The input of each imitation learning network is connected to the outputs of the corresponding decoding layer and of all decoding layers before it, and the network predicts by imitation, from its input, the hidden state vector that the corresponding decoding layer would output.
Preferably, each coding layer comprises a self-attention layer and a feedforward neural network connected in sequence. Each decoding layer comprises a self-attention layer, an encoding-decoding attention layer, and a feedforward neural network connected in sequence. Each dynamic early-exit decision module comprises a shallow feature fusion layer, a deep feature fusion layer, a fusion gating layer, and a classification layer, wherein: the shallow feature fusion layer performs feature fusion on the hidden state vectors of the decoding layer corresponding to the decision module and of all decoding layers before it, to obtain a shallow fused feature vector; the deep feature fusion layer performs feature fusion on the hidden state vectors of the imitation learning networks corresponding to all decoding layers after that decoding layer, to obtain a deep fused feature vector; the fusion gating layer fuses the shallow fused feature vector and the deep fused feature vector to obtain the final fused feature vector; and the classification layer is configured as a fully-connected layer that outputs, from the final fused feature vector, a prediction probability for each word in the prediction vocabulary. In some embodiments of the present invention, the shallow feature fusion layer performs feature fusion in any one of the following ways: splicing, attention weighting, or a time-sequence model. The deep feature fusion layer likewise performs feature fusion by splicing, attention weighting, or a time-sequence model.
Preferably, each imitation learning network comprises a feedforward neural network.
Preferably, the visual encoder comprises 6 encoding layers and the text decoder comprises 6 decoding layers.
According to a second aspect of the present invention, there is provided a method of training an image description generation model according to the first aspect of the invention, the method comprising: S1, acquiring an image set and the natural language description texts corresponding to all images in the image set, combining each image with one sentence of its corresponding natural language description text to form a sample, the samples forming a data set; dividing the data set into a training set and a test set; and forming a prediction vocabulary from the words in the natural language texts corresponding to all images; S2, training the image description generation model to convergence using the training set; and S3, testing the trained image description generation model using the test set, and setting the confidence threshold of the dynamic early-exit decision module corresponding to each decoding layer, so that during word prediction the decision module judges, from the word prediction probabilities over the prediction vocabulary, whether the current prediction should exit early, and outputs the word with the maximum probability when that maximum probability exceeds the confidence threshold.
Preferably, step S1 comprises: S11, acquiring an image set and the natural language description texts corresponding to all images in the image set, and duplicating each image whose corresponding description text contains more than one sentence, according to the number of sentences, so that each sentence of text corresponds to one image and forms an image-text pair, all image-text pairs serving as samples to form a data set; S12, preprocessing all natural language texts corresponding to all images in the image set to extract all words in the texts, counting the frequency of each word, and deleting words occurring less often than a preset frequency, to obtain the prediction vocabulary; and S13, dividing the data set into a training set and a test set.
In some embodiments of the present invention, the model is trained in step S2 by using the following loss function:
L = λ·L_ce + (1 − λ)·L_imit

where λ is a balance factor that adjusts the relative influence of the dynamic early-exit decision modules and the imitation learning networks, L_ce is the cross-entropy loss between the predicted words and the correct words, and L_imit is the imitation loss between the predicted hidden state vectors generated by the imitation learning networks and the true hidden state vectors output by the decoding layers.
Preferably, the cross-entropy loss between the predicted words and the correct words is calculated as:

L_ce = −(1/N) · Σ_{m=1}^{N} log p_m(y_i)

p_m = softmax(z_m)

where N is the total number of decoding layers, y_i denotes the correct word, softmax denotes the activation function, and z_m denotes the final fused feature vector input to the m-th classification layer.
Preferably, the final fused feature vector is calculated as:

z_m = α·h_shallow + (1 − α)·h_deep

α = σ(FFN(h_shallow))

where h_shallow is the shallow fused feature vector obtained by feature fusion of the hidden state vectors of the decoding layer corresponding to the m-th dynamic early-exit decision module and of all decoding layers before it, h_deep is the deep fused feature vector obtained by feature fusion of the hidden state vectors of the imitation learning networks corresponding to all decoding layers after that decoding layer, FFN(·) denotes a feedforward neural network, and σ denotes the sigmoid activation function.
Preferably, the imitation loss between the predicted hidden state vectors generated by the imitation learning networks and the true hidden state vectors output by the decoding layers is calculated as:

L_imit = −(1/(N − 1)) · Σ_{m=1}^{N−1} (1/(N − m)) · Σ_{k=m+1}^{N} cos(h_k, ĥ_k^(m))

where h_k denotes the true hidden state vector output by the k-th decoding layer, ĥ_k^(m) denotes the hidden state vector imitated by the k-th imitation learning network when exiting early at the m-th classification layer, and cos(h_k, ĥ_k^(m)) denotes their cosine similarity.
Preferably, the confidence threshold is set to any value within the range of [0.5,1) according to the requirements of the application scenario.
According to a third aspect of the present invention, there is provided an image description generation method, the method comprising: T1, acquiring an image to be processed; and T2, recognizing the image to be processed with an image description generation model trained by the method of the second aspect of the invention, to generate its text description.
Compared with the prior art, the invention has the following advantages: the invention designs an image description generation model and method based on dynamic early exit, which generate a natural language text describing the visual content of a given image while accelerating the generation process and allowing the acceleration ratio to be adjusted in real time. The invention thus effectively addresses the two shortcomings of existing picture description generation methods: the inability to adjust the acceleration ratio of the generation process in real time, and slow text generation.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of an image description generative model system framework according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an encoding layer structure according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a decoding layer structure according to an embodiment of the present invention;
FIG. 4 is a block diagram of a dynamic early-exit decision module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an imitation learning network architecture according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the feature vectors used by the imitation learning networks for dynamic early exit according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of feature fusion in the dynamic early-exit decision module according to an embodiment of the present invention;
FIG. 8 is a chart comparing experimental data according to an embodiment of the present invention;
FIG. 9 is a chart comparing the performance of early-exit models with and without fused features under different acceleration ratios, according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As described in the Background section, the image description generation task aims to generate a sentence of natural language text describing the visual content of a given image; the image description generation model predicts and outputs, one by one, the words constituting that text. Existing general image description generation technology suffers from slow text generation and from an acceleration ratio that cannot be adjusted in real time (the acceleration ratio is the speed-up factor of model decoding: for example, if decoding originally took 200 ms and now takes 50 ms, the acceleration ratio is 4).
The inventors observed that existing image description generation models generate text slowly because, during encoding and decoding, every generated word is output only after passing through all decoding layers, which greatly reduces generation speed. The invention provides an image description generation model framework based on dynamic early exit, comprising a visual encoder and a text decoder. Unlike the prior art, a dynamic early-exit decision module is attached to every decoding layer of the text decoder, so that a generated word need not pass through all decoding layers: the decision module of a given decoding layer judges, from the word prediction probabilities over the prediction vocabulary, whether the current prediction should exit early, and when the maximum of those probabilities exceeds a confidence threshold it outputs the corresponding word, saving the forward-propagation computation of the subsequent decoding layers and thereby increasing decoding speed.
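As an illustration of this decision rule, the following minimal sketch (PyTorch-style; the module names `decoder_layers` and `decision_modules`, their call signatures, and the threshold name `tau` are illustrative assumptions, not the patent's exact design) shows how a single word can be emitted as soon as some layer's classifier is confident enough:

```python
import torch

def predict_word_with_early_exit(x, memory, decoder_layers, decision_modules, tau=0.7):
    """Emit one word as soon as some layer's classifier is confident enough.

    x:                decoder input so far, shape (1, t, d)
    memory:           visual features from the encoder, shape (1, s, d)
    decoder_layers:   the N decoding layers (hypothetical modules)
    decision_modules: the N dynamic early-exit decision modules; each maps the
                      shallow hidden states seen so far (fusing in imitated
                      deep features internally) to vocabulary logits
    tau:              exit confidence threshold in [0.5, 1)
    """
    shallow_states = []
    n_layers = len(decoder_layers)
    for m, (layer, decide) in enumerate(zip(decoder_layers, decision_modules), start=1):
        x = layer(x, memory)                    # forward through decoding layer m
        shallow_states.append(x[:, -1])         # hidden state of the current word
        probs = torch.softmax(decide(shallow_states), dim=-1)
        conf, word_id = probs.max(dim=-1)
        if conf.item() > tau or m == n_layers:  # confident enough: exit at layer m,
            return word_id.item(), m            # skipping the remaining layers
```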
For a better understanding of the present invention, reference is made to the following detailed description taken in conjunction with the accompanying drawings and examples.
According to an embodiment of the present invention, there is provided an image description generation model based on dynamic early exit. As shown in fig. 1, the model is implemented with a Transformer architecture and comprises a visual encoder and a text decoder. The visual encoder comprises a plurality of coding layers connected in series (for convenience, the following embodiments assume N coding layers, numbered coding layer 1, coding layer 2, ..., coding layer N from front to back). The text decoder comprises a plurality of decoding layers connected in series (likewise assumed to be N decoding layers, numbered decoding layer 1, decoding layer 2, ..., decoding layer N from front to back), where each decoding layer is equipped with a dynamic early-exit decision module and an imitation learning network. The input of each dynamic early-exit decision module is connected to the outputs of the corresponding decoding layer, of all decoding layers before it, and of the imitation learning networks corresponding to all decoding layers after it; the input of each imitation learning network is connected to the outputs of the corresponding decoding layer and of all decoding layers before it, and the network predicts by imitation, from its input, the hidden state vector that the corresponding decoding layer would output.
According to an embodiment of the present invention, as shown in fig. 2, each coding layer comprises a self-attention layer and a feedforward neural network connected in sequence: the input of the self-attention layer is the input of the coding layer, the output of the self-attention layer is the input of the feedforward neural network, and the output of the feedforward neural network is the output of the coding layer and the input of the next coding layer. According to an embodiment of the present invention, as shown in fig. 3, each decoding layer comprises a self-attention layer, an encoding-decoding attention layer, and a feedforward neural network connected in sequence: the input of the self-attention layer is the input of the decoding layer, its output feeds the encoding-decoding attention layer, whose output feeds the feedforward neural network, whose output is the output of the decoding layer and the input of the next decoding layer.
According to an embodiment of the present invention, as shown in fig. 4, each dynamic early-exit decision module comprises a shallow feature fusion layer, a deep feature fusion layer, a fusion gating layer, and a classification layer, wherein: the shallow feature fusion layer fuses the hidden state vectors of the decoding layer corresponding to the decision module and of all decoding layers before it, to obtain a shallow fused feature vector; the deep feature fusion layer fuses the hidden state vectors of the imitation learning networks corresponding to all decoding layers after that decoding layer, to obtain a deep fused feature vector; the fusion gating layer fuses the shallow and deep fused feature vectors to obtain the final fused feature vector; and the classification layer is configured as a fully-connected layer that outputs, from the final fused feature vector, a prediction probability for each word in the prediction vocabulary. According to an embodiment of the present invention, as shown in fig. 5, each imitation learning network comprises a feedforward neural network.
Unlike a conventional Transformer model, in the text decoder of the invention a dynamic early-exit decision module containing a fully-connected layer as a classifier is placed after every decoding layer, to judge whether the currently predicted word should be output directly at the current layer, omitting forward propagation through the subsequent decoding layers. When the probability of the word predicted by the classifier after a decoding layer reaches a confidence threshold, the word is output early, and the forward-propagation computation of the subsequent decoding layers is saved, increasing decoding speed. (The classifier outputs the prediction probabilities of all words; the confidence check applies to the probability of the most likely word, and the threshold, as a hyper-parameter, can be set to any value in the range [0.5, 1), e.g., 0.5/0.6/0.7/0.8/0.9.) It should be noted that the data transmitted between different coding layers, between the last coding layer and the first decoding layer, between different decoding layers, between a decoding layer and its dynamic early-exit decision module, and between a decoding layer and its imitation learning network are the hidden state vectors output by the respective layers.
According to an embodiment of the present invention, there is provided a method for training the image description generative model described in the previous embodiments, the method comprising steps S1, S2, S3, each of which is described in detail below.
In step S1, an image set and the natural language description texts corresponding to all images in the image set are acquired; the data is preprocessed so that each image and one sentence of its corresponding natural language description text form a sample, the samples forming a data set; the data set is divided into a training set and a test set; and the words in the natural language texts corresponding to all images form a prediction vocabulary. According to one embodiment of the invention, the MS COCO Captions public data set is used, which contains 123287 pictures, each with 5 sentences of corresponding natural language description text. In the data preprocessing stage, for the picture data, open-source visual features are used, so the feature extraction step is omitted; for the text data, all words are converted to lowercase, special symbols are deleted, the frequency of each word is counted, and words occurring fewer than 5 times are deleted to obtain the prediction vocabulary. Each picture is copied 4 times so that it forms a picture-text pair with each of its corresponding texts, each pair serving as a sample of the data set. According to one embodiment of the present invention, the data set is divided into a training set and a test set at a ratio of 7:3.
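As an illustration of this text preprocessing step, a minimal sketch follows (the function name, the regular expression, and whitespace tokenization are illustrative assumptions, not the patent's exact procedure):

```python
import re
from collections import Counter

def build_vocabulary(captions, min_count=5):
    """Build the prediction vocabulary from all caption texts.

    captions:  list of natural-language description strings
    min_count: words occurring fewer than this many times are dropped
    """
    counter = Counter()
    for text in captions:
        # lowercase everything and delete special symbols
        text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
        counter.update(text.split())
    # keep only words that appear at least min_count times
    return [w for w, c in counter.items() if c >= min_count]
```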
In step S2, the image description generation model is trained to convergence using the training set. According to one embodiment of the invention, during training, batches of picture-text pairs are input into the model to update the model parameters: the picture of each sample is input to the visual encoder, the text of the sample is input to the text decoder as the reference text, all predicted words are output after the last decoding layer, and the total training loss function is optimized. According to one embodiment of the present invention, the visual encoder has 6 coding layers, the text decoder has 6 decoding layers, the hidden state vector dimension is 512, and the input of the fully-connected layer is 2048-dimensional. Training runs for 25 rounds with an initial learning rate of 3e-5, the learning rate decaying to 90% of its value every 5 rounds. The optimizer is Adam; prediction decoding time is measured on an NVIDIA GTX 2080 Ti graphics card, and every reported result is the average of three runs.
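The training configuration described above could be sketched as follows (a minimal PyTorch loop; `model` and `train_loader` are assumed to exist, and computing the total loss L inside the model's forward pass is an assumption for illustration):

```python
import torch

def train(model, train_loader, epochs=25):
    """Training setup matching the embodiment: Adam optimizer, initial
    learning rate 3e-5, decayed to 90% of its value every 5 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)
    for _ in range(epochs):
        for images, captions in train_loader:   # batched picture-text pairs
            loss = model(images, captions)      # total loss L = λ·L_ce + (1-λ)·L_imit
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```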
In step S3, the trained image description generation model is tested with the test set: the picture of each test sample is input to obtain the predicted text, and the confidence threshold of the dynamic early-exit decision module corresponding to each decoding layer is set, so that the decision module judges, from the word prediction probabilities over the prediction vocabulary, whether the current prediction should exit early, and outputs the word with the maximum probability when that maximum probability exceeds the confidence threshold. According to one embodiment of the invention, during model testing the confidence threshold of each layer's classifier is varied over {0.5, 0.6, 0.7, 0.8, 0.9}; when the maximum of the prediction probabilities of all words output by the classifier exceeds the set threshold, the word with the maximum probability exits at the current layer.
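This sweep can be expressed in a few lines; `evaluate` is a hypothetical helper returning a quality metric and the measured speed-up for a given threshold:

```python
# Trade quality against speed at test time, with no retraining:
for tau in (0.5, 0.6, 0.7, 0.8, 0.9):
    quality, speedup = evaluate(model, test_loader, threshold=tau)
    print(f"tau={tau}: quality={quality:.3f}, speedup={speedup:.2f}x")
```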
However, when the model exits the forward propagation process early, the feature vectors carrying the deep information of the uncomputed decoding layers are missing, so the classifier may predict words inaccurately and the quality of the generated text degrades. To solve this problem, when decoding exits at a given decoding layer, the invention modifies the feature vector fed to the classification layer: shallow features and deep features are fused and used together as the classification layer's input. Taking the case where the model exits decoding at the m-th layer, the deep feature vectors of the subsequent decoding layers m+1 to N are missing; the invention uses the imitation learning networks to predict the deep feature vectors of decoding layers m+1 to N, fuses the shallow feature vectors and the deep feature vectors separately, then fuses the two results to obtain the final fused feature vector, which serves as the input of the classification layer in the dynamic early-exit decision module of the m-th decoding layer. Still taking the model structure shown in fig. 1 as an example, suppose decoding exits at layer m, so the deep information of decoding layers m+1 to N is missing; as shown in fig. 6, the feature vectors of decoding layers m+1 to N are predicted by the imitation learning networks corresponding to those layers. Decoding layers m+1 to N are the layers not used during forward propagation; each decoding layer outputs a hidden state vector (denoted h), and each hidden state vector predicted by an imitation learning network is denoted ĥ (e.g., h_m and ĥ_m in fig. 6). Imitation learning is a learning paradigm characterized by imitating the behavior of a template sample; here, feedforward neural networks imitate the deep decoding layers to compute the deep features. As shown in fig. 7, the invention constructs the current feature vector by fusing all shallow and deep features and uses it as the input of the classification layer, so that the historical information already computed is retained and the predicted deep information of the future decoding layers is also included, avoiding information loss. The feature vector at the m-th decoding layer therefore consists of a shallow feature part and a deep feature part, coming respectively from the first m decoding layers and the last (N − m) decoding layers; the modeling process is as follows.
To aggregate all shallow features h_1, ..., h_m from a global perspective, the shallow feature fusion layer applies a feature fusion strategy g(·) to obtain the shallow fused state vector h_shallow:

h_shallow = g({h_1, ..., h_m})
The feature fusion strategy has a plurality of implementation forms, and the invention provides the following three types:
1) Splicing: all shallow features h_1, ..., h_m are concatenated in order into a single vector, which is then passed through a fully-connected layer for dimensionality reduction. With the concatenated vector denoted [h_1; ...; h_m] and the fully-connected layer denoted FC(·), the fused state vector is h_shallow = FC([h_1; ...; h_m]).
2) Attention weighting: m attention weights summing to 1 weight the corresponding hidden state vectors. With w_i denoting the attention weight of the i-th hidden state vector, the weighted fused state vector is h_shallow = Σ_{i=1}^{m} w_i·h_i.
3) Time-sequence model: a single-layer long short-term memory network encodes the sequence of hidden state vectors {h_1, ..., h_m}, and the output state vector of the network's last step is taken as the fused state vector. With the network denoted LSTM(·), h_shallow = LSTM({h_1, ..., h_m}).
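A minimal PyTorch sketch of these three strategies follows; the dimensions, the padding of the splicing variant to a fixed width, and the class names are illustrative assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

d = 512          # hidden size used in the embodiment
m_max = 6        # maximum number of shallow states to fuse

class ConcatFusion(nn.Module):
    """1) Splicing: concatenate h_1..h_m, then reduce with a fully-connected layer."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(m_max * d, d)

    def forward(self, hs):                          # hs: list of m tensors (B, d)
        # pad the list to a fixed width so one FC layer handles every m (an assumption)
        padded = hs + [torch.zeros_like(hs[0])] * (m_max - len(hs))
        return self.fc(torch.cat(padded, dim=-1))

class AttentionFusion(nn.Module):
    """2) Attention weighting: weights summing to 1 over the m hidden states."""
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, hs):
        h = torch.stack(hs, dim=1)                  # (B, m, d)
        w = torch.softmax(self.score(h), dim=1)     # (B, m, 1), sums to 1 over m
        return (w * h).sum(dim=1)                   # weighted sum of hidden states

class LSTMFusion(nn.Module):
    """3) Time-sequence model: encode {h_1..h_m} with an LSTM, keep the last output."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, hs):
        out, _ = self.lstm(torch.stack(hs, dim=1))  # (B, m, d)
        return out[:, -1]                           # last-step state vector
```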
However, as noted above, if only the historical shallow features are aggregated and the future deep features are ignored, information is lost and the quality of the generated description text suffers. The invention therefore uses imitation learning to predict the deep features and fuses them with the historical shallow features. Specifically, each decoding layer is provided with an imitation learning network that imitates the computation of that decoding layer. During training, forward propagation passes through all N decoding layers of the model, and each imitation learning network is encouraged to imitate the hidden state vector output by its decoding layer. Assuming decoding ends at the m-th layer, any subsequent layer (say the k-th) receives the hidden state vector h_m of the m-th decoding layer as input when performing imitation learning; the output of the imitation learning network, i.e., the predicted deep feature vector ĥ_k^(m), can therefore be modeled as:

ĥ_k^(m) = FFN_k(h_m)

where FFN_k(·) denotes the k-th imitation learning network, implemented with a feedforward neural network.
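As a concrete rendering of ĥ_k^(m) = FFN_k(h_m), this minimal sketch (PyTorch; the class name and the inner width of 2048 are illustrative assumptions) implements one imitation learning network as a feedforward network:

```python
import torch.nn as nn

class ImitationNetwork(nn.Module):
    """FFN_k: predicts the hidden state the k-th decoding layer would output,
    given only the exit layer's hidden state h_m."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, h_m):          # h_m: (batch, d_model)
        return self.net(h_m)         # predicted deep feature ĥ_k
```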
Through imitation learning, an approximation of the hidden state vector that an uncomputed layer would output can be obtained at minimal cost. The similarity between the true value h_k and the predicted value ĥ_k^(m) generated by the imitation learning network is measured with cosine similarity:

cos(h_k, ĥ_k^(m)) = (h_k · ĥ_k^(m)) / (‖h_k‖ · ‖ĥ_k^(m)‖)

The loss function of the imitation learning can therefore be implemented as the average cosine similarity over all possible exit positions:

L_imit = −(1/(N − 1)) · Σ_{m=1}^{N−1} (1/(N − m)) · Σ_{k=m+1}^{N} cos(h_k, ĥ_k^(m))

The deep fused feature vector h_deep is likewise computed at the deep feature fusion layer with a feature fusion strategy g(·):

h_deep = g({ĥ_{m+1}^(m), ..., ĥ_N^(m)})

Once the shallow and deep fused states are obtained, they cannot simply be added, because their reliabilities differ. The invention uses an adaptive fusion gating mechanism that weights the shallow and deep fused state vectors according to their reliability:

α = σ(FFN(h_shallow))

z_m = α·h_shallow + (1 − α)·h_deep

where z_m is the final feature vector obtained by fusing the shallow and deep features, FFN is a feedforward neural network, and σ is the sigmoid activation function.
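The adaptive gate could be rendered as in this minimal sketch (PyTorch; the class name is an assumption, and applying the gate per dimension is one plausible reading of the formula):

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Adaptive gating: z_m = alpha * h_shallow + (1 - alpha) * h_deep,
    with alpha = sigmoid(FFN(h_shallow))."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, h_shallow, h_deep):
        alpha = torch.sigmoid(self.ffn(h_shallow))   # per-dimension gate in (0, 1)
        return alpha * h_shallow + (1 - alpha) * h_deep
```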
During training it is not known at which layer a word will exit early; therefore the classification layer of every decoding layer predicts the current word distribution from the fused final feature vector, and the predicted text is output once the distribution satisfies the exit confidence threshold. The cross-entropy loss between the predicted words and the correct words y_i serves as the generation loss for training the model:

L_ce = −(1/N) · Σ_{m=1}^{N} log p_m(y_i)

where p_m = softmax(z_m).
The total training loss, adjusted with the parameter λ, is therefore:

L = λ·L_ce + (1 − λ)·L_imit

where λ is a balance factor that adjusts the relative influence of the dynamic early-exit decision modules and the imitation learning networks, and is a constant in the range (0, 1).
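Putting the pieces together, the total objective could be computed as in this sketch, which assumes the `ImitationNetwork` modules above, per-layer classifier logits z_m, and the true per-layer hidden states; averaging the negative cosine similarity over all (m, k) pairs is a simplification of the per-m averaging in the formula above:

```python
import torch
import torch.nn.functional as F

def total_loss(layer_logits, target_ids, true_states, imitation_nets, lam=0.5):
    """L = lam * L_ce + (1 - lam) * L_imit.

    layer_logits:   tensor (N, B, V): vocabulary logits z_m of each classification layer
    target_ids:     tensor (B,): indices of the correct words y_i
    true_states:    list of N tensors (B, d): true hidden states h_1..h_N
    imitation_nets: list of N ImitationNetwork modules, one per decoding layer
    """
    N = layer_logits.size(0)
    # generation loss: cross-entropy averaged over all N classification layers
    l_ce = sum(F.cross_entropy(layer_logits[m], target_ids) for m in range(N)) / N

    # imitation loss: negative cosine similarity between true and imitated states
    sims, n_terms = 0.0, 0
    for m in range(1, N):                       # possible exit positions m < N
        h_m = true_states[m - 1]
        for k in range(m + 1, N + 1):           # deeper layers imitated from h_m
            h_hat = imitation_nets[k - 1](h_m)
            sims = sims + F.cosine_similarity(true_states[k - 1], h_hat, dim=-1).mean()
            n_terms += 1
    l_imit = -sims / max(n_terms, 1)            # maximize similarity

    return lam * l_ce + (1 - lam) * l_imit
```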
To verify the effect of the invention, comparison experiments between different models and the invention (denoted "ours") were run on the MS COCO Captions public data set, comparing the following metrics:
BLEU-4: a bilingual evaluation metric that scores the overlap of 4-grams between the generated text and the reference text.
METEOR: a translation evaluation metric based on explicit ordering that computes the precision and recall of the generated text.
ROUGE: a recall-oriented evaluation metric that computes the recall of the generated text.
CIDEr: a consensus-based image description evaluation metric that scores the match between reference and generated texts by cosine similarity.
SPICE: a semantic propositional image caption evaluation metric.
Acceleration ratio: the speed-up factor of model decoding.
The experimental results are shown in fig. 8: the model of the invention achieves a 4-5× acceleration ratio with only a small loss in the quality of the generated text.
In addition, under different acceleration ratios, the invention compares the performance of a variant model that does not use fused feature vectors for early-exit word prediction (TF-EE) with the model that does (DeeCap); the results are shown in fig. 9. The experiments show that the model using fused feature vectors for early-exit prediction keeps the performance loss below 5% while accelerating decoding 4-5×, and that real-time control of the acceleration ratio is achieved by adjusting the classifier's output confidence threshold (i.e., switching between different points on the solid performance curve is instantaneous and requires no retraining).
In summary, the invention designs an image description generation model and method based on dynamic early exit, which generate a natural language text describing the visual content of a given image while accelerating the generation process and allowing the acceleration ratio to be adjusted in real time. The invention thus effectively addresses the inability of existing picture description generation methods to adjust the acceleration ratio of the generation process in real time, as well as their slow text generation.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (18)

1. An image description generation model based on dynamic early exit, for outputting a natural language description text of an image from an input image, wherein the image description generation model comprises:
a visual encoder for extracting the visual features of the image, comprising a plurality of coding layers connected in series; and
a text decoder for decoding the visual features output by the visual encoder and sequentially outputting words from a prediction vocabulary to form the natural language description text of the image, comprising a plurality of decoding layers connected in series, wherein each decoding layer is equipped with a dynamic early-exit decision module and an imitation learning network; wherein:
each dynamic early-exit decision module is configured to judge, during word prediction, whether the current prediction should exit early according to the word prediction probabilities over the prediction vocabulary, and to output the word with the maximum probability when the maximum of the prediction probabilities of the words in the prediction vocabulary exceeds a confidence threshold, the input of each dynamic early-exit decision module being connected to the outputs of the corresponding decoding layer, of all decoding layers before it, and of the imitation learning networks corresponding to all decoding layers after it; and
the input of each imitation learning network is connected to the outputs of the corresponding decoding layer and of all decoding layers before it, the imitation learning network being configured to predict by imitation, from its input, the hidden state vector output by the corresponding decoding layer.
2. The model of claim 1, wherein each coding layer comprises a self-attention layer and a feed-forward neural network connected in series.
3. The model of claim 2, wherein each decoding layer comprises a self-attention layer, an encoding-decoding self-attention layer, and a feedforward neural network connected in sequence.
4. The model of claim 3, wherein each dynamic early exit decision module comprises a shallow feature fusion layer, a deep feature fusion layer, a fusion gating layer, a classification layer, wherein:
the shallow feature fusion layer is configured to perform feature fusion on the hidden state vectors of the decoding layer corresponding to the dynamic early-exit decision module and of all decoding layers before it, to obtain a shallow fused feature vector;
the deep feature fusion layer is configured to perform feature fusion on the hidden state vectors of the imitation learning networks corresponding to all decoding layers after the decoding layer corresponding to the dynamic early-exit decision module, to obtain a deep fused feature vector;
the fusion gating layer is configured to fuse the shallow fused feature vector and the deep fused feature vector to obtain a final fused feature vector; and
the classification layer is configured as a fully-connected layer for outputting, from the final fused feature vector, a prediction probability for each word in the prediction vocabulary.
5. The model of claim 4, wherein the shallow feature fusion layer performs feature fusion in any one of the following ways:
splicing, attention weighting, or a time-sequence model.
6. The model of claim 4, wherein the deep feature fusion layer performs feature fusion in any one of the following ways:
splicing, attention weighting, or a time-sequence model.
7. The model of claim 4, wherein each imitation learning network comprises a feedforward neural network.
8. The model of claim 5, wherein said visual encoder comprises 6 encoding layers and said text decoder comprises 6 decoding layers.
9. A method of training an image description generation model as claimed in any one of claims 1 to 8, wherein the method comprises:
S1, acquiring an image set and natural language description texts corresponding to all images in the image set, combining each image with one sentence of its corresponding natural language description text to form a sample, the samples forming a data set, dividing the data set into a training set and a test set, and forming a prediction vocabulary from the words in the natural language texts corresponding to all images;
S2, training the image description generation model to convergence using the training set; and
S3, testing the trained image description generation model using the test set, and setting the confidence threshold of the dynamic early-exit decision module corresponding to each decoding layer, so that during word prediction the dynamic early-exit decision module judges, from the word prediction probabilities over the prediction vocabulary, whether the current prediction should exit early, and outputs the word with the maximum probability when the maximum of the prediction probabilities of the words in the prediction vocabulary exceeds the confidence threshold.
10. The method according to claim 9, wherein the step S1 includes:
S11, acquiring an image set and natural language description texts corresponding to all images in the image set, and duplicating each image whose corresponding natural language description text contains more than one sentence, according to the number of sentences, so that each sentence of text corresponds to one image and forms an image-text pair, all image-text pairs serving as samples to form a data set;
S12, preprocessing all natural language texts corresponding to all images in the image set to extract all words in the texts, counting the frequency of each word, and deleting words whose frequency is below a preset frequency, to obtain the prediction vocabulary; and
and S13, dividing the data set into a training set and a testing set.
11. The method according to claim 9, wherein the model is trained in step S2 by using the following loss function:
L = λ·L_ce + (1 − λ)·L_imit

wherein λ is a balance factor for adjusting the relative influence of the dynamic early-exit decision modules and the imitation learning networks, L_ce is the cross-entropy loss between the predicted words and the correct words, and L_imit is the imitation loss between the predicted hidden state vectors generated by the imitation learning networks and the true hidden state vectors output by the decoding layers.
12. The method of claim 11, wherein the cross-entropy loss between the predicted word and the correct word is calculated by:
L_ce = −(1/N) · Σ_{m=1}^{N} log p_m(y_i)

p_m = softmax(z_m)

wherein N is the total number of decoding layers, y_i denotes the correct word, softmax denotes the activation function, and z_m denotes the final fused feature vector input to the m-th classification layer.
13. The method of claim 12, wherein the final fused feature vector is calculated by:
z_m = α·h_shallow + (1 − α)·h_deep

α = σ(FFN(h_shallow))

wherein h_shallow is the shallow fused feature vector obtained by feature fusion of the hidden state vectors of the decoding layer corresponding to the m-th dynamic early-exit decision module and of all decoding layers before it, h_deep is the deep fused feature vector obtained by feature fusion of the hidden state vectors of the imitation learning networks corresponding to all decoding layers after that decoding layer, FFN(·) denotes a feedforward neural network, and σ denotes the sigmoid activation function.
14. The method of claim 11, wherein the imitation loss between the predicted hidden state vectors generated by the imitation learning networks and the true hidden state vectors output by the decoding layers is calculated by:

L_imit = −(1/(N − 1)) · Σ_{m=1}^{N−1} (1/(N − m)) · Σ_{k=m+1}^{N} cos(h_k, ĥ_k^(m))

wherein h_k denotes the true hidden state vector output by the k-th decoding layer, ĥ_k^(m) denotes the hidden state vector imitated by the k-th imitation learning network when exiting early at the m-th classification layer, and cos(h_k, ĥ_k^(m)) denotes their cosine similarity.
15. The method of claim 10, wherein the confidence threshold is set to any value within the range of [0.5,1) according to the requirements of the application scenario.
16. An image description generation method, characterized in that the method comprises:
t1, acquiring an image to be processed;
t2, recognizing the image to be processed to generate the text description using the image description generation model trained based on the method of any one of claims 9-15.
17. A computer-readable storage medium, having stored thereon a computer program executable by a processor to perform the steps of the method of any one of claims 9 to 16.
18. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method of any one of claims 9 to 16.
CN202210439734.4A 2022-04-25 2022-04-25 Image description generation model based on dynamic early exit, and model training method Pending CN114743056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210439734.4A CN114743056A (en) 2022-04-25 2022-04-25 Image description generation model based on dynamic early exit, and model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210439734.4A CN114743056A (en) 2022-04-25 2022-04-25 Image description generation model based on dynamic early exit, and model training method

Publications (1)

Publication Number Publication Date
CN114743056A true CN114743056A (en) 2022-07-12

Family

ID=82282810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210439734.4A Pending CN114743056A (en) 2022-04-25 2022-04-25 Image description generation model based on dynamic early exit, and model training method

Country Status (1)

Country Link
CN (1) CN114743056A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024015591A1 (en) * 2022-07-14 2024-01-18 Google Llc Efficient decoding of output sequences using adaptive early exiting



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination