CN110704606B - Generation type abstract generation method based on image-text fusion - Google Patents

Generation type abstract generation method based on image-text fusion

Info

Publication number
CN110704606B
CN110704606B (application CN201910764261.3A)
Authority
CN
China
Prior art keywords
image
text
abstract
features
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910764261.3A
Other languages
Chinese (zh)
Other versions
CN110704606A (en
Inventor
曹亚男
徐灏
尚燕敏
刘燕兵
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910764261.3A priority Critical patent/CN110704606B/en
Publication of CN110704606A publication Critical patent/CN110704606A/en
Application granted granted Critical
Publication of CN110704606B publication Critical patent/CN110704606B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text

Abstract

The invention discloses a generative abstract generation method based on image-text fusion, which comprises the following steps: 1) dividing a given text data set into a training set, a verification set and a test set; each sample in the text data set is a triple (X, I, Y), wherein X is a text, I is the image corresponding to the text X, and Y is the abstract of the text X; 2) extracting entity features from the images of the text data set, and expressing the extracted entity features as image feature vectors with the same dimensionality as the text; 3) training the generative abstract model by using the training set and the image feature vectors corresponding to the training set; 4) inputting a text and a corresponding image, generating the image feature vector of the image, and then inputting the text and the corresponding image feature vector into the trained generative abstract model to obtain the abstract corresponding to the text. The abstract generated by the invention can effectively adjust the weight of entities in the text and alleviate the out-of-vocabulary (unregistered word) problem to a certain extent.

Description

Generation type abstract generation method based on image-text fusion
Technical Field
The invention belongs to the technical field of artificial intelligence, and relates to a generative abstract generation method based on image-text fusion.
Background
The existing generative abstract methods are mainly implemented with the deep-learning Seq2Seq framework and an attention mechanism. The Seq2Seq framework is composed of an encoder and a decoder, both implemented by a neural network, which may be a recurrent neural network (RNN) or a convolutional neural network (CNN). The specific process is as follows: the encoder encodes the input original text into a vector (the context), which is a representation of the original text; the decoder is then responsible for extracting the important information from this vector and generating the text summary. The attention mechanism addresses the information-loss bottleneck caused by compressing a long sequence into a fixed-length vector, i.e. it focuses the decoder's attention on the corresponding part of the context at each step.
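For concreteness, the following is a minimal sketch of such an encoder-decoder (Seq2Seq) pair. PyTorch and GRU units are assumptions chosen for illustration, not details of the invention; module names and dimensions are likewise illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Encodes the input token sequence into hidden states (the "context").
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, h = self.rnn(self.embed(src))   # outputs: (batch, src_len, hid_dim)
        return outputs, h

class Decoder(nn.Module):
    # Generates the summary one token at a time from the encoder context.
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, h):            # prev_token: (batch, 1)
        output, h = self.rnn(self.embed(prev_token), h)
        return self.out(output.squeeze(1)), h    # logits over the vocabulary
```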
Although the deep-learning Seq2Seq framework with an attention mechanism has reached a certain level of performance in summary generation, it tends to generate high-frequency words, which leads to the problem of key-entity deviation. In general, key-entity deviation takes two forms: first, due to hardware-resource limitations a limited vocabulary is generally adopted, and some rare key entity words of the article do not appear in the vocabulary, so these key entities are lost in the generated abstract; second, relatively low-frequency entities are ignored.
In order to solve the problem of key entity deviation, the invention provides a generating type abstract method based on image-text fusion.
Disclosure of Invention
The method and the device can solve the problem that key entities of the existing generated abstract are lost, so that the quality and readability of the generated abstract are improved.
The technical problem is solved by the following technical scheme:
a method for generating a generating abstract based on image-text fusion comprises the following steps:
step 1, carrying out data preprocessing operations such as stop word removal, special word marking and the like on a given text data set, and dividing the data into a training set, a verification set and a test set after shuffling. Each sample in the text dataset is a triplet (X, I, Y); where X is the text, I is the corresponding image (i.e., the image that matches X), and Y is the summary of the text X.
And 2, extracting the main entity features from the images corresponding to the text data set of step 1, and expressing them as image features with the same dimension as the text. The extracted features comprise one global image representation and the image representations of the three largest key-entity regions. Taking a text A as an example: if A contains 30 words and each word vector is 128-dimensional, the text is represented by 30 128-dimensional vectors; the image contributes the global representation plus the three largest regional entities, i.e. 4 more 128-dimensional vectors, so the fused input consists of 34 128-dimensional vectors in total.
And 3, training the model by using the training set processed in step 1 and the corresponding image features obtained in step 2.
And 4, testing the performance of the model by using the test set after the abstract generation model is trained, wherein the Rouge evaluation index can be used.
And 5, in practical application, inputting a text and a corresponding image on an interactive interface, generating image characteristics of the image, and then inputting the input text and the corresponding image characteristics into the trained generative abstract model to obtain a corresponding abstract.
In step 1, the text data is preprocessed as follows:
step 1.1, the given original data set is subjected to one-to-one correspondence of texts, abstracts and images to obtain a triple (X, I, Y) of each sample.
And step 1.2, removing special characters, emoticons, full-width characters and the like from the text and the abstract.
And step 1.3, replacing all hyperlink URLs by using TAGURL, replacing all dates by using TAGDATA, replacing all numbers by using TAGNUM and replacing all punctuation marks by using TAGPUN in the data set obtained in the step 1.2.
And step 1.4, filtering stop words from the data cleaned in step 1.3 by using a stop-word list.
And step 1.5, the texts, the abstracts and the images are shuffled simultaneously in a one-to-one correspondence manner, and are proportionally divided into a training set, a verification set and a test set.
And step 1.6, constructing a word list of a certain size from the data set, representing words in the text and the abstract that do not appear in the dictionary as 'UNK', adding the mark 'BOS' at the beginning of each document and 'EOS' at the end, processing the text and the abstract to fixed lengths respectively, truncating redundant words directly, and padding sequences shorter than the fixed length with the placeholder 'PAD'.
Step 1.7, using the word-embedding toolkit of Gensim, each word in the text summary data set, including the special tokens of step 1.6, is represented by a word vector of fixed dimension k.
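A minimal sketch of the preprocessing of steps 1.3 to 1.6 follows (Python). The tag strings and special tokens follow the text above; the regular expressions, the assumed date format, and the helper names are illustrative assumptions.

```python
import re

TAG_PATTERNS = [
    (re.compile(r"https?://\S+"), "TAGURL"),        # step 1.3: hyperlinks
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "TAGDATA"),  # dates (format assumed)
    (re.compile(r"\d+"), "TAGNUM"),                 # numbers
    (re.compile(r"[^\w\s]"), "TAGPUN"),             # punctuation
]

def preprocess(text, stop_words=frozenset()):
    for pattern, tag in TAG_PATTERNS:
        text = pattern.sub(tag, text)
    # step 1.4: stop-word filtering (skipped when stop_words is empty)
    return [w for w in text.lower().split() if w not in stop_words]

def to_fixed_length(tokens, max_len, vocab):
    # Step 1.6: add BOS/EOS, truncate, pad, and map OOV words to UNK.
    # vocab must contain the special tokens "UNK", "BOS", "EOS" and "PAD".
    tokens = ["BOS"] + tokens[:max_len - 2] + ["EOS"]
    tokens += ["PAD"] * (max_len - len(tokens))
    return [vocab.get(w, vocab["UNK"]) for w in tokens]
```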
In step 2, a generating abstract model based on image-text fusion is shown in fig. 1, and includes three modules: the method comprises a feature extraction module, a feature fusion module and an abstract generation module respectively, wherein step 2 is a detailed feature extraction method, and details are as follows:
and 2.1, capturing key entity characteristics of the corresponding images by using the images in the step 1.5 one by one through a Regional Convolutional Neural Network (RCNN) tool. The regional convolutional neural network algorithm comprises four steps of candidate region generation, feature extraction, category marking and position trimming, and the detailed process is as follows:
step 2.1.1, first, an over-segmentation technique is applied to segment each image into as many independent regions as possible, typically more than 1000. Then, the areas of the same image are merged according to a certain rule, and the merging rule comprises similar color merging, similar texture merging and the like. And finally, taking all the regions which appear after combination in the process as preliminary candidate regions.
And 2.1.2, performing feature extraction on each preliminary candidate region appearing in the step 2.1.1 by using a CNN network.
And 2.1.3, inputting the feature representation obtained by each preliminary candidate region into a Support Vector Machine (SVM) classifier, judging whether the feature representation is a corresponding entity label, if so, marking the entity label as 1, performing the step 2.1.4, if not, marking the entity label as 0, and deleting the candidate region.
And 2.1.4, correcting the frame position of the preliminary candidate region according to the result of the category mark by using a Regression (Regression) model. Specifically, for each class of objects, a Linear Ridge Regressor (LRR) is used for refinement.
And 2.2, sequencing the regional entity characteristics of each image obtained in the step 2.1 according to the size of the region, and selecting the first three regional entity characteristics with the largest region as candidate regions.
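The patent follows an RCNN-style pipeline (over-segmentation, region merging, CNN features, SVM labelling, box regression). As a rough, simplified stand-in for the candidate-region generation of step 2.1.1 and the area-based top-3 selection of step 2.2, the sketch below uses graph-based over-segmentation from scikit-image; the SVM labelling and bounding-box regression of steps 2.1.3 and 2.1.4 are omitted, so this is only an assumption-laden illustration, not the inventors' implementation.

```python
import numpy as np
from skimage.segmentation import felzenszwalb
from skimage.measure import regionprops

def top_region_boxes(image, k=3):
    # Over-segment the image into many small regions (stand-in for step 2.1.1;
    # the similar-color / similar-texture merging heuristics are not reproduced).
    labels = felzenszwalb(image, scale=100, sigma=0.5, min_size=50) + 1
    # Rank regions by area and keep the k largest ones (step 2.2).
    regions = sorted(regionprops(labels), key=lambda r: r.area, reverse=True)[:k]
    return [r.bbox for r in regions]   # (min_row, min_col, max_row, max_col) boxes
```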
Step 2.3, the VGG-16 network is used uniformly: as shown in FIG. 2, each candidate-region feature obtained in step 2.2 is represented by the output of the fc7 layer as a 4096-dimensional image feature, and the global image feature is likewise represented as a 4096-dimensional image feature.
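A hedged sketch of this fc7 feature extraction using torchvision's pretrained VGG-16 follows. The exact cropping and preprocessing pipeline of the invention is not specified, so standard ImageNet preprocessing is assumed; function names are illustrative.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(pretrained=True).eval()
# fc7 is the second 4096-dimensional fully connected layer of VGG-16:
# conv features -> avgpool -> flatten -> classifier[:4] (fc6, ReLU, Dropout, fc7)
fc7 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                          vgg.classifier[:4])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(full_image, region_crops):
    # region_crops: the three largest candidate regions from step 2.2 (PIL images).
    crops = [full_image] + list(region_crops)        # global image + top-3 regions
    batch = torch.stack([preprocess(img) for img in crops])
    with torch.no_grad():
        return fc7(batch)                            # shape (1 + 3, 4096)
```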
In the step 3, the detailed steps of feature fusion and abstract generation are as follows:
step 3.1, converting each 4096-dimensional image feature obtained by 2.3 into a feature with the same dimension as the text by using a bilinear network, wherein the feature can be represented as It=WiIvIn which IvRepresenting the image characteristics, W, obtained in step 2.3iIs a parameter of the bilinear network, ItRepresenting image feature vectors of the same dimension as the text.
And 3.2, for the same sample, the text vector obtained in step 1.7 and the image feature vector obtained in step 3.1 are spliced into A; A is combined with the original abstract Y to obtain a pair (A, Y), giving the vectorized training, verification and test sets.
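A minimal sketch of steps 3.1 and 3.2 under the 128-dimensional word-vector example given earlier (PyTorch assumed; class and variable names are illustrative): the 4096-dimensional image features are mapped by a bias-free linear layer W_i to the text dimension and spliced with the text vectors to form A.

```python
import torch
import torch.nn as nn

class ImageTextFusionInput(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=128):
        super().__init__()
        # Step 3.1: I_t = W_i · I_v, implemented as a linear layer without bias.
        self.W_i = nn.Linear(img_dim, txt_dim, bias=False)

    def forward(self, text_vectors, image_features):
        # text_vectors:   (batch, n_words, txt_dim)  -- word vectors from step 1.7
        # image_features: (batch, 4, img_dim)        -- global + 3 regions, step 2.3
        img_vectors = self.W_i(image_features)       # (batch, 4, txt_dim)
        # Step 3.2: splice image and text vectors into one input sequence A.
        return torch.cat([img_vectors, text_vectors], dim=1)
```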
Step 3.3, k samples are drawn from the new training set obtained in step 3.2 and input into the encoder in sequence to obtain the joint encoding h_s of the text and the image; the current decoder state h_t is then computed by means of an intermediate semantic vector c_t, thereby realizing feature fusion. The detailed settings are as follows:
The summary generation module generates the summary using the fused features. An input training sample is denoted (A, Y), where A = {a_1, a_2, …, a_n} represents the n text and image features, Y = {y_1, y_2, …, y_m} is the reference summary, and the generated summary is denoted Ŷ.
In the encoding stage, the input feature vector at the current time step i is denoted a_i (a vector spliced from text and image); the hidden-layer output at the previous time step is denoted h_{s-1}, so the hidden-layer output at the current time step is h_s = f(h_{s-1}, a_i).
In the decoding stage, h_t denotes the hidden state of the decoder at the current time step.
The transfer matrix W_a is used to calculate the degree of association between the current state h_t and h_s, i.e. score(h_t, h_s) = h_t·W_a·h_s. After normalization (a softmax over the source positions), the attention weight a_t(s) = exp(score(h_t, h_s)) / Σ_{s'} exp(score(h_t, h_{s'})) is obtained, from which the intermediate semantic vector c_t = a_t(s)·h_s follows. The corresponding attention-augmented decoder hidden state h̃_t is obtained through the parameter network W_c and an activation function, whose expression is h̃_t = tanh(W_c[c_t; h_t]).
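This is the general (multiplicative) attention score named in step 3.3; the softmax normalization and the tanh combination layer are the standard formulation and are assumed here. A minimal sketch (PyTorch; names and dimensions illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAttention(nn.Module):
    def __init__(self, hid_dim=256):
        super().__init__()
        self.W_a = nn.Linear(hid_dim, hid_dim, bias=False)      # transfer matrix W_a
        self.W_c = nn.Linear(2 * hid_dim, hid_dim, bias=False)  # parameter network W_c

    def forward(self, h_t, h_s):
        # h_t: (batch, hid)           decoder state at the current step
        # h_s: (batch, src_len, hid)  joint text + image encodings
        score = torch.bmm(self.W_a(h_s), h_t.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        a_t = F.softmax(score, dim=1)                                  # normalized weights
        c_t = torch.bmm(a_t.unsqueeze(1), h_s).squeeze(1)              # intermediate semantic vector
        h_tilde = torch.tanh(self.W_c(torch.cat([c_t, h_t], dim=1)))   # attention-augmented state
        return h_tilde, a_t
```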
Step 3.4, the decoder hidden state h̃_t obtained in step 3.3 is passed through the softmax layer to obtain the generated abstract, expressed as p(y_t | y_1, …, y_{t-1}, A) = softmax(W_s·h̃_t), wherein y_t is the t-th word of the generated abstract Y, A is the spliced feature of the text vector and the image feature vector of the sample, and W_s is a parameter matrix.
Step 3.5, steps 3.3 and 3.4 are repeated to train the model with the optimization objective L(θ) = −Σ_{n=1}^{N} log p(y_n | a_n; θ) until the model converges; N is the total number of samples in the training set, θ is the model parameter, and y_n is the n-th word of the abstract.
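A hedged sketch of the output layer and training objective of steps 3.4 and 3.5 (PyTorch): the fused decoder state is projected by W_s and a softmax, and training minimizes the negative log-likelihood of the reference summary. Teacher forcing, the optimizer argument, the PAD index and the vocabulary size are illustrative assumptions not stated in the text.

```python
import torch
import torch.nn as nn

# Output projection W_s followed by softmax (step 3.4); hid_dim / vocab_size illustrative.
W_s = nn.Linear(256, 50000)
criterion = nn.CrossEntropyLoss(ignore_index=0)   # assume PAD has index 0

def training_step(model, optimizer, A, Y):
    # A: fused input features (batch, n, dim); Y: reference summary token ids (batch, m)
    optimizer.zero_grad()
    h_tilde = model(A, Y[:, :-1])                 # decoder states under teacher forcing
    logits = W_s(h_tilde)                         # (batch, m - 1, vocab)
    # Negative log-likelihood over the summary words (optimization objective of step 3.5).
    loss = criterion(logits.reshape(-1, logits.size(-1)), Y[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```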
In step 4, the model is evaluated as follows:
step 4.1, inputting the characteristics of the test set obtained in the step 3.2 into the model trained in the step 3.5 to obtain a corresponding abstract;
step 4.2, the human-written abstracts of the test set are paired one-to-one with the generated abstracts from step 4.1 to obtain abstract pairs (Y, Ŷ);
step 4.3, the pairs (Y, Ŷ) are fed into the Rouge toolkit and the F-measure of Rouge-1, Rouge-2 and Rouge-L is evaluated.
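The patent evaluates with the Rouge toolkit; below is a hedged sketch of an equivalent evaluation using the rouge-score Python package (an alternative implementation, not necessarily the toolkit the inventors used).

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def evaluate(references, hypotheses):
    # Average F-measure of Rouge-1, Rouge-2 and Rouge-L over the test pairs (Y, Ŷ).
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, hyp in zip(references, hypotheses):
        scores = scorer.score(ref, hyp)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: value / len(references) for key, value in totals.items()}
```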
In step 5, the model is applied in the same manner as step 4.1.
Compared with the prior art, the invention has the following positive effects:
compared with a pure text generation system, the abstract generated by the invention can effectively adjust the weight of the entity in the text and relieve the problem of unregistered words to a certain extent.
Drawings
FIG. 1 is a diagram of a generative abstract model based on image-text fusion;
FIG. 2 is a VGG-16 network model diagram.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
The present embodiment uses the multi-modal sentence summarization data set MMSS, which contains (X, Y, I) triples of text, image and abstract, where the text and abstract come from the Gigaword data set widely used for evaluating summarization systems, and the images are retrieved by a search engine. After manual screening, an (X, Y, I) triple data set is obtained that comprises 66,000 training samples and 2,000 samples each for the verification set and the test set.
Step 1, preprocessing a data set.
Step 1.1, the given original data set is subjected to text, abstract and image one-to-one correspondence, namely (X, Y, I).
And step 1.2, removing special characters, emoticons and full-width characters from the text and the abstract.
And step 1.3, replacing all hyperlink URLs by using TAGURL, replacing all dates by using TAGDATA, replacing all numbers by using TAGNUM and replacing all punctuation marks by using TAGPUN in the data set obtained in the step 1.2.
Step 1.4, since MMSS is a sentence-level summarization data set and the texts are short, stop words are not filtered on this data set.
And step 1.5, the preprocessed text abstract images (X, Y, I) are shuffled simultaneously in a one-to-one correspondence manner, and are proportionally divided into a training set, a verification set and a test set.
Step 1.6, constructing a dictionary of 5,000 words from the data set, representing words in the text and the abstract that do not appear in the dictionary as 'UNK', adding the mark 'BOS' at the beginning of each document and 'EOS' at the end, limiting the text length to at most 120 words and the abstract to at most 30 words, truncating redundant words directly, and padding shorter sequences with the placeholder 'PAD'.
Step 1.7, using the WordEmbedding toolkit of Gensim, each word in the text summary data set, including the special tokens of step 1.6, is represented by a fixed 256-dimensional word vector.
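A hedged sketch of step 1.7 with Gensim follows; Word2Vec is assumed as the concrete embedding model (the text only says "word embedding toolkit"), and the keyword is vector_size in Gensim 4.x (size in older versions).

```python
from gensim.models import Word2Vec

def train_embeddings(tokenized_corpus, dim=256):
    # tokenized_corpus: list of token lists from the preprocessed texts and abstracts,
    # including the special tokens UNK/BOS/EOS/PAD of step 1.6.
    model = Word2Vec(sentences=tokenized_corpus, vector_size=dim,
                     window=5, min_count=1, workers=4)
    return model.wv        # lookup table, e.g. model.wv["TAGURL"] -> 256-d vector
```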
And 2, extracting main characteristic entities from the image I corresponding to the text data set in the step 1, and expressing the main characteristic entities into image characteristics with the same dimension as the text.
And 2.1, the key entity features of each image from step 1.5 are captured one by one using the Regional Convolutional Neural Network (RCNN) tool.
And 2.2, sorting the regional entity characteristics of each image obtained in the step 2.1 according to the size of the region, and selecting the first three regions with the largest regions as candidate regions.
Step 2.3, the VGG-16 network is used uniformly, and each region feature obtained in step 2.2 is represented by the output of the fc7 layer as a 4096-dimensional feature.
And 3, a generating abstract model based on image-text fusion is trained by using the training sets in the steps 1 and 2.
And 3.1, each 4096-dimensional region feature obtained in step 2.3 is converted by a bilinear network into a 256-dimensional feature with the same dimension as the text.
And 3.2, the image features obtained in step 3.1 are spliced with the text obtained in step 1.7, with the image features placed in front of the text after the 'BOS' mark, yielding again the vectorized training, verification and test sets.
And 3.3, sampling 64 samples of the new training set obtained in the step 3.2, and sequentially inputting the samples into the model for training.
And 3.4, repeating the step 3.3 until the model converges on the training set and is optimal on the verification set.
Step 4, after the abstract generation model is trained, the performance of the model is tested using the test set, with Rouge as the evaluation index.
Step 4.1, inputting the characteristics of the test set obtained in the step 3.2 into the model trained in the step 3 to obtain a corresponding abstract;
step 4.2, the human-written abstracts of the test set are paired one-to-one with the generated abstracts from step 4.1 to obtain abstract pairs (Y, Ŷ);
step 4.3, the pairs (Y, Ŷ) are fed into the Rouge toolkit and the F-measure of Rouge-1, Rouge-2 and Rouge-L is evaluated.
In order to compare the image-text fusion based generative summarization method of the invention (abbreviated MSE) with existing text-only models, the following baselines are used: Lead, which directly selects the first 8 words; Compress, which uses syntactic-structure compression; the original Seq2Seq model (Abs); the Seq2Seq model with an attention mechanism (Abs+A); and a Seq2Seq framework that learns multi-source data with a hierarchical attention mechanism (Multi-Source). The F-measure of the Rouge scores of the summaries each model generates on the test set is recorded, and the experimental results are shown in the following table:
system for controlling a power supply Rouge-1 Rouge-2 Rouge-L
Lead 33.46 13.40 31.84
Compress 31.56 11.02 28.87
Abs 35.95 18.21 31.89
Abs+A 41.11 21.75 39.92
Multi-Source 39.67 19.11 38.03
MSE 43.94 23.15 41.56
The experimental results show that, after image information is introduced, the image-text fusion based generative summarization method improves all three Rouge scores to a certain extent, especially Rouge-2, which further demonstrates the effectiveness of image-text fusion.
In practical application, a text is input in the interactive interface; the image input can be omitted at the application stage, in which case 'PAD' filling is used, and the corresponding abstract is obtained:
for example, the input text: "Japan's colleted kidu traction, the large sample Such infection in the country, had induced disorders of # # billion yen-lrb- # billion malls-rrb-, the bank of Japan sack sand wednesday"
Obtaining an abstract: "Japan's bank losses # # # billion yen".
As this practical case shows, the abstract generated by the invention effectively produces the entity 'bank'.
Although specific details of the invention, algorithms and figures are disclosed for illustrative purposes, these are intended to aid in the understanding of the contents of the invention and the implementation in accordance therewith, as will be appreciated by those skilled in the art: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The invention should not be limited to the preferred embodiments and drawings disclosed herein, but rather should be defined only by the scope of the appended claims.

Claims (9)

1. A method for generating a generative abstract based on image-text fusion comprises the following steps:
1) dividing a given text data set into a training set, a verification set and a test set; each sample in the text data set is a triple (X, I, Y), wherein X is a text, I is an image corresponding to the text X, and Y is an abstract of the text X; the generative abstract model comprises a feature extraction module, a feature fusion module and an abstract generation module;
2) the feature extraction module captures the entity features of each image by using a regional convolutional neural network, and then selects the first three entity features with the largest regions as candidate regions; then generating image features of the image global features and image features of the three candidate regions; then converting the image features into image feature vectors with the same dimension as the text;
3) training the generative abstract model by using the training set and the image feature vectors corresponding to the training set; during training, for the same sample, the feature fusion module splices the text vector corresponding to the sample and the image feature vector corresponding to the sample to obtain the vectorized training set, verification set and test set; then k samples are selected from the vectorized training set and sequentially input into an encoder to obtain the joint encoding h_s of the text and the image, and the hidden state h_t of the decoder is computed by means of an intermediate semantic vector c_t, thereby realizing feature fusion; then the abstract generating module generates an abstract by using the fused features;
4) inputting a text and a corresponding image and generating an image characteristic vector of the image, and then inputting the text and the image characteristic vector corresponding to the text into a trained generative abstract model to obtain an abstract corresponding to the text.
2. The method of claim 1, wherein the image feature vectors comprise an image global feature vector and the entity vectors of the three largest regions in the image.
3. The method of claim 1, wherein the feature fusion method is: the hidden-layer output at the current time step in the encoding stage is the joint encoding h_s, and the hidden state of the decoder at the current time step is h_t; the transfer matrix W_a is used to calculate the degree of association score(h_t, h_s) between h_t and h_s, which is normalized to obtain a_t(s); then the intermediate semantic vector c_t = a_t(s)·h_s and the decoder hidden state h̃_t are computed.
4. The method of claim 3, wherein the abstract generated by the abstract generation module is p(y_t | y_1, …, y_{t-1}, A) = softmax(W_s·h̃_t), wherein y_t is the t-th word of the generated abstract Y, A is the spliced feature of the text vector and the image feature vector of the sample, and W_s is a parameter matrix.
5. The method of claim 1, wherein the image features of each candidate region are converted into image feature vectors with the same dimension as the text using a bilinear network: I_t = W_i·I_v, where I_v denotes the image feature, W_i is a parameter of the bilinear network, and I_t denotes the image feature vector with the same dimension as the text.
6. The method of claim 1, wherein the method of capturing the entity features of each image using the regional convolutional neural network is:
21) dividing each image into a plurality of regions by applying an over-segmentation technology, then merging the regions of the same image according to a set merging rule, and taking all the regions appearing after merging as preliminary candidate regions;
22) performing feature extraction on each preliminary candidate region by using a CNN network;
23) inputting the features obtained from each preliminary candidate region into a support vector machine classifier, and judging whether the features are corresponding entity labels;
24) correcting the frame position of the preliminary candidate region according to the result of the category mark by using a regression model;
25) and sequencing the preliminary candidate regions of the image according to the sizes of the regions, and selecting entities corresponding to the first three regions with the largest regions as entity features of the image.
7. The method of claim 6, wherein the merging rule is similar-color merging or similar-texture merging.
8. The method of claim 1, wherein the trained generative digest model is tested using a test set, the generative digest model is verified using a verification set after the test is passed, and the step 4) is performed after the verification is passed.
9. The method of claim 1, wherein
L(θ) = −Σ_{n=1}^{N} log p(y_n | a_n; θ)
is used as the optimization objective to train the generative abstract model until the model converges; where N is the total number of samples in the training set, θ is the generative abstract model parameter, y_n is the n-th word of the abstract, and a_n is the feature corresponding to the n-th word.
CN201910764261.3A 2019-08-19 2019-08-19 Generation type abstract generation method based on image-text fusion Active CN110704606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910764261.3A CN110704606B (en) 2019-08-19 2019-08-19 Generation type abstract generation method based on image-text fusion

Publications (2)

Publication Number Publication Date
CN110704606A CN110704606A (en) 2020-01-17
CN110704606B true CN110704606B (en) 2022-05-31

Family

ID=69193427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910764261.3A Active CN110704606B (en) 2019-08-19 2019-08-19 Generation type abstract generation method based on image-text fusion

Country Status (1)

Country Link
CN (1) CN110704606B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414505B (en) * 2020-03-11 2023-10-20 上海爱数信息技术股份有限公司 Quick image abstract generation method based on sequence generation model
CN111563207B (en) * 2020-07-14 2020-11-10 口碑(上海)信息技术有限公司 Search result sorting method and device, storage medium and computer equipment
CN112541346A (en) * 2020-12-24 2021-03-23 北京百度网讯科技有限公司 Abstract generation method and device, electronic equipment and readable storage medium
CN113076433B (en) * 2021-04-26 2022-05-17 支付宝(杭州)信息技术有限公司 Retrieval method and device for retrieval object with multi-modal information
CN113609285A (en) * 2021-08-09 2021-11-05 福州大学 Multi-mode text summarization system based on door control fusion mechanism
CN115309888B (en) * 2022-08-26 2023-05-30 百度在线网络技术(北京)有限公司 Method and device for generating chart abstract and training method and device for generating model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018149376A1 (en) * 2017-02-17 2018-08-23 杭州海康威视数字技术股份有限公司 Video abstract generation method and device
CN106997387B (en) * 2017-03-28 2019-08-09 中国科学院自动化研究所 Based on the multi-modal automaticabstracting of text-images match
CN109766432A (en) * 2018-07-12 2019-05-17 中国科学院信息工程研究所 A kind of Chinese abstraction generating method and device based on generation confrontation network
CN109508400A (en) * 2018-10-09 2019-03-22 中国科学院自动化研究所 Picture and text abstraction generating method
CN109543512A (en) * 2018-10-09 2019-03-29 中国科学院自动化研究所 The evaluation method of picture and text abstract

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Adversarial Reinforcement Learning for Chinese Text Summarization; Xu H, Cao Y, Jia R, et al.; 《International Conference on Computational Science》; 20181231; full text *
Image caption generation with text-conditional semantic attention; Zhou L, Xu C, Koch P, et al.; 《arXiv preprint arXiv:1606.04621》; 20160912; full text *
Rich feature hierarchies for accurate object detection and semantic segmentation; Ross Girshick; 《2014 IEEE Conference on Computer Vision and Pattern Recognition》; 20140628; full text *
Sequence Generative Adversarial Network for Long Text Summarization; Xu H, Cao Y, Jia R, et al.; 《2018 IEEE 30th International Conference on Tools with Artificial Intelligence》; 20181231; full text *

Also Published As

Publication number Publication date
CN110704606A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN110704606B (en) Generation type abstract generation method based on image-text fusion
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN109165294B (en) Short text classification method based on Bayesian classification
CN109766432B (en) Chinese abstract generation method and device based on generation countermeasure network
CN110287320A (en) A kind of deep learning of combination attention mechanism is classified sentiment analysis model more
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN110096587B (en) Attention mechanism-based LSTM-CNN word embedded fine-grained emotion classification model
CN115203442B (en) Cross-modal deep hash retrieval method, system and medium based on joint attention
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN110765264A (en) Text abstract generation method for enhancing semantic relevance
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN115438154A (en) Chinese automatic speech recognition text restoration method and system based on representation learning
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN115712731A (en) Multi-modal emotion analysis method based on ERNIE and multi-feature fusion
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and Bi-LSTM
CN114742047A (en) Text emotion recognition method based on maximum probability filling and multi-head attention mechanism
CN113377953B (en) Entity fusion and classification method based on PALC-DCA model
CN116932736A (en) Patent recommendation method based on combination of user requirements and inverted list
CN116701996A (en) Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions
Purba et al. A hybrid convolutional long short-term memory (CNN-LSTM) based natural language processing (NLP) model for sentiment analysis of customer product reviews in Bangla
Song et al. A lexical updating algorithm for sentiment analysis on Chinese movie reviews

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant