CN107330444A - An automatic image text annotation method based on a generative adversarial network - Google Patents
An automatic image text annotation method based on a generative adversarial network
- Publication number
- CN107330444A (application CN201710396148.5A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- generation
- image
- discriminator
- generator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Library & Information Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic image text annotation method based on a generative adversarial network, comprising the following steps: a generator produces a fake sentence while a discriminator is built alongside it; the generated sentence and a real sentence are fed in for training until the discriminator can no longer distinguish real sentences from generated ones. The invention remedies the stiff, rigid sentences produced by CNN-RNN automatic image captioning, yielding sentences that are more accurate, natural, and diverse, that can handle the more complex scenes found in reality, and that better match how humans describe images in language, giving the method broader practical applicability.
Description
Technical field
The present invention relates to the field of image sentence annotation, and in particular to an automatic image text annotation method based on a generative adversarial network.
Background art

In recent years, the problem of automatic sentence annotation of images (image captioning) has been widely studied. Because it involves not only recognizing the objects in the image itself but also natural language processing, the main related techniques can be summarized as the following three:

Semantic template filling: this method detects the individual targets in an image and places the class text representing each target into a fixed natural-language generation template, automatically generating a sentence. One variant uses the target-recognition results to form a simple sentence containing three fixed semantic units; some methods also place the relations between the recognized targets into the same template, composing sentences with richer semantics.
Feature-space matching: this method constructs a large number of sentences in advance, projects both the image and the constructed sentences into a high-dimensional feature space, and retrieves the sentence whose features match most closely. Some methods construct multiple kernels and compare the data in each data space by ranking to discover the relations between them; others propose analyzing the noisy titles, tags, or statements that may accompany a picture to provide additional useful information for this feature-space mapping.
CNN-RNN methods: this approach extracts image features with a CNN (convolutional neural network) and feeds them into an RNN (recurrent neural network), training a sentence-generation module with NLP (natural language processing) techniques; the whole pipeline can be trained end to end. Some methods feed the extracted image features directly into the recurrent module, passing them to an LSTM recurrent network to obtain the annotation, with notably good results.
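As a rough, purely illustrative sketch of the CNN-RNN pipeline described above (not the patented method itself): the "CNN" is stubbed as a fixed feature vector, the recurrence is a single tanh layer rather than an LSTM, and the vocabulary, weights, and function names are all hypothetical stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<start>", "a", "dog", "runs", "<end>"]
H, V = 8, len(VOCAB)

W_xh = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden weights
W_ih = rng.normal(scale=0.1, size=(H, H))   # image-feature projection
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(V, H))   # hidden-to-vocab weights

def caption(image_feature, max_len=5):
    """Greedy decoding: initialise the state from the CNN feature, then unroll."""
    h = np.tanh(W_ih @ image_feature)        # state seeded by the image feature
    word = VOCAB.index("<start>")
    out = []
    for _ in range(max_len):
        x = np.eye(V)[word]                  # one-hot of the previous word
        h = np.tanh(W_xh @ x + W_hh @ h)     # recurrent update
        word = int(np.argmax(W_hy @ h))      # greedy argmax over the vocabulary
        if VOCAB[word] == "<end>":
            break
        out.append(VOCAB[word])
    return out

sentence = caption(rng.normal(size=H))
```

With random weights the output is of course meaningless; the point is only the shape of the pipeline: image feature in once, then one word per recurrent step.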
Although these conventional methods solve the annotation problem to some extent, each still has shortcomings:

Semantic template filling: this template-based automatic image text annotation algorithm can construct sentences that fit the template, but in practical applications its expressive power is very weak and the scenes it can handle are limited.

Feature-space matching: this method requires the support of a large corpus of sentences, and in essence it does not generate sentences but matches existing ones, so in practice it cannot cope with the complex scenes found in reality.

CNN-RNN methods: although this approach overcomes the defects of the previous two to some extent, because it is trained by maximum-likelihood estimation the generated annotations stay very close to the sample sentences yet still fall short of authentic language. Compared with human language, the generated sentences lack vivid, natural expression and read as stiff and rigid.
In recent years, generative adversarial networks (GAN, Generative Adversarial Networks) have received great attention from academia and industry, becoming one of the most popular research areas of the past two years. Unlike traditional machine-learning methods, the defining feature of a GAN is its adversarial mechanism, which can be used to model and generate realistic data distributions. GAN models have attracted a large number of researchers and have been extended in many directions, though most existing GAN methods target a single data domain. A GAN is therefore a promising way to solve the stiffness of the sentences generated by CNN-RNN methods.
The content of the invention
It is an object of the invention to overcome the problem above that prior art is present there is provided a kind of based on generation confrontation network
Image autotext mask method, the present invention is summarized based on deep neural network, optical imagery, natural language processing etc.
The automatic sentence mark solution of traditional image, is probed into based on the automatic sentence mark side of generation confrontation network research designed image
Method and its application.
To achieve the above technical purpose and technical effect, the present invention is realized through the following technical solution:

An automatic image text annotation method based on a generative adversarial network, comprising the following steps:

S101: designate the CNN multi-label classification module and the LSTM sentence-generation module as the generator, and the LSTM sentence feature-extraction module together with a classifier as the discriminator;

S102: the CNN multi-label classification module extracts information from the picture, and the LSTM sentence-generation module then generates a sentence; this generated sentence is the fake sentence produced by the generator;

S103: the generated sentence and a real sentence are fed in for training; the LSTM sentence feature-extraction module is trained on the generated and real sentences until the discriminator can no longer distinguish real sentences from generated ones.
Further, S103 also includes a procedure in which the discriminator judges whether the sentence generated by the generator describes the picture, comprising the following steps:

S201: denote the sentence produced by the generator by S_fake, the real sentence by S_real, a training picture by I_match, and an introduced non-matching picture by I_mismatch;

S202: extract features from the generated sentence S_fake and the real sentence S_real with the LSTM sentence feature-extraction module, and combine the extracted features with the Match image features and the Mismatch image features to obtain a set of sentence-feature combinations;

S203: the classifier performs real/fake discrimination on each feature combination in the set, judging whether the generated sentence belongs to the training image.
Further, in S203, when judging whether the generated sentence belongs to the training image, the classifier considers the following combinations:

(S_fake, I_mismatch) must not pass the discriminator;

(S_fake, I_match) half passes the discriminator, yielding score s_f;

(S_real, I_mismatch) half passes the discriminator, yielding score s_w;

(S_real, I_match) passes the discriminator, yielding score s_r.
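A minimal sketch of how these four (sentence, image) combinations could be scored, assuming a logistic classifier over concatenated sentence and image features. Every name, stub, and target value here is an illustrative assumption in the style of matching-aware discriminators, not the patent's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16

def lstm_sentence_features(sentence_vec):
    # Stub for the LSTM sentence feature-extraction module.
    return np.tanh(sentence_vec)

def classifier(sent_feat, img_feat, w):
    # Logistic score on the concatenated (sentence, image) features.
    z = np.concatenate([sent_feat, img_feat]) @ w
    return 1.0 / (1.0 + np.exp(-z))

w = rng.normal(scale=0.1, size=2 * D)
s_fake, s_real = rng.normal(size=D), rng.normal(size=D)
i_match, i_mismatch = rng.normal(size=D), rng.normal(size=D)

# Target scores implied by the list above: (S_real, I_match) should pass (-> 1),
# (S_fake, I_mismatch) should fail (-> 0), and the two mixed pairs sit in between.
combos = {
    ("S_fake", "I_mismatch"): (s_fake, i_mismatch, 0.0),
    ("S_fake", "I_match"):    (s_fake, i_match,    0.5),   # score s_f
    ("S_real", "I_mismatch"): (s_real, i_mismatch, 0.5),   # score s_w
    ("S_real", "I_match"):    (s_real, i_match,    1.0),   # score s_r
}
scores = {k: classifier(lstm_sentence_features(s), i, w)
          for k, (s, i, _) in combos.items()}
```

The mixed pairs are what force the discriminator to check sentence-image matching rather than sentence realism alone.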
Further, the discriminator learns through training to recognize real sentences and to recognize whether a real sentence matches the picture; the loss function of the discriminator is expressed as:
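The discriminator's loss formula appeared as an image in the original filing and is missing from this text. A plausible reconstruction (an assumption, not the filed formula), modeled on the matching-aware discriminator losses of conditional text/image GANs and consistent with the score combinations s_r, s_w, s_f above, is:

```latex
L_D = -\,\mathbb{E}\!\left[\log D(S_{real}, I_{match})\right]
      \;-\; \tfrac{1}{2}\,\mathbb{E}\!\left[\log\!\left(1 - D(S_{real}, I_{mismatch})\right)
      + \log\!\left(1 - D(S_{fake}, I_{match})\right)\right]
```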
Further, the generator uses the multi-label automatic image captioning model to generate sentences approaching real sentences; the loss function of the generator is expressed as:
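The generator's loss formula is likewise missing (it was an image in the original filing). Under the same assumption, a standard reconstruction that rewards the generator for producing sentences the discriminator accepts as matching the picture is:

```latex
L_G = -\,\mathbb{E}\!\left[\log D(S_{fake}, I_{match})\right]
```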
The beneficial effects of the invention are as follows:

1. The method of the present invention overcomes the insufficient expressive power of traditional automatic image captioning methods by constructing an automatic image text annotation model based on a generative adversarial network. The model can be applied in many areas of deep learning: helping people with disabilities understand their surroundings, describing web pictures effectively for convenient search, rapidly generating captions for news pictures, and so on.

2. By incorporating the GAN structure, the invention remedies the stiff, rigid sentences produced by CNN-RNN automatic image captioning, yielding sentences that are more accurate, natural, and diverse, that can handle the more complex scenes found in reality, and that better match how humans describe images in language, giving the method broader practical applicability.
The above is only an overview of the technical solution of the present invention. To make the technical means of the invention easier to understand, so that it can be practiced according to the content of the specification, preferred embodiments of the invention are described in detail below together with the accompanying drawings.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is the structure of a conventional generative adversarial network;

Fig. 2 is the structure of an LSTM cell;

Fig. 3 is the structure of the automatic image captioning model of the present invention based on a generative adversarial network;

Fig. 4 is the structure of the improved discriminator.
Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The present embodiment introduces a generative adversarial network on top of the traditional CNN-RNN approach, proposing an automatic image captioning algorithm based on a generative adversarial network that overcomes the problems of traditional automatic image captioning.
As shown in Fig. 1, a traditional generative adversarial network consists of a generator G and a discriminator D. The generator G receives a noise vector z as input and produces synthetic data G(z). The discriminator D takes either real data x or generated data G(z) as input and judges whether its input comes from the real data distribution p_data(x). Training the adversarial model maximizes the accuracy with which the discriminator D distinguishes real data from generated data, while simultaneously training the generator G to minimize that accuracy. This objective is reached by solving the following saddle-point problem.
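The saddle-point formula itself is not reproduced in this text (it was an image in the original filing); the standard GAN objective of Goodfellow et al., which the surrounding description matches term for term, reads:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
\tag{1}
```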
The model can be viewed as a zero-sum game. In actual training one usually wants the discriminator to be the stronger player, since it then supervises the generator effectively; if the discriminator is weak and judges generated fake data to be real, the overall result degrades. During training, the discriminator is therefore typically updated several times before each generator update.
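The alternating schedule just described ("update the discriminator several times, then the generator once") can be sketched as follows. The update bodies are stubs that only count calls; in a real system each would take a gradient step on the discriminator or generator loss, and all names here are illustrative assumptions rather than the patent's implementation.

```python
def train_gan(n_iterations=100, k=5):
    """Alternate k discriminator updates with one generator update per iteration."""
    d_steps = g_steps = 0
    for _ in range(n_iterations):
        for _ in range(k):          # k discriminator updates...
            d_steps += 1            # (stub for a gradient step on the D loss)
        g_steps += 1                # ...then one generator update (stub for the G loss)
    return d_steps, g_steps

d_steps, g_steps = train_gan()
```

The ratio k is a tuning knob: a larger k keeps the discriminator ahead of the generator, which is the regime the paragraph above argues for.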
LSTM is a recurrent neural network with a special structure, shown in Fig. 2. Its structure contains three kinds of gates: a forget gate, an input gate, and an output gate. The computation of a whole LSTM unit is expressed by formulas (2)-(8):

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)   (2)

f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)   (3)

o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)   (4)

g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (5)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (6)

h_t = o_t ⊙ tanh(c_t)   (7)

p_{t+1} = Softmax(h_t)   (8)
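Formulas (2)-(8) can be transcribed directly into a runnable sketch. The weight shapes, initialisation, and function names below are illustrative assumptions; only the gate equations themselves follow the text (note that in a full captioner the softmax of (8) is usually applied to a learned projection of h_t rather than to h_t itself).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])   # (2) input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])   # (3) forget gate
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])   # (4) output gate
    g_t = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])   # (5) candidate cell
    c_t = f_t * c_prev + i_t * g_t                             # (6) cell state
    h_t = o_t * np.tanh(c_t)                                   # (7) hidden state
    p_next = np.exp(h_t) / np.exp(h_t).sum()                   # (8) softmax
    return h_t, c_t, p_next

rng = np.random.default_rng(0)
X, H = 4, 3
W = {k: rng.normal(scale=0.1, size=(H, X if k.startswith("x") else H))
     for k in ["xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc"]}
b = {k: np.zeros(H) for k in ["i", "f", "o", "c"]}
h, c, p = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, b)
```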
In the present embodiment, as shown in Fig. 3, the automatic image text annotation method based on a generative adversarial network comprises the following steps:

S101: designate the CNN multi-label classification module and the LSTM sentence-generation module as the generator, and the LSTM sentence feature-extraction module together with a classifier as the discriminator;

S102: the CNN multi-label classification module extracts information from the picture, and the LSTM sentence-generation module then generates a sentence; this generated sentence is the fake sentence produced by the generator;

S103: the generated sentence and a real sentence are fed in for training; the LSTM sentence feature-extraction module is trained on the generated and real sentences until the discriminator can no longer distinguish real sentences from generated ones.
Specifically, as shown in Fig. 4, S103 also includes a procedure in which the discriminator judges whether the sentence generated by the generator describes the picture, comprising the following steps:

S201: denote the sentence produced by the generator by S_fake, the real sentence by S_real, a training picture by I_match, and an introduced non-matching picture by I_mismatch;

S202: extract features from the generated sentence S_fake and the real sentence S_real with the LSTM sentence feature-extraction module, and combine the extracted features with the Match image features and the Mismatch image features to obtain a set of sentence-feature combinations;

S203: the classifier performs real/fake discrimination on each feature combination in the set, judging whether the generated sentence belongs to the training image.
Further, in S203, when judging whether the generated sentence belongs to the training image, the classifier considers the following combinations:

(S_fake, I_mismatch) must not pass the discriminator;

(S_fake, I_match) half passes the discriminator, yielding score s_f;

(S_real, I_mismatch) half passes the discriminator, yielding score s_w;

(S_real, I_match) passes the discriminator, yielding score s_r.
Further, the discriminator learns through training to recognize real sentences and to recognize whether a real sentence matches the picture; the loss function of the discriminator is expressed as:

Further, the generator uses the multi-label automatic image captioning model to generate sentences approaching real sentences; the loss function of the generator is expressed as:
In the present embodiment, training with the GAN procedure yields a strong generator and discriminator, thereby improving the quality of automatic image captioning.
The principle of the automatic image text annotation method based on a generative adversarial network in this embodiment is as follows. A traditional generative adversarial network is characterized by its ability to generate high-quality data. Taking the original multi-label automatic image captioning model as the generator, the generator produces fake sentences while a discriminator is built alongside it; the generated sentences and real sentences are fed in for training until the discriminator can no longer distinguish real sentences from generated ones. The discriminator judges whether a sentence produced by the generator belongs to the original data distribution, but cannot by itself judge whether the sentence describes the picture. Therefore the sentence produced by the generator is denoted S_fake, the real sentence S_real, a training picture I_match, and an introduced non-matching picture I_mismatch. Features are extracted from the generated sentence S_fake and the real sentence S_real by the LSTM sentence feature-extraction module and combined with the Match and Mismatch image features to obtain a set of sentence-feature combinations; the classifier then performs real/fake discrimination on these combinations, judging whether the generated sentence belongs to the training image.
The foregoing description of the disclosed embodiments enables those skilled in the art to realize or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the invention. The invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (5)
1. An automatic image text annotation method based on a generative adversarial network, characterized in that it comprises the following steps:
S101: designate the CNN multi-label classification module and the LSTM sentence-generation module as the generator, and the LSTM sentence feature-extraction module together with a classifier as the discriminator;
S102: the CNN multi-label classification module extracts information from the picture, and the LSTM sentence-generation module then generates a sentence; this generated sentence is the fake sentence produced by the generator;
S103: the generated sentence and a real sentence are fed in for training; the LSTM sentence feature-extraction module is trained on the generated and real sentences until the discriminator can no longer distinguish real sentences from generated ones.
2. The automatic image text annotation method based on a generative adversarial network according to claim 1, characterized in that S103 also includes a procedure in which the discriminator judges whether the sentence generated by the generator describes the picture, comprising the following steps:
S201: denote the sentence produced by the generator by S_fake, the real sentence by S_real, a training picture by I_match, and an introduced non-matching picture by I_mismatch;
S202: extract features from the generated sentence S_fake and the real sentence S_real with the LSTM sentence feature-extraction module, and combine the extracted features with the Match image features and the Mismatch image features to obtain a set of sentence-feature combinations;
S203: the classifier performs real/fake discrimination on each feature combination in the set, judging whether the generated sentence belongs to the training image.
3. The automatic image text annotation method based on a generative adversarial network according to claim 2, characterized in that in S203, when judging whether the generated sentence belongs to the training image, the classifier considers the following combinations:
(S_fake, I_mismatch) must not pass the discriminator;
(S_fake, I_match) half passes the discriminator, yielding score s_f;
(S_real, I_mismatch) half passes the discriminator, yielding score s_w;
(S_real, I_match) passes the discriminator, yielding score s_r.
4. The automatic image text annotation method based on a generative adversarial network according to any one of claims 1-3, characterized in that the discriminator learns through training to recognize real sentences and to recognize whether a real sentence matches the picture, the loss function of the discriminator being expressed as:
5. The automatic image text annotation method based on a generative adversarial network according to any one of claims 1-3, characterized in that the generator uses the multi-label automatic image captioning model to generate sentences approaching real sentences, the loss function of the generator being expressed as:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710396148.5A CN107330444A (en) | 2017-05-27 | 2017-05-27 | A kind of image autotext mask method based on generation confrontation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107330444A true CN107330444A (en) | 2017-11-07 |
Family
ID=60193180
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710396148.5A Pending CN107330444A (en) | 2017-05-27 | 2017-05-27 | A kind of image autotext mask method based on generation confrontation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330444A (en) |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107944358A (en) * | 2017-11-14 | 2018-04-20 | 华南理工大学 | A kind of human face generating method based on depth convolution confrontation network model |
CN107968962A (en) * | 2017-12-12 | 2018-04-27 | 华中科技大学 | A kind of video generation method of the non-conterminous image of two frames based on deep learning |
KR101894278B1 (en) * | 2018-01-18 | 2018-09-04 | 주식회사 뷰노 | Method for reconstructing a series of slice images and apparatus using the same |
CN108520282A (en) * | 2018-04-13 | 2018-09-11 | 湘潭大学 | A kind of sorting technique based on Triple-GAN |
CN108664924A (en) * | 2018-05-10 | 2018-10-16 | 东南大学 | A kind of multi-tag object identification method based on convolutional neural networks |
CN108710892A (en) * | 2018-04-04 | 2018-10-26 | 浙江工业大学 | Synergetic immunity defence method towards a variety of confrontation picture attacks |
CN109242090A (en) * | 2018-08-28 | 2019-01-18 | 电子科技大学 | A kind of video presentation and description consistency discrimination method based on GAN network |
CN109255047A (en) * | 2018-07-18 | 2019-01-22 | 西安电子科技大学 | Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve |
CN109614480A (en) * | 2018-11-26 | 2019-04-12 | 武汉大学 | A kind of generation method and device of the autoabstract based on production confrontation network |
CN109635273A (en) * | 2018-10-25 | 2019-04-16 | 平安科技(深圳)有限公司 | Text key word extracting method, device, equipment and storage medium |
CN109685116A (en) * | 2018-11-30 | 2019-04-26 | 腾讯科技(深圳)有限公司 | Description information of image generation method and device and electronic device |
CN109697694A (en) * | 2018-12-07 | 2019-04-30 | 山东科技大学 | The generation method of high-resolution picture based on bull attention mechanism |
CN109887494A (en) * | 2017-12-01 | 2019-06-14 | 腾讯科技(深圳)有限公司 | The method and apparatus of reconstructed speech signal |
CN109918509A (en) * | 2019-03-12 | 2019-06-21 | 黑龙江世纪精彩科技有限公司 | Scene generating method and scene based on information extraction generate the storage medium of system |
CN109933677A (en) * | 2019-02-14 | 2019-06-25 | 厦门一品威客网络科技股份有限公司 | Image generating method and image generation system |
CN109978550A (en) * | 2019-03-12 | 2019-07-05 | 同济大学 | A kind of credible electronic transaction clearance mechanism based on generation confrontation network |
CN110085215A (en) * | 2018-01-23 | 2019-08-02 | 中国科学院声学研究所 | A kind of language model data Enhancement Method based on generation confrontation network |
WO2019179100A1 (en) * | 2018-03-20 | 2019-09-26 | 苏州大学张家港工业技术研究院 | Medical text generation method based on generative adversarial network technology |
CN110533074A (en) * | 2019-07-30 | 2019-12-03 | 华南理工大学 | A kind of picture classification automatic marking method and system based on dual-depth neural network |
CN110533588A (en) * | 2019-07-16 | 2019-12-03 | 中国农业大学 | Based on the root system image repair method for generating confrontation network |
WO2019237860A1 (en) * | 2018-06-15 | 2019-12-19 | 腾讯科技(深圳)有限公司 | Image annotation method and device |
CN110889469A (en) * | 2019-09-19 | 2020-03-17 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
CN111143617A (en) * | 2019-12-12 | 2020-05-12 | 浙江大学 | Automatic generation method and system for picture or video text description |
CN111488473A (en) * | 2019-01-28 | 2020-08-04 | 北京京东尚科信息技术有限公司 | Picture description generation method and device and computer readable storage medium |
RU2735148C1 (en) * | 2019-12-09 | 2020-10-28 | Самсунг Электроникс Ко., Лтд. | Training gan (generative adversarial networks) to create pixel-by-pixel annotation |
CN112292695A (en) * | 2018-06-20 | 2021-01-29 | 西门子工业软件公司 | Method for generating a test data set, method for testing, method for operating a system, device, control system, computer program product, computer-readable medium, generation and application |
CN112347742A (en) * | 2020-10-29 | 2021-02-09 | 青岛科技大学 | Method for generating document image set based on deep learning |
CN112818159A (en) * | 2021-02-24 | 2021-05-18 | 上海交通大学 | Image description text generation method based on generation countermeasure network |
CN113077013A (en) * | 2021-04-28 | 2021-07-06 | 上海联麓半导体技术有限公司 | High-dimensional data fault anomaly detection method and system based on generation countermeasure network |
CN114241263A (en) * | 2021-12-17 | 2022-03-25 | 电子科技大学 | Radar interference semi-supervised open set identification system based on generation countermeasure network |
US11514694B2 (en) | 2019-09-20 | 2022-11-29 | Samsung Electronics Co., Ltd. | Teaching GAN (generative adversarial networks) to generate per-pixel annotation |
CN116795972A (en) * | 2023-08-11 | 2023-09-22 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170150235A1 (en) * | 2015-11-20 | 2017-05-25 | Microsoft Technology Licensing, Llc | Jointly Modeling Embedding and Translation to Bridge Video and Language |
Non-Patent Citations (2)
Title |
---|
BO DAI等: "Towards Diverse and Natural Image Descriptions via a Conditional GAN", 《ARXIV》 * |
ORIOL VINYALS: "Show and Tell: A Neural Image Caption Generator", 《CVPR 2015》 * |
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107944358A (en) * | 2017-11-14 | 2018-04-20 | 华南理工大学 | A kind of human face generating method based on depth convolution confrontation network model |
US11482237B2 (en) | 2017-12-01 | 2022-10-25 | Tencent Technology (Shenzhen) Company Limited | Method and terminal for reconstructing speech signal, and computer storage medium |
CN109887494A (en) * | 2017-12-01 | 2019-06-14 | 腾讯科技(深圳)有限公司 | The method and apparatus of reconstructed speech signal |
CN107968962A (en) * | 2017-12-12 | 2018-04-27 | 华中科技大学 | A kind of video generation method of the non-conterminous image of two frames based on deep learning |
KR101894278B1 (en) * | 2018-01-18 | 2018-09-04 | 주식회사 뷰노 | Method for reconstructing a series of slice images and apparatus using the same |
US11816833B2 (en) | 2018-01-18 | 2023-11-14 | Vuno Inc. | Method for reconstructing series of slice images and apparatus using same |
CN110085215A (en) * | 2018-01-23 | 2019-08-02 | 中国科学院声学研究所 | A kind of language model data Enhancement Method based on generation confrontation network |
CN110085215B (en) * | 2018-01-23 | 2021-06-08 | 中国科学院声学研究所 | Language model data enhancement method based on generation countermeasure network |
WO2019179100A1 (en) * | 2018-03-20 | 2019-09-26 | 苏州大学张家港工业技术研究院 | Medical text generation method based on generative adversarial network technology |
CN108710892B (en) * | 2018-04-04 | 2020-09-01 | 浙江工业大学 | Cooperative immune defense method for multiple anti-picture attacks |
CN108710892A (en) * | 2018-04-04 | 2018-10-26 | 浙江工业大学 | Synergetic immunity defence method towards a variety of confrontation picture attacks |
CN108520282B (en) * | 2018-04-13 | 2020-04-03 | 湘潭大学 | Triple-GAN-based classification method |
CN108520282A (en) * | 2018-04-13 | 2018-09-11 | 湘潭大学 | A kind of sorting technique based on Triple-GAN |
CN108664924B (en) * | 2018-05-10 | 2022-07-08 | 东南大学 | Multi-label object identification method based on convolutional neural network |
CN108664924A (en) * | 2018-05-10 | 2018-10-16 | 东南大学 | A kind of multi-tag object identification method based on convolutional neural networks |
US11494595B2 (en) | 2018-06-15 | 2022-11-08 | Tencent Technology (Shenzhen) Company Limited | Method , apparatus, and storage medium for annotating image |
WO2019237860A1 (en) * | 2018-06-15 | 2019-12-19 | 腾讯科技(深圳)有限公司 | Image annotation method and device |
CN112292695A (en) * | 2018-06-20 | 2021-01-29 | 西门子工业软件公司 | Method for generating a test data set, method for testing, method for operating a system, device, control system, computer program product, computer-readable medium, generation and application |
CN109255047A (en) * | 2018-07-18 | 2019-01-22 | 西安电子科技大学 | Image-text mutual retrieval method based on complementary semantic alignment and symmetric retrieval |
CN109242090A (en) * | 2018-08-28 | 2019-01-18 | 电子科技大学 | Video description and description-consistency discrimination method based on GAN network |
CN109635273A (en) * | 2018-10-25 | 2019-04-16 | 平安科技(深圳)有限公司 | Text keyword extraction method, device, equipment and storage medium |
CN109614480A (en) * | 2018-11-26 | 2019-04-12 | 武汉大学 | Automatic summarization method and device based on generative adversarial network |
CN109685116B (en) * | 2018-11-30 | 2022-12-30 | 腾讯科技(深圳)有限公司 | Image description information generation method and device and electronic device |
US11783199B2 (en) * | 2018-11-30 | 2023-10-10 | Tencent Technology (Shenzhen) Company Limited | Image description information generation method and apparatus, and electronic device |
CN109685116A (en) * | 2018-11-30 | 2019-04-26 | 腾讯科技(深圳)有限公司 | Image description information generation method and device, and electronic device |
WO2020108165A1 (en) * | 2018-11-30 | 2020-06-04 | 腾讯科技(深圳)有限公司 | Image description information generation method and device, and electronic device |
US20210042579A1 (en) * | 2018-11-30 | 2021-02-11 | Tencent Technology (Shenzhen) Company Limited | Image description information generation method and apparatus, and electronic device |
CN109697694B (en) * | 2018-12-07 | 2023-04-07 | 山东科技大学 | Method for generating high-resolution picture based on multi-head attention mechanism |
CN109697694A (en) * | 2018-12-07 | 2019-04-30 | 山东科技大学 | High-resolution picture generation method based on multi-head attention mechanism |
CN111488473A (en) * | 2019-01-28 | 2020-08-04 | 北京京东尚科信息技术有限公司 | Picture description generation method and device and computer readable storage medium |
CN111488473B (en) * | 2019-01-28 | 2023-11-07 | 北京京东尚科信息技术有限公司 | Picture description generation method, device and computer readable storage medium |
CN109933677A (en) * | 2019-02-14 | 2019-06-25 | 厦门一品威客网络科技股份有限公司 | Image generation method and image generation system |
CN109918509A (en) * | 2019-03-12 | 2019-06-21 | 黑龙江世纪精彩科技有限公司 | Scene generation method based on information extraction, and storage medium of scene generation system |
CN109978550A (en) * | 2019-03-12 | 2019-07-05 | 同济大学 | Trusted electronic transaction clearing mechanism based on generative adversarial network |
CN110533588A (en) * | 2019-07-16 | 2019-12-03 | 中国农业大学 | Root system image restoration method based on generative adversarial network |
CN110533588B (en) * | 2019-07-16 | 2021-09-21 | 中国农业大学 | Root system image restoration method based on generative adversarial network |
CN110533074B (en) * | 2019-07-30 | 2022-03-29 | 华南理工大学 | Automatic image category labeling method and system based on double-depth neural network |
CN110533074A (en) * | 2019-07-30 | 2019-12-03 | 华南理工大学 | Automatic image category labeling method and system based on dual deep neural networks |
CN110889469B (en) * | 2019-09-19 | 2023-07-21 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110889469A (en) * | 2019-09-19 | 2020-03-17 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
US11514694B2 (en) | 2019-09-20 | 2022-11-29 | Samsung Electronics Co., Ltd. | Teaching GAN (generative adversarial networks) to generate per-pixel annotation |
RU2735148C1 (en) * | 2019-12-09 | 2020-10-28 | Самсунг Электроникс Ко., Лтд. | Training gan (generative adversarial networks) to create pixel-by-pixel annotation |
CN111143617A (en) * | 2019-12-12 | 2020-05-12 | 浙江大学 | Automatic generation method and system for picture or video text description |
CN112347742B (en) * | 2020-10-29 | 2022-05-31 | 青岛科技大学 | Method for generating document image set based on deep learning |
CN112347742A (en) * | 2020-10-29 | 2021-02-09 | 青岛科技大学 | Method for generating document image set based on deep learning |
CN112818159A (en) * | 2021-02-24 | 2021-05-18 | 上海交通大学 | Image description text generation method based on generative adversarial network |
CN113077013A (en) * | 2021-04-28 | 2021-07-06 | 上海联麓半导体技术有限公司 | High-dimensional data fault anomaly detection method and system based on generative adversarial network |
CN114241263A (en) * | 2021-12-17 | 2022-03-25 | 电子科技大学 | Radar interference semi-supervised open-set recognition system based on generative adversarial network |
CN114241263B (en) * | 2021-12-17 | 2023-05-02 | 电子科技大学 | Radar interference semi-supervised open-set recognition system based on generative adversarial network |
CN116795972A (en) * | 2023-08-11 | 2023-09-22 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
CN116795972B (en) * | 2023-08-11 | 2024-01-09 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107330444A (en) | Automatic image text annotation method based on generative adversarial network | |
CN106778506A (en) | Facial expression recognition method fusing depth images and multi-channel features | |
CN110443231A (en) | Single-hand fingertip point-reading character recognition method and system based on artificial intelligence | |
CN110175251A (en) | Zero-shot sketch retrieval method based on semantic adversarial network | |
CN107506722A (en) | Facial emotion recognition method based on deep sparse convolutional neural network | |
CN106202044A (en) | Entity relation extraction method based on deep neural network | |
CN108416065A (en) | Image-to-sentence description generation system and method based on hierarchical neural network | |
CN108984530A (en) | Detection method and detection system for sensitive network content | |
CN110516539A (en) | Remote sensing image building extraction method, system, storage medium and device based on adversarial network | |
CN110458003B (en) | Facial expression action unit adversarial synthesis method based on local attention model | |
CN107392147A (en) | Image-to-sentence conversion method based on improved generative adversarial network | |
CN108182409A (en) | Liveness detection method, device, equipment and storage medium | |
CN110009057A (en) | Graphical verification code recognition method based on deep learning | |
CN106875007A (en) | End-to-end convolutional long short-term memory deep neural network for voice fraud detection | |
CN107145514B (en) | Chinese sentence pattern classification method based on decision tree and SVM mixed model | |
CN110532912A (en) | Sign language translation implementation method and device | |
CN108765383A (en) | Video description method based on deep transfer learning | |
CN112541529A (en) | Expression and posture fusion bimodal teaching evaluation method, device and storage medium | |
CN109934204A (en) | Facial expression recognition method based on convolutional neural networks | |
CN107066979A (en) | Human motion recognition method based on depth information and multi-dimensional convolutional neural networks | |
CN113642621A (en) | Zero-shot image classification method based on generative adversarial network | |
CN109711356A (en) | Facial expression recognition method and system | |
CN112069993B (en) | Dense face detection method and system based on facial-feature mask constraints, and storage medium | |
CN109670559A (en) | Recognition method, device, equipment and storage medium for handwritten Chinese characters | |
CN109871898A (en) | Method for generating deposit training samples using generative adversarial network | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171107 |
|