CN106778926A - An image captioning method based on a visual attention model - Google Patents
An image captioning method based on a visual attention model
- Publication number
- CN106778926A (application CN201611207945.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- sentinel
- vector
- model
- vision
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
An image captioning method based on a visual attention model is proposed. Its main components are: data input, preprocessing, an adaptive attention model, and image caption output. The process is as follows: first, an image dataset is used whose images depict people performing various actions and contain multiple objects in the context of complex scenes, each image being paired with 5 manually annotated captions; next, preprocessing truncates overlong captions, and the dataset is fed into an encoder that extracts spatial image features; finally, the features are fed to a trained adaptive spatial attention model gated by a "visual sentinel", allowing a machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image. In image recognition, the present invention performs best compared with template-based methods; it can also help visually impaired users and make it easy for users to organize and navigate large amounts of typically unstructured visual data.
Description
Technical field
The present invention relates to the field of image recognition, and in particular to an image captioning method based on a visual attention model.
Background
With the rapid development of science and technology, attention-based neural encoder-decoder frameworks have been widely used for image captioning in the field of image recognition, that is, for automatically recognizing image content and describing it in natural language. However, the decoder may need little or even no visual information from the image to predict non-visual words, and words that appear visual can often be predicted reliably from the language model alone. An image captioning method based on a visual attention model can solve the problem of low-quality automatically generated image captions, because it can automatically determine when to rely on the visual signal and when to rely only on the language model.
The present invention proposes an image captioning method based on a visual attention model. It first uses an image dataset whose images depict people performing various actions and contain multiple objects in the context of complex scenes, each image being paired with 5 manually annotated captions. Preprocessing then truncates overlong captions, and the dataset is fed into an encoder that extracts spatial image features. Finally, the features are fed to a trained adaptive spatial attention model gated by a "visual sentinel", allowing a machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image. In image recognition, the present invention performs best compared with template-based methods; it can also help visually impaired users and make it easy for users to organize and navigate large amounts of typically unstructured visual data.
Summary of the invention
In view of the low quality of automatically generated image captions, it is an object of the present invention to provide an image captioning method based on a visual attention model.
To solve the above problem, the present invention provides an image captioning method based on a visual attention model, whose main components are:
(1) data input;
(2) preprocessing;
(3) an adaptive attention model;
(4) image caption output.
Wherein, the image captioning method based on a visual attention model includes a new spatial attention model for extracting spatial image features, and an adaptive attention mechanism that introduces a new long short-term memory (LSTM) extension producing an additional "visual sentinel" vector rather than a single hidden state. The "visual sentinel" is an additional latent representation of the decoder's memory and provides the decoder with a fallback option. A new sentinel gate is further derived from the "visual sentinel"; it decides how much new information the decoder obtains from the image, as opposed to relying on the "visual sentinel", when generating the next word.
Wherein, the data input employs a scene-object dataset. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
Wherein, the preprocessing truncates captions in the scene-object dataset that are longer than 18 words; a vocabulary is then built from the words that occur at least 5 times in the training set (at least 3 times for a smaller dataset).
Wherein, the adaptive attention model includes an encoder, a spatial attention model, a sentinel gate, and a decoder. It can automatically determine when to rely on the visual signal and when to rely only on the language model; when the visual signal is relied upon, the model also determines which region of the image to attend to.
Further, the encoder obtains a representation of the image using a convolutional neural network; the spatial feature output of the last convolutional layer of a ResNet is used, with size 2048 × 7 × 7. We use A = {a_1, …, a_k}, a_i ∈ R^2048, to denote the spatial convolutional neural network features at each of the k grid positions. The global image feature is obtained as follows:
a_g = (1/k) Σ_{i=1}^{k} a_i   (1)
where a_g is the global image feature. For modelling convenience, we use a single-layer perceptron with a rectifier activation function to convert the image feature vectors into new vectors of dimension d:
v_i = ReLU(W_a a_i)   (2)
v_g = ReLU(W_b a_g)   (3)
where W_a and W_b are weight parameters; the converted spatial image features form V = [v_1, …, v_k].
Further, the spatial attention model is used to compute a context vector c_t, defined as:
c_t = g(V, h_t)   (4)
where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d, are the spatial image features, each being a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t.
Given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, we feed them through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:
z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)   (5)
α_t = softmax(z_t)   (6)
where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are parameters to be learnt, and α_t ∈ R^k is the attention weight over the features in V. Based on the attention distribution, the context vector c_t can be obtained by:
c_t = Σ_{i=1}^{k} α_{ti} v_{ti}   (7)
where c_t and h_t are combined to predict the next word y_{t+1} via log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
Further, for the sentinel gate, the LSTM is extended to obtain the "visual sentinel" vector s_t:
g_t = σ(W_x x_t + W_h h_{t−1})   (8)
s_t = g_t ⊙ tanh(m_t)   (9)
where W_x and W_h are weight parameters to be learnt, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation.
Based on the "visual sentinel", we propose an adaptive attention model to compute a new context vector ĉ_t, modelled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector. The mixture model is defined as:
ĉ_t = β_t s_t + (1 − β_t) c_t   (10)
where β_t is the new sentinel gate at time t. In our mixture model β_t ranges over [0, 1]; a value of 1 means that only the "visual sentinel" information is used, and 0 means that only the spatial image information is used, when generating the next word.
To compute the new sentinel gate β_t, we modify the spatial attention component. In particular, we add an extra element to z_t, the vector of attention scores defined in Equation 5; this element indicates how much "attention" the network places on the sentinel (as opposed to the image features). Adding this extra element converts Equation 6 into:
α̂_t = softmax([z_t; w_h^T tanh(W_s s_t + W_g h_t)])   (11)
where [·; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in Equation 5. α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector, and we interpret its last element as the gate value: β_t = α̂_t[k + 1]. The probability over the vocabulary of possible words at time t can then be computed as:
p_t = softmax(W_p(ĉ_t + h_t))   (12)
where W_p is a weight parameter to be learnt. This formula encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at every time step.
Further, the decoder uses a structure based on a recurrent neural network: the word embedding vector w_t is concatenated with the global image feature vector v_g to obtain the input vector x_t = [w_t; v_g], and single-layer neural networks transform the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors of dimension d.
Wherein, for the image caption output, the extracted spatial image features are fed to the trained adaptive spatial attention model gated by the "visual sentinel", allowing the machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image.
Brief description of the drawings
Fig. 1 is the system flow chart of the image captioning method based on a visual attention model of the present invention.
Fig. 2 shows the scene-object dataset of the image captioning method based on a visual attention model of the present invention.
Fig. 3 shows the model framework of the image captioning method based on a visual attention model of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the system flow chart of the image captioning method based on a visual attention model of the present invention. It mainly comprises data input, preprocessing, the adaptive attention model, and image caption output.
Wherein, the data input employs a scene-object dataset. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
Wherein, the preprocessing truncates captions in the scene-object dataset that are longer than 18 words; a vocabulary is then built from the words that occur at least 5 times in the training set (at least 3 times for a smaller dataset).
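As an illustration only, the following is a minimal Python sketch of this preprocessing step, assuming captions are already tokenised into word lists; the thresholds mirror the values stated above, and all function and variable names are illustrative rather than part of the claimed method.

```python
from collections import Counter

MAX_LEN = 18     # captions longer than this are truncated
MIN_COUNT = 5    # minimum training-set frequency for a word to enter the vocabulary

def preprocess(captions):
    """captions: list of token lists from the training split (assumed format)."""
    truncated = [tokens[:MAX_LEN] for tokens in captions]
    counts = Counter(tok for tokens in truncated for tok in tokens)
    # Words below the frequency threshold map to a shared <unk> token.
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += sorted(w for w, c in counts.items() if c >= MIN_COUNT)
    word2id = {w: i for i, w in enumerate(vocab)}
    ids = [[word2id.get(tok, word2id["<unk>"]) for tok in tokens] for tokens in truncated]
    return ids, vocab, word2id
```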
Wherein, the adaptive attention model includes an encoder, a spatial attention model, a sentinel gate, and a decoder. It can automatically determine when to rely on the visual signal and when to rely only on the language model; when the visual signal is relied upon, the model also determines which region of the image to attend to.
Further, the encoder obtains a representation of the image using a convolutional neural network; the spatial feature output of the last convolutional layer of a ResNet is used, with size 2048 × 7 × 7. We use A = {a_1, …, a_k}, a_i ∈ R^2048, to denote the spatial convolutional neural network features at each of the k grid positions. The global image feature is obtained as follows:
a_g = (1/k) Σ_{i=1}^{k} a_i   (1)
where a_g is the global image feature. For modelling convenience, we use a single-layer perceptron with a rectifier activation function to convert the image feature vectors into new vectors of dimension d:
v_i = ReLU(W_a a_i)   (2)
v_g = ReLU(W_b a_g)   (3)
where W_a and W_b are weight parameters; the converted spatial image features form V = [v_1, …, v_k].
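By way of example, a PyTorch sketch of one way such an encoder could be realised follows, assuming a torchvision ResNet whose last convolutional block yields a 2048 × 7 × 7 feature map (so k = 49) for a 224 × 224 input; the class and parameter names are illustrative assumptions, not a definitive implementation of the invention.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        resnet = models.resnet152(weights=None)   # backbone choice is an assumption
        # Keep everything up to and including the last convolutional block.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj_local = nn.Linear(2048, d)      # W_a in eq. (2)
        self.proj_global = nn.Linear(2048, d)     # W_b in eq. (3)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        fmap = self.backbone(images)              # (B, 2048, 7, 7)
        a = fmap.flatten(2).transpose(1, 2)       # (B, k=49, 2048) spatial features a_i
        a_g = a.mean(dim=1)                       # eq. (1): average of the a_i
        V = torch.relu(self.proj_local(a))        # eq. (2): v_i = ReLU(W_a a_i)
        v_g = torch.relu(self.proj_global(a_g))   # eq. (3): v_g = ReLU(W_b a_g)
        return V, v_g
```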
Further, the spatial attention model is used to compute a context vector c_t, defined as:
c_t = g(V, h_t)   (4)
where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d, are the spatial image features, each being a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t.
Given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, we feed them through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:
z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)   (5)
α_t = softmax(z_t)   (6)
where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are parameters to be learnt, and α_t ∈ R^k is the attention weight over the features in V. Based on the attention distribution, the context vector c_t can be obtained by:
c_t = Σ_{i=1}^{k} α_{ti} v_{ti}   (7)
where c_t and h_t are combined to predict the next word y_{t+1} via log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
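A minimal sketch of this spatial attention step (Equations 4 to 7) under the same shape assumptions is given below; it follows the parameter shapes stated above (W_v, W_g ∈ R^{k×d}, w_h ∈ R^k), with illustrative names.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.W_v = nn.Linear(d, k, bias=False)   # W_v in eq. (5)
        self.W_g = nn.Linear(d, k, bias=False)   # W_g in eq. (5)
        self.w_h = nn.Linear(k, 1, bias=False)   # w_h in eq. (5)

    def forward(self, V, h_t):
        # V: (B, k, d) spatial features; h_t: (B, d) LSTM hidden state.
        # eq. (5): z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)
        scores = torch.tanh(self.W_v(V) + self.W_g(h_t).unsqueeze(1))  # (B, k, k)
        z_t = self.w_h(scores).squeeze(-1)                             # (B, k)
        alpha_t = torch.softmax(z_t, dim=-1)                           # eq. (6)
        c_t = (alpha_t.unsqueeze(-1) * V).sum(dim=1)                   # eq. (7)
        return c_t, alpha_t, z_t
```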
Further, for the sentinel gate, the LSTM is extended to obtain the "visual sentinel" vector s_t:
g_t = σ(W_x x_t + W_h h_{t−1})   (8)
s_t = g_t ⊙ tanh(m_t)   (9)
where W_x and W_h are weight parameters to be learnt, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation.
Based on the "visual sentinel", we propose an adaptive attention model to compute a new context vector ĉ_t, modelled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector. The mixture model is defined as:
ĉ_t = β_t s_t + (1 − β_t) c_t   (10)
where β_t is the new sentinel gate at time t. In our mixture model β_t ranges over [0, 1]; a value of 1 means that only the "visual sentinel" information is used, and 0 means that only the spatial image information is used, when generating the next word.
To compute the new sentinel gate β_t, we modify the spatial attention component. In particular, we add an extra element to z_t, the vector of attention scores defined in Equation 5; this element indicates how much "attention" the network places on the sentinel (as opposed to the image features). Adding this extra element converts Equation 6 into:
α̂_t = softmax([z_t; w_h^T tanh(W_s s_t + W_g h_t)])   (11)
where [·; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in Equation 5. α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector, and we interpret its last element as the gate value: β_t = α̂_t[k + 1]. The probability over the vocabulary of possible words at time t can then be computed as:
p_t = softmax(W_p(ĉ_t + h_t))   (12)
where W_p is a weight parameter to be learnt. This formula encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at every time step.
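The sentinel gate and adaptive mixture (Equations 8 to 12) might be sketched as follows, again with illustrative names; for brevity the sketch declares its own W_g and w_h, whereas in the model described above W_g is shared with Equation 5.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    def __init__(self, d, k, vocab_size):
        super().__init__()
        self.W_x = nn.Linear(2 * d, d, bias=False)  # x_t = [w_t; v_g] has size 2d
        self.W_h = nn.Linear(d, d, bias=False)
        self.W_s = nn.Linear(d, k, bias=False)
        self.W_g = nn.Linear(d, k, bias=False)      # shared with eq. (5) in the full model
        self.w_h = nn.Linear(k, 1, bias=False)
        self.W_p = nn.Linear(d, vocab_size)

    def forward(self, x_t, h_prev, m_t, h_t, c_t, z_t):
        g_t = torch.sigmoid(self.W_x(x_t) + self.W_h(h_prev))            # eq. (8)
        s_t = g_t * torch.tanh(m_t)                                      # eq. (9): visual sentinel
        # eq. (11): append one sentinel score to z_t and renormalise.
        z_s = self.w_h(torch.tanh(self.W_s(s_t) + self.W_g(h_t)))        # (B, 1)
        alpha_hat = torch.softmax(torch.cat([z_t, z_s], dim=-1), dim=-1) # (B, k+1)
        beta_t = alpha_hat[:, -1:]                                       # sentinel gate in [0, 1]
        c_hat = beta_t * s_t + (1.0 - beta_t) * c_t                      # eq. (10)
        p_t = torch.softmax(self.W_p(c_hat + h_t), dim=-1)               # eq. (12)
        return p_t, beta_t, s_t
```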
Further, the decoder uses a structure based on a recurrent neural network: the word embedding vector w_t is concatenated with the global image feature vector v_g to obtain the input vector x_t = [w_t; v_g], and single-layer neural networks transform the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors of dimension d.
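Tying the pieces together, one possible decoder step is sketched below, assuming the SpatialAttention and AdaptiveAttention sketches above are in scope and that an nn.LSTMCell's cell state plays the role of the memory cell m_t; the composition and names are assumptions for illustration only. At each step the next caption word can be drawn from p_t, e.g. greedily or with beam search.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, d, k, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTMCell(2 * d, d)        # input is x_t = [w_t; v_g]
        self.attend = SpatialAttention(d, k)     # sketched above
        self.adapt = AdaptiveAttention(d, k, vocab_size)

    def forward(self, word_ids, v_g, V, state):
        h_prev, m_prev = state
        x_t = torch.cat([self.embed(word_ids), v_g], dim=-1)   # x_t = [w_t; v_g]
        h_t, m_t = self.lstm(x_t, (h_prev, m_prev))            # m_t is the memory cell
        c_t, alpha_t, z_t = self.attend(V, h_t)                # spatial attention
        p_t, beta_t, s_t = self.adapt(x_t, h_prev, m_t, h_t, c_t, z_t)
        return p_t, (h_t, m_t)                                 # word distribution + new state
```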
Wherein, for the image caption output, the extracted spatial image features are fed to the trained adaptive spatial attention model gated by the "visual sentinel", allowing the machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image.
Fig. 2 shows the scene-object dataset of the image captioning method based on a visual attention model of the present invention. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes; each image has 5 manually annotated captions.
Fig. 3 shows the model framework of the image captioning method based on a visual attention model of the present invention. The model is a new adaptive attention encoder-decoder framework, comprising an encoder, a spatial attention model, a sentinel gate and a decoder, which automatically decides when to look at the image and when to generate the next word from the language model.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it may be realised in other specific forms without departing from the spirit or scope of the invention. Moreover, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to cover the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. An image captioning method based on a visual attention model, characterised in that it mainly comprises data input (1); preprocessing (2); an adaptive attention model (3); and image caption output (4).
2. The image captioning method based on a visual attention model according to claim 1, characterised in that it comprises a new spatial attention model for extracting spatial image features, and an adaptive attention mechanism introducing a new long short-term memory (LSTM) extension that produces an additional "visual sentinel" vector rather than a single hidden state; the "visual sentinel" is an additional latent representation of the decoder's memory and provides the decoder with a fallback option; a new sentinel gate is further derived from the "visual sentinel", which decides how much new information the decoder obtains from the image, as opposed to relying on the "visual sentinel", when generating the next word.
3. The data input (1) according to claim 1, characterised in that a scene-object dataset is employed; most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
4. The preprocessing (2) according to claim 1, characterised in that captions in the scene-object dataset longer than 18 words are truncated, and a vocabulary is then built from the words that occur at least 5 times in the training set (at least 3 times for a smaller dataset).
5. The adaptive attention model (3) according to claim 1, characterised in that it comprises an encoder, a spatial attention model, a sentinel gate and a decoder; it can automatically determine when to rely on the visual signal and when to rely only on the language model, and when the visual signal is relied upon, the model also determines which region of the image to attend to.
6. The encoder according to claim 5, characterised in that a representation of the image is obtained using a convolutional neural network; the spatial feature output of the last convolutional layer of a ResNet is used, with size 2048 × 7 × 7; we use A = {a_1, …, a_k}, a_i ∈ R^2048, to denote the spatial convolutional neural network features at each of the k grid positions; the global image feature is obtained as follows:
a_g = (1/k) Σ_{i=1}^{k} a_i   (1)
where a_g is the global image feature; for modelling convenience, we use a single-layer perceptron with a rectifier activation function to convert the image feature vectors into new vectors of dimension d:
v_i = ReLU(W_a a_i)   (2)
v_g = ReLU(W_b a_g)   (3)
where W_a and W_b are weight parameters; the converted spatial image features form V = [v_1, …, v_k].
7. The spatial attention model according to claim 5, characterised in that the spatial attention model is used to compute a context vector c_t, defined as:
c_t = g(V, h_t)   (4)
where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d, are the spatial image features, each being a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t;
given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, we feed them through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:
z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)   (5)
α_t = softmax(z_t)   (6)
where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are parameters to be learnt, and α_t ∈ R^k is the attention weight over the features in V; based on the attention distribution, the context vector c_t can be obtained by:
c_t = Σ_{i=1}^{k} α_{ti} v_{ti}   (7)
where c_t and h_t are combined to predict the next word y_{t+1} via log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
8. The sentinel gate according to claim 5, characterised in that the LSTM is extended to obtain the "visual sentinel" vector s_t:
g_t = σ(W_x x_t + W_h h_{t−1})   (8)
s_t = g_t ⊙ tanh(m_t)   (9)
where W_x and W_h are weight parameters to be learnt, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation;
based on the "visual sentinel", we propose an adaptive attention model to compute a new context vector ĉ_t, modelled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector; the mixture model is defined as:
ĉ_t = β_t s_t + (1 − β_t) c_t   (10)
where β_t is the new sentinel gate at time t; in our mixture model β_t ranges over [0, 1]; a value of 1 means that only the "visual sentinel" information is used, and 0 means that only the spatial image information is used, when generating the next word;
to compute the new sentinel gate β_t, we modify the spatial attention component; in particular, we add an extra element to z_t, the vector of attention scores defined in Equation 5; this element indicates how much "attention" the network places on the sentinel (as opposed to the image features); adding this extra element converts Equation 6 into:
α̂_t = softmax([z_t; w_h^T tanh(W_s s_t + W_g h_t)])   (11)
where [·; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in Equation 5; α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector, and we interpret its last element as the gate value: β_t = α̂_t[k + 1]; the probability over the vocabulary of possible words at time t can then be computed as:
p_t = softmax(W_p(ĉ_t + h_t))   (12)
where W_p is a weight parameter to be learnt; this formula encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at every time step.
9. The decoder according to claim 5, characterised in that a structure based on a recurrent neural network is used; the word embedding vector w_t is concatenated with the global image feature vector v_g to obtain the input vector x_t = [w_t; v_g]; single-layer neural networks transform the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors of dimension d.
10. The image caption output (4) according to claim 1, characterised in that the extracted spatial image features are fed to the trained adaptive spatial attention model gated by the "visual sentinel", allowing the machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207945.6A CN106778926A (en) | 2016-12-23 | 2016-12-23 | An image captioning method based on a visual attention model
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207945.6A CN106778926A (en) | 2016-12-23 | 2016-12-23 | An image captioning method based on a visual attention model
Publications (1)
Publication Number | Publication Date |
---|---|
CN106778926A true CN106778926A (en) | 2017-05-31 |
Family
ID=58919991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611207945.6A | CN106778926A (en), withdrawn | 2016-12-23 | 2016-12-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106778926A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
CN107609563A (en) * | 2017-09-15 | 2018-01-19 | 成都澳海川科技有限公司 | Picture semantic describes method and device |
CN108024158A (en) * | 2017-11-30 | 2018-05-11 | 天津大学 | There is supervision video abstraction extraction method using visual attention mechanism |
CN108171283A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | A kind of picture material automatic describing method based on structuring semantic embedding |
CN108230413A (en) * | 2018-01-23 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image Description Methods and device, electronic equipment, computer storage media, program |
CN108228700A (en) * | 2017-09-30 | 2018-06-29 | 北京市商汤科技开发有限公司 | Training method, device, electronic equipment and the storage medium of image description model |
CN108985370A (en) * | 2018-07-10 | 2018-12-11 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
CN109871736A (en) * | 2018-11-23 | 2019-06-11 | 腾讯科技(深圳)有限公司 | The generation method and device of natural language description information |
CN110119754A (en) * | 2019-02-27 | 2019-08-13 | 北京邮电大学 | Image generates description method, apparatus and model |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
CN110210499A (en) * | 2019-06-03 | 2019-09-06 | 中国矿业大学 | A kind of adaptive generation system of image, semantic description |
CN114022735A (en) * | 2021-11-09 | 2022-02-08 | 北京有竹居网络技术有限公司 | Training method, device, equipment and medium for visual language pre-training model |
CN114419402A (en) * | 2022-03-29 | 2022-04-29 | 中国人民解放军国防科技大学 | Image story description generation method and device, computer equipment and storage medium |
- 2016-12-23: CN application CN201611207945.6A filed; published as CN106778926A (en); status not active (withdrawn)
Non-Patent Citations (1)
Title |
---|
JIASEN LU et al.: "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning", arXiv:1612.01887v1 *
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
CN107609563A (en) * | 2017-09-15 | 2018-01-19 | 成都澳海川科技有限公司 | Picture semantic describes method and device |
CN108228700B (en) * | 2017-09-30 | 2021-01-26 | 北京市商汤科技开发有限公司 | Training method and device of image description model, electronic equipment and storage medium |
CN108228700A (en) * | 2017-09-30 | 2018-06-29 | 北京市商汤科技开发有限公司 | Training method, device, electronic equipment and the storage medium of image description model |
CN108024158A (en) * | 2017-11-30 | 2018-05-11 | 天津大学 | There is supervision video abstraction extraction method using visual attention mechanism |
CN108171283A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | A kind of picture material automatic describing method based on structuring semantic embedding |
CN108171283B (en) * | 2017-12-31 | 2020-06-16 | 厦门大学 | Image content automatic description method based on structured semantic embedding |
CN108230413A (en) * | 2018-01-23 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image Description Methods and device, electronic equipment, computer storage media, program |
CN108230413B (en) * | 2018-01-23 | 2021-07-06 | 北京市商汤科技开发有限公司 | Image description method and device, electronic equipment and computer storage medium |
CN108985370A (en) * | 2018-07-10 | 2018-12-11 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
CN108985370B (en) * | 2018-07-10 | 2021-04-16 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
US11868738B2 (en) | 2018-11-23 | 2024-01-09 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating natural language description information |
CN109871736A (en) * | 2018-11-23 | 2019-06-11 | 腾讯科技(深圳)有限公司 | The generation method and device of natural language description information |
CN109871736B (en) * | 2018-11-23 | 2023-01-31 | 腾讯科技(深圳)有限公司 | Method and device for generating natural language description information |
CN110119754A (en) * | 2019-02-27 | 2019-08-13 | 北京邮电大学 | Image generates description method, apparatus and model |
CN110119754B (en) * | 2019-02-27 | 2022-03-29 | 北京邮电大学 | Image generation description method, device and model |
CN110210499A (en) * | 2019-06-03 | 2019-09-06 | 中国矿业大学 | A kind of adaptive generation system of image, semantic description |
CN110210499B (en) * | 2019-06-03 | 2023-10-13 | 中国矿业大学 | Self-adaptive generation system for image semantic description |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
CN114022735A (en) * | 2021-11-09 | 2022-02-08 | 北京有竹居网络技术有限公司 | Training method, device, equipment and medium for visual language pre-training model |
CN114419402A (en) * | 2022-03-29 | 2022-04-29 | 中国人民解放军国防科技大学 | Image story description generation method and device, computer equipment and storage medium |
CN114419402B (en) * | 2022-03-29 | 2023-08-18 | 中国人民解放军国防科技大学 | Image story description generation method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106778926A (en) | An image captioning method based on a visual attention model | |
CN105244020B (en) | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN106650813B (en) | A kind of image understanding method based on depth residual error network and LSTM | |
CN107273355B (en) | Chinese word vector generation method based on word and phrase joint training | |
CN111368993B (en) | Data processing method and related equipment | |
CN102436811B (en) | Full-sequence training of deep structures for speech recognition | |
CN109902293A (en) | A kind of file classification method based on part with global mutually attention mechanism | |
CN108536754A (en) | Electronic health record entity relation extraction method based on BLSTM and attention mechanism | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN107924680A (en) | Speech understanding system | |
CN106844442A (en) | Multi-modal Recognition with Recurrent Neural Network Image Description Methods based on FCN feature extractions | |
CN106650756A (en) | Image text description method based on knowledge transfer multi-modal recurrent neural network | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
CN108153864A (en) | Method based on neural network generation text snippet | |
CN109271516B (en) | Method and system for classifying entity types in knowledge graph | |
CN110223714A (en) | A kind of voice-based Emotion identification method | |
CN103793507B (en) | A kind of method using deep structure to obtain bimodal similarity measure | |
CN110334196B (en) | Neural network Chinese problem generation system based on strokes and self-attention mechanism | |
CN114021524B (en) | Emotion recognition method, device, equipment and readable storage medium | |
CN112926655B (en) | Image content understanding and visual question and answer VQA method, storage medium and terminal | |
CN112348911A (en) | Semantic constraint-based method and system for generating fine-grained image by stacking texts | |
JP2022503812A (en) | Sentence processing method, sentence decoding method, device, program and equipment | |
Malakan et al. | Vision transformer based model for describing a set of images as a story | |
Yang et al. | Text classification based on convolutional neural network and attention model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20170531 |