CN107807971A

CN107807971A - Automatic image semantic description method

Info

Publication number: CN107807971A
Application number: CN201710969647.9A
Authority: CN
Inventors: 吕学强; 董志安; 李卓
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2017-10-18
Filing date: 2017-10-18
Publication date: 2018-03-16

Abstract

The present invention relates to an automatic image semantic description method, comprising building and training an automatic image semantic description model based on a CNN and a GRU, specifically: step 1) defining an objective function; step 2) performing the translation process from image to semantic description; step 3) back-propagating the error. The method provided by the invention uses the features of a fully connected layer extracted by the CNN as the input of the GRU model, effectively fusing the low-level features of the image with the high-level semantic information of the image semantic description. It offers high precision and accuracy, reaches a relatively high semantic description accuracy with fewer parameters, and can well meet the needs of practical application.

Description

Automatic image semantic description method

Technical field

The invention belongs to the technical field of image semantic description, and in particular relates to an automatic image semantic description method.

Background art

In recent years, researchers have worked continuously on enabling computers to understand image semantics. With the development of computer hardware, automatic image semantic description has become a research hotspot. Automatic image semantic description understands not only the entities in an image but also the events, scenes, and so on depicted in it, which constitutes a deeper understanding of image semantics. At present the technology is still at an early stage: because of the "semantic gap" problem of images and the complex and variable syntactic structure of natural language itself, computers have so far been unable to describe image semantic information accurately. With the rapid development of computer hardware and of deep learning in the image field in recent years, more and more researchers have taken up research on automatic image semantic description. Deep learning techniques such as convolutional neural network models extract image features better than hand-designed image features, but they depend on strong computing power. The recent development of hardware such as GPUs has provided that computing support, which has made the difficult task of automatic image semantic description a current research hotspot in the field of machine vision.

Today's world is moving toward an era of intelligence. More and more future technologies, such as autonomous driving and intelligent robots, are gradually coming into public view. Autonomous driving requires computers to understand road conditions automatically, and intelligent robots require artificial eyes that imitate the function of the human eye and brain to recognize surrounding objects. All of these rely on a deep understanding of images by computers. Automatic image semantic description automatically describes image content in natural language, from which the computer then understands the image content. It is therefore a supporting technology for the coming era of intelligence and has important research significance and commercial value. Research on automatic image semantic description is at an early stage of development, and the research results obtained so far are few. On the one hand, the content depicted in an image is itself complex, the image "semantic gap" problem has not yet been effectively solved, and the accuracy of recognizing objects in images is not high. On the other hand, automatic image semantic description expresses image content as natural language, whose form is not fixed and whose sentence patterns are numerous. Expressing image content as natural language with rich content and varied sentence patterns is difficult and highly challenging.

The conventional approach to automatic image semantic description is first to label the entities in the image with words and then to combine the words into sentences using a language model. Because the content of an image is rich, and some objects in it may be occluded or incomplete, the objects obtained after image segmentation cannot be recognized and labeled accurately, which directly leads to low accuracy of the image semantic description. Moreover, the semantic descriptions produced by this approach are uniform in content and form and simple in structure, and their understanding of image semantics is neither accurate nor comprehensive enough. In recent years convolutional neural networks (CNN) have been applied to image feature extraction. In the prior art, the image features extracted by a CNN are used as the input of a recurrent neural network (RNN), and the image semantic description information is the output of the RNN; the image semantic description problem is treated as a translation process from image to semantic description, and an automatic image semantic description model based on CNN and RNN is constructed. However, the accuracy with which this method understands image semantics is not high, the sentences generated by the model are not fluent enough, and the accuracy of the generated content is not high.

Summary of the invention

In view of the above problems in the prior art, the object of the present invention is to provide an automatic image semantic description method that avoids the above technical defects.

To achieve the above object, the present invention provides the following technical solution:

An automatic image semantic description method, comprising building and training an automatic image semantic description model based on a CNN and a GRU, specifically:

Step 1) defining an objective function;

Step 2) performing the translation process from image to semantic description;

Step 3) back-propagating the error.

Further, the objective function in step 1) is

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta)$$

where $\theta$ denotes all parameters of the model, $I$ denotes an image, and $S = (s_0, \ldots, s_N)$ denotes the finally predicted word sequence, i.e., the final semantic description.

Further, step 2) is given by the following formulas:

$$x_{-1} = \mathrm{CNN}(I);$$

$$x_t = W_e s_t, \quad t \in \{0, \ldots, N-1\};$$

$$h_t = \mathrm{GRU}(x_t), \quad t \in \{0, \ldots, N-1\};$$

$$p_{t+1} = g(W_p h_t);$$

where $I$ denotes an image and $S = (s_0, s_1, s_2, \ldots, s_N)$ denotes the complete semantic description of the image, consisting of N words. $s_t$ is in one-hot encoded form; $s_0$ is a special word "start" marking the beginning of a sentence; $s_N$ is a special word "end" marking the end of a sentence.

Further, step 3) comprises:

defining a loss function: $L(I,S) = -\sum_{t=1}^{N} \log p_t(s_t)$, the loss function being the negative of the sum, over all time steps, of the log probability of correctly predicting the word, i.e., a cross-entropy loss function;

continuously updating the parameters of the model through training so that the loss value is as small as possible;

updating the parameters using stochastic gradient descent and the chain rule of differentiation.
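As an illustrative sketch of this update, and not a formula taken from the original text, one stochastic gradient descent step on a single image-description pair $(I, S)$ can be written as follows. The learning rate $\eta$ is a symbol assumed here, and $g$ is assumed to be the softmax function:

$$\theta \leftarrow \theta - \eta \, \frac{\partial L(I,S)}{\partial \theta}, \qquad L(I,S) = -\sum_{t=1}^{N} \log p_t(s_t)$$

For the output decoding parameter $W_p$, with $p_{t+1} = g(W_p h_t)$ and one-hot target $s_{t+1}$, the chain rule gives

$$\frac{\partial L(I,S)}{\partial W_p} = \sum_{t=0}^{N-1} \left(p_{t+1} - s_{t+1}\right) h_t^{\top}$$

The gradients of the remaining parameters follow by applying the chain rule further back through $h_t$ and $x_t$.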

Further, the parameters comprise the internal parameters of the GRU model, the word vector encoding parameters, the image feature encoding parameters, and the output decoding parameters.

Further, during training of the model, the weight parameters of the GRU network are shared across all time steps, and the output of the GRU network at the previous time step serves as part of the input of the GRU network at the current time step.

Further, the CNN comprises two kinds of hidden layer structure: convolutional layers and pooling layers.

Further, the neurons of one layer of the CNN are not fully connected to the neurons of the preceding layer, i.e., the neurons have local receptive fields; in addition, the neuron connections have identical weights, i.e., the connections are weight-shared.

Further, the structure of the GRU contains a reset gate, expressed as $r_t = \sigma(W_r x_t + U_r h_{t-1})$, where $\sigma$ is the activation function, $x_t$ is the input at time $t$, $h_{t-1}$ is the output of the hidden layer at time $t-1$, $W_r$ is the input-layer weight of the reset gate at time $t$, and $U_r$ is the hidden-layer weight of the reset gate at time $t$.

Further, the structure of the GRU contains an update gate, which can be expressed by the formula:

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$

where $\sigma$ is the activation function, $x_t$ is the input at time $t$, $h_{t-1}$ is the output of the hidden layer at time $t-1$, $W_z$ is the input-layer weight of the update gate at time $t$, and $U_z$ is the hidden-layer weight of the update gate at time $t$.

The automatic image semantic description method provided by the invention uses the features of a fully connected layer extracted by the CNN as the input of the GRU model, effectively fusing the low-level features of the image with the high-level semantic information of the image semantic description. It offers high precision and accuracy, reaches a relatively high semantic description accuracy with fewer parameters, and can well meet the needs of practical application.

Brief description of the drawings

Fig. 1 is a flowchart of the steps of building and training the automatic image semantic description model based on CNN and GRU according to the present invention;

Fig. 2 is a schematic structural diagram of the automatic image semantic description model based on CNN and GRU;

Fig. 3 is a schematic diagram of the basic structure of a traditional neural network model;

Fig. 4 is a schematic diagram of the conventional structure of an RNN neural network model;

Fig. 5 is a schematic structural diagram of the GRU model.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is further described below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described herein serve only to explain the invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the scope of protection of the invention.

An automatic image semantic description method comprises the steps of building and training an automatic image semantic description model based on a CNN and a GRU. Referring to Fig. 1, building and training the model specifically comprises the following steps:

Defining the objective function:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) \tag{1}$$

Performing the translation process from image to semantic description, as given by the following formulas:

$$x_{-1} = \mathrm{CNN}(I) \tag{2}$$

$$x_t = W_e s_t, \quad t \in \{0, \ldots, N-1\} \tag{3}$$

$$h_t = \mathrm{GRU}(x_t), \quad t \in \{0, \ldots, N-1\} \tag{4}$$

$$p_{t+1} = g(W_p h_t) \tag{5}$$

where $I$ denotes an image and $S = (s_0, s_1, s_2, \ldots, s_N)$ denotes the complete semantic description of the image, consisting of N words. $s_t$ is in one-hot encoded form (one-hot: a vector whose dimension equals the number of words in the vocabulary, with a 1 in one position and 0 in all others). $s_0$ is a special word "start" marking the beginning of a sentence, and $s_N$ is a special word "end" marking the end of a sentence. The image features are input to the GRU network only at time $t = -1$; at subsequent time steps the words of the semantic description $S$ corresponding to the input image are fed into the GRU network in order. To ensure that the information fed into the GRU network has the same dimension at every time step, both the image features input at $t = -1$ and the words $s_t$ input afterwards must be encoded: $s_t$ is encoded by the word weight parameter $W_e$, and the image features are encoded by the image encoding parameter $W_L$. From time $t = 0$ onward, the output $h_t$ of the GRU model at each time step is decoded by the output decoding parameter $W_p$ and then passed through a softmax classifier to obtain a prediction $p_{t+1}$ (formula (5)); that is, each time step produces, in order, the word that follows the word currently input to the GRU model. There is a gap between this prediction and the correct next word, and during training this error must be back-propagated.
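The input schedule described above can be sketched in plain Python as follows. This is a minimal illustration and not the patented implementation: the vocabulary, the dimensions, and the random weights are hypothetical, the CNN feature is a random vector, and a simple tanh recurrence (`recurrent_step`) stands in for the GRU unit so that the sketch stays short.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(v):
    """Numerically stable softmax, playing the role of g in formula (5)."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    """Vector with a 1 at `index` and 0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def recurrent_step(x, h, W, U):
    """Assumed stand-in for the GRU unit: h_t = tanh(W x_t + U h_{t-1})."""
    return [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, h))]

def caption_forward(cnn_feature, word_ids, params, vocab_size):
    """Return the predictions p_1 ... p_N for one (image, description) pair."""
    W_L, W_e, W_p, W, U = params
    h = [0.0] * len(U)
    # t = -1: only the encoded image feature is fed in.
    h = recurrent_step(matvec(W_L, cnn_feature), h, W, U)
    predictions = []
    # t = 0 ... N-1: feed word s_t, predict word s_{t+1}.
    for word_id in word_ids[:-1]:
        x_t = matvec(W_e, one_hot(word_id, vocab_size))  # formula (3)
        h = recurrent_step(x_t, h, W, U)                 # formula (4), stand-in unit
        predictions.append(softmax(matvec(W_p, h)))      # formula (5)
    return predictions

rng = random.Random(0)
vocab = ["start", "a", "dog", "runs", "end"]    # hypothetical vocabulary
feature_dim, embed_dim, hidden_dim = 6, 4, 3    # hypothetical sizes
params = (
    rand_matrix(embed_dim, feature_dim, rng),   # W_L: image feature encoding
    rand_matrix(embed_dim, len(vocab), rng),    # W_e: word encoding
    rand_matrix(len(vocab), hidden_dim, rng),   # W_p: output decoding
    rand_matrix(hidden_dim, embed_dim, rng),    # W: input weights of the stand-in unit
    rand_matrix(hidden_dim, hidden_dim, rng),   # U: hidden weights of the stand-in unit
)
cnn_feature = [rng.random() for _ in range(feature_dim)]  # placeholder for CNN(I)
caption = [0, 1, 2, 3, 4]                                 # start a dog runs end
preds = caption_forward(cnn_feature, caption, params, len(vocab))
```

Here `preds` holds one probability distribution over the vocabulary per input word. Both the image feature and the words are mapped to the same dimension (`embed_dim`) before entering the recurrent unit, which is the purpose of the two encoding parameters.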

The loss function is defined as follows:

$$L(I,S) = -\sum_{t=1}^{N} \log p_t(s_t) \tag{6}$$

Model training is the process of back-propagating the error and updating the model parameters. As shown in formula (6), the loss function is the negative of the sum of the log probabilities of correctly predicting the word at every time step, i.e., the cross-entropy loss function. Training continuously updates the parameters of the model so that the loss value is as small as possible. These parameters include the internal parameters of the GRU model, the word vector encoding parameters, the image feature encoding parameters, the output decoding parameters, and so on. They are updated using stochastic gradient descent (SGD) and the chain rule of differentiation. During training, the weight parameters of the GRU network are shared across all time steps, and the output of the GRU network at the previous time step serves as part of the input of the GRU network at the current time step (see formulas (1)-(4)).
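The loss and one kind of parameter update can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the hidden output `h`, the four-word vocabulary, and the learning rate are hypothetical, and only the output decoding parameter `W_p` is updated for a single time step, whereas the method described here updates all parameters by back-propagation.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def caption_loss(predictions, targets):
    """Negative sum of the log probabilities assigned to the correct words."""
    return -sum(math.log(p[s]) for p, s in zip(predictions, targets))

def decode(W_p, h):
    """p = softmax(W_p h), as in formula (5)."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W_p])

def sgd_step_output_layer(W_p, h, target, lr):
    """One SGD update of W_p for a single time step.

    By the chain rule, dL/dW_p[i][j] = (p[i] - 1{i == target}) * h[j].
    """
    p = decode(W_p, h)
    return [
        [w - lr * (p[i] - (1.0 if i == target else 0.0)) * h[j]
         for j, w in enumerate(row)]
        for i, row in enumerate(W_p)
    ]

h = [0.5, -0.2, 0.1]                   # hypothetical hidden output h_t
W_p = [[0.0] * 3 for _ in range(4)]    # 4-word vocabulary, zero-initialised
target = 2                             # index of the correct next word
loss_before = caption_loss([decode(W_p, h)], [target])
for _ in range(50):
    W_p = sgd_step_output_layer(W_p, h, target, lr=0.5)
loss_after = caption_loss([decode(W_p, h)], [target])
```

With zero weights the prediction is uniform, so `loss_before` equals log 4; repeated updates raise the probability of the correct word and the loss falls.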

A convolutional neural network (CNN) contains two distinctive kinds of hidden layer: convolutional layers and pooling layers. With the development of computer hardware (CPU, GPU) in recent years, computing performance has improved greatly, and relying on it, more complex neural network models such as CNNs have gradually become a research hotspot in many fields. CNNs have good feature extraction ability and are currently widely used in fields such as image, video, and speech.

A CNN has a distinctive network structure, in two respects. First, the neurons of one layer are not fully connected to the neurons of the preceding layer, i.e., the neurons have local receptive fields. Second, the neuron connections have identical weights, i.e., the connections are weight-shared. This structure of local receptive fields and shared weights is close to that of biological neural networks. Such a model effectively reduces the number of parameters in the network and hence its complexity. A CNN has two distinctive hidden layer structures, the convolutional layer and the pooling layer. A convolutional layer in a CNN consists of multiple convolution kernels; a kernel is a filter of size M*M that extracts a certain local feature at each local position of the receptive field in the preceding layer. A pooling layer reduces the dimension of the convolution features of the preceding layer: the convolution features are divided into multiple N*N regions, and the mean (or maximum) value in each region is extracted as the dimension-reduced feature. After a series of convolutional, pooling, and fully connected layers, a CNN usually ends with a softmax classifier for handling multi-class classification problems.
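The convolution and pooling operations described above can be sketched in plain Python. The 5*5 input, the 2*2 kernel, and the 2*2 pooling regions are arbitrary values chosen for illustration; as is usual in CNN practice, the "convolution" is computed without flipping the kernel.

```python
def convolve2d(image, kernel):
    """Slide one shared M*M kernel over every local position of the input."""
    m = len(kernel)
    rows = len(image) - m + 1
    cols = len(image[0]) - m + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(m) for j in range(m))
         for c in range(cols)]
        for r in range(rows)
    ]

def pool2d(feature, n, mode="max"):
    """Split the feature map into non-overlapping N*N regions and keep
    the maximum or the mean of each region."""
    reducer = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    pooled = []
    for r in range(0, len(feature) - n + 1, n):
        row = []
        for c in range(0, len(feature[0]) - n + 1, n):
            region = [feature[r + i][c + j] for i in range(n) for j in range(n)]
            row.append(reducer(region))
        pooled.append(row)
    return pooled

image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]                   # one 2*2 filter, weights shared everywhere
feature_map = convolve2d(image, kernel)      # 4*4 convolution feature
pooled_max = pool2d(feature_map, 2, "max")   # [[0, 1], [0, 2]]
pooled_avg = pool2d(feature_map, 2, "mean")  # [[0.0, -0.5], [-0.25, 0.5]]
```

The kernel has only four weights regardless of the image size, which illustrates why local receptive fields and weight sharing reduce the number of parameters.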

A recurrent neural network (hereinafter RNN) is a kind of neural network model; because its structure has a memory function, it is used in machine translation models. A neural network model comprises three kinds of layer: input layer, hidden layer, and output layer. In a traditional neural network model, such as the convolutional neural network model described above, information flows from the input layer through the hidden layer to the output layer; the nodes within a layer are not connected to one another, while nodes in adjacent layers are connected, as shown in Fig. 3. Such a traditional model has no function for remembering information and is of no help for problems that must be computed from information produced earlier. For example, to predict the next word of a sentence, it is in most cases necessary to rely on the words produced before it. In the sentence "I am a basketball player, I like playing basketball", the "playing basketball" in the second clause needs to be inferred from "basketball player" in the first. An RNN model can remember information produced at earlier time steps and apply it to the computation at the current time step. This is due to a structural change relative to the traditional neural network model: the input of the RNN hidden layer includes not only the output of the input layer at the current time step but also the output of the hidden layer at the previous time step, i.e., the nodes within the hidden layer are connected, as shown in Fig. 4.

The GRU is an improvement on the RNN model. Like the LSTM model, it can solve the long-term dependency problem of the RNN model. Compared with the LSTM model it has fewer internal parameters, is less prone to overfitting during training, and trains faster. The specific structure of the GRU is shown in Fig. 5:
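The memory effect of the hidden-layer connection can be shown with a small Python sketch. The tanh activation and all weight values are assumptions made for illustration. Two sequences end with the same input but differ in their history; the final hidden state differs only when the hidden-to-hidden weights `U` are non-zero.

```python
import math

def rnn_step(x_t, h_prev, W, U):
    """h_t = tanh(W x_t + U h_{t-1}): the hidden layer receives both the
    current input and the previous hidden output."""
    return [
        math.tanh(sum(w * x for w, x in zip(W[i], x_t))
                  + sum(u * h for u, h in zip(U[i], h_prev)))
        for i in range(len(W))
    ]

def run(sequence, W, U):
    """Feed a whole sequence through the recurrence and return the last state."""
    h = [0.0] * len(W)
    for x_t in sequence:
        h = rnn_step(x_t, h, W, U)
    return h

W = [[1.0, 0.0], [0.0, 1.0]]          # input weights (hypothetical)
U = [[0.5, 0.0], [0.0, 0.5]]          # hidden-to-hidden weights (hypothetical)
no_memory = [[0.0, 0.0], [0.0, 0.0]]  # removes the hidden-layer connection

seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 0.0], [0.0, 1.0]]      # same last input, different history

with_memory_differs = run(seq_a, W, U) != run(seq_b, W, U)
without_memory_same = run(seq_a, W, no_memory) == run(seq_b, W, no_memory)
```

Setting `U` to zero reduces the model to the memoryless case described for traditional networks, where only the current input determines the output.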

Referring to Fig. 5, there is a gate r, called the reset gate, which can be expressed by the formula:

$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{3.1}$$

where $\sigma$ is the activation function, $x_t$ is the input at time $t$, $h_{t-1}$ is the output of the hidden layer at time $t-1$, $W_r$ is the input-layer weight of the reset gate at time $t$, and $U_r$ is the hidden-layer weight of the reset gate at time $t$.

There is another gate z in Fig. 5, called the update gate, which can be expressed by the formula:

$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{3.2}$$

As can be seen from Fig. 5, the output of the hidden layer at time $t$ can be expressed as:

$$h_t = z_t h_{t-1} + (1 - z_t)\,\tilde{h}_t \tag{3.3}$$

where $z_t$ is the state of the update gate at time $t$, $h_{t-1}$ is the output of the hidden layer at time $t-1$, and $\tilde{h}_t$ is the hidden layer state at time $t$, given by formula (3.4):

$$\tilde{h}_t = \phi\left(W_t x_t + U_t (r_t \odot h_{t-1})\right) \tag{3.4}$$

where $\phi$ is the activation function, $\odot$ is the element-wise product, $W_t$ is the input-layer weight at time $t$, $U_t$ is the hidden-layer weight at time $t$, $r_t$ is the state of the reset gate at time $t$, and $h_{t-1}$ is the output of the hidden layer at time $t-1$.

From formula (3.4) it can be seen that when the reset gate r is close to 0, the second term inside the hidden layer state at time t is close to 0; that is, the hidden state at time t contains only the input-layer information of the current time step, and the hidden-layer output of the previous time step is ignored. This allows the hidden layer to discard information irrelevant to later time steps, so that the retained information is more meaningful. The update gate, on the other hand, controls the trade-off between the hidden-layer information of the previous time step and the hidden state information of the current time step. From formula (3.3), when $z_t$ is 1 the hidden layer at time t outputs only the hidden-layer information of the previous time step, and when $z_t$ is 0 it outputs only the hidden state information of time t. The reset gate and the update gate at each time step are independent of each other; thus when short-term information is needed the reset gate is active, and when long-term information is needed the update gate is active.

The embodiments described above express only some implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention. The scope of protection of this patent shall therefore be determined by the appended claims.
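The gate behaviour described above can be checked with a minimal GRU step in plain Python. Several details are assumptions: the logistic sigmoid is used for the gate activation and tanh for the hidden state activation, the weight names `W_r`, `U_r`, `W_z`, `U_z`, `W`, `U` are chosen for this sketch, and all numeric values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, x, U, h):
    """Compute W x + U h row by row."""
    return [
        sum(w * xi for w, xi in zip(W[i], x)) + sum(u * hi for u, hi in zip(U[i], h))
        for i in range(len(W))
    ]

def gru_step(x_t, h_prev, p):
    """One GRU time step: reset gate, update gate, candidate state, output."""
    r = [sigmoid(a) for a in affine(p["W_r"], x_t, p["U_r"], h_prev)]  # reset gate
    z = [sigmoid(a) for a in affine(p["W_z"], x_t, p["U_z"], h_prev)]  # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                     # r_t ⊙ h_{t-1}
    h_cand = [math.tanh(a) for a in affine(p["W"], x_t, p["U"], gated)]
    # z_t = 1 keeps the previous output, z_t = 0 takes the candidate state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def diag(g):
    """2*2 diagonal matrix with value g on the diagonal."""
    return [[g, 0.0], [0.0, g]]

def make_params(z_gain, r_gain):
    """Hypothetical weights: the gains push the gates toward 0 or 1 for x = [1, 1]."""
    return {"W_r": diag(r_gain), "U_r": diag(0.0),
            "W_z": diag(z_gain), "U_z": diag(0.0),
            "W": diag(1.0), "U": diag(1.0)}

x_t, h_prev = [1.0, 1.0], [0.3, -0.6]
keep_old = gru_step(x_t, h_prev, make_params(z_gain=50.0, r_gain=0.0))    # z near 1
forget = gru_step(x_t, h_prev, make_params(z_gain=-50.0, r_gain=-50.0))   # z, r near 0
```

With the update gate saturated at 1, `keep_old` reproduces `h_prev`; with both gates near 0, `forget` depends only on the current input and equals tanh(1) in each component.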

Claims

1. An automatic image semantic description method, characterized by comprising building and training an automatic image semantic description model based on a CNN and a GRU, specifically:

Step 1) defining an objective function;

Step 2) performing the translation process from image to semantic description;

Step 3) back-propagating the error.

2. The automatic image semantic description method according to claim 1, characterized in that the objective function in step 1) is

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta)$$

3. The automatic image semantic description method according to claim 1 or 2, characterized in that step 2) is given by the following formulas:

$$x_{-1} = \mathrm{CNN}(I);$$

$$x_t = W_e s_t, \quad t \in \{0, \ldots, N-1\};$$

$$h_t = \mathrm{GRU}(x_t), \quad t \in \{0, \ldots, N-1\};$$

$$p_{t+1} = g(W_p h_t);$$

where $I$ denotes an image and $S = (s_0, s_1, s_2, \ldots, s_N)$ denotes the complete semantic description of the image, consisting of N words. $s_t$ is in one-hot encoded form; $s_0$ is a special word "start" marking the beginning of a sentence; $s_N$ is a special word "end" marking the end of a sentence.

4. The automatic image semantic description method according to claim 1, characterized in that step 3) comprises:

defining a loss function: $L(I,S) = -\sum_{t=1}^{N} \log p_t(s_t)$, the loss function being the negative of the sum, over all time steps, of the log probability of correctly predicting the word, i.e., a cross-entropy loss function;

5. The automatic image semantic description method according to claim 1, characterized in that the parameters comprise the internal parameters of the GRU model, the word vector encoding parameters, the image feature encoding parameters, and the output decoding parameters.

6. The automatic image semantic description method according to any one of claims 1 to 5, characterized in that, during training of the model, the weight parameters of the GRU network are shared across all time steps, and the output of the GRU network at the previous time step serves as part of the input of the GRU network at the current time step.

7. The automatic image semantic description method according to any one of claims 1 to 6, characterized in that the CNN comprises two kinds of hidden layer structure: convolutional layers and pooling layers.

8. The automatic image semantic description method according to any one of claims 1 to 7, characterized in that the neurons of one layer of the CNN are not fully connected to the neurons of the preceding layer, i.e., the neurons have local receptive fields; in addition, the neuron connections have identical weights, i.e., the connections are weight-shared.

9. The automatic image semantic description method according to any one of claims 1 to 8, characterized in that the structure of the GRU contains a reset gate, expressed as $r_t = \sigma(W_r x_t + U_r h_{t-1})$, where $\sigma$ is the activation function, $x_t$ is the input at time $t$, $h_{t-1}$ is the output of the hidden layer at time $t-1$, $W_r$ is the input-layer weight of the reset gate at time $t$, and $U_r$ is the hidden-layer weight of the reset gate at time $t$.

10. The automatic image semantic description method according to any one of claims 1 to 9, characterized in that the structure of the GRU contains an update gate, which can be expressed by the formula:

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$