CN107577983A

CN107577983A - Method for recognizing multi-label images by recurrently discovering regions of interest

Info

Publication number: CN107577983A
Application number: CN201710562354.9A
Authority: CN
Inventors: 林倞; 王州霞; 李冠彬; 陈添水; 成慧
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2017-07-11
Filing date: 2017-07-11
Publication date: 2018-01-12

Abstract

The present invention provides a method for recognizing multi-label images by recurrently discovering regions of interest. The proposed multi-label image recognition framework does not depend on region proposals; it automatically finds semantically relevant regions of different scales in the image and at the same time captures the contextual dependencies between those regions. Three constraints are also proposed for the spatial transformer network. They help to localize regions that carry more semantic information and further improve the accuracy of multi-label image recognition. The invention improves the recognition accuracy for multi-label images and, to a large extent, the efficiency of recognition.

Description

Method for recognizing multi-label images by recurrently discovering regions of interest

Technical field

The present invention relates to the fields of computer vision and pattern recognition, and more particularly to a method for recognizing multi-label images by recurrently discovering regions of interest.

Background art

Recognizing multi-label images is a common and practical task in computer vision, because real-world images generally contain rich and diverse semantics. The main difficulty of the task is how to associate semantic labels effectively with image content (such as regions or sub-regions), particularly in complex scenes, for example where the foreground objects are scattered and vary in size.

Current methods for multi-label image classification generally rely on single-label classification and object localization techniques. In recent years it has been shown that, for this problem, jointly considering the spatial information of the different objects and their global information in the image brings large performance gains. The typical pipeline of existing methods has two steps: 1) extract a large number of region proposals, on the assumption that these proposals cover all foreground objects; 2) predict the labels of the region proposals and aggregate them into the multiple labels of the image. However, the dependence of these methods on region proposal generation usually leads to redundant computation, and ignores or oversimplifies the contextual relationships between foreground objects. In addition, the training of such two-step methods is imperfect: end-to-end joint optimization is difficult to achieve in both the training phase and the test phase.

In summary, current research on multi-label image recognition has the following main problems:

1) Most current work depends on the generation of region proposals, and most proposal generation algorithms, especially bottom-up ones, are extremely time-consuming. Using region proposals also ignores or oversimplifies the contextual relationships between foreground objects;

2) Current work has difficulty achieving end-to-end joint optimization.

Summary of the invention

The present invention provides a method for recognizing multi-label images by recurrently discovering regions of interest. The method effectively improves the accuracy of multi-label image recognition and greatly reduces the time cost.

To achieve the above technical effect, the technical solution of the present invention is as follows:

A method for recognizing multi-label images by recurrently discovering regions of interest, comprising the following steps:

S1: extracting a feature representation of the sample with a convolutional neural network;

S2: using the transformation matrix predicted at the previous time step, cropping the region of interest from the feature map obtained in step S1 by means of a spatial transformer network;

S3: feeding the region of interest into a long short-term memory (LSTM) unit, which generates the hidden state and memory state of the current time step from the input and from the hidden state and memory state of the previous time step;

S4: predicting the classification score vector of the region of interest from the hidden state of the current time step, and predicting the transformation matrix required by the spatial transformer network at the next time step;

S5: repeating steps S2 to S4 until time step K, and fusing the score vectors predicted at time steps 2 to K to obtain the final classification result for the image.

Further, the transformation matrix in step S2 has the form

$$\begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix},$$

where (s_x, s_y) denotes the scale transformation, the rotation entries (r_x, r_y) are fixed at zero, and (t_x, t_y) denotes the translation, whose values lie in [-1, 1]. Because the parameters of the transformation matrix describe only scaling and translation, the spatial transformer network crops one corresponding region from every channel of the global feature map, resizes it to a fixed size and outputs it.

Further, the spatial transformer network is implemented as follows:

S21: for a known target-matrix coordinate (x^t, y^t), where -1 ≤ x^t ≤ 1 and -1 ≤ y^t ≤ 1, compute the corresponding coordinate (x^s, y^s) in the source matrix, where -1 ≤ x^s ≤ 1 and -1 ≤ y^s ≤ 1, using

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix};$$

S22: map the coordinate (x^s, y^s), whose values lie in [-1, 1], back to a coordinate (X^s, Y^s) of the original matrix, using the formula [formula missing in source]; (X^t, Y^t) is obtained from (x^t, y^t) in the same way, where the source matrix M^s and the target matrix M^t have sizes (H^s, W^s) and (H^t, W^t) respectively;

S23: compute the value at coordinate (X^s, Y^s) by linear interpolation and use it as the value of the target matrix M^t at coordinate (X^t, Y^t).
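By way of illustration only, steps S21 to S23 can be sketched in NumPy as follows. The function name `crop_with_transform`, the mapping of [-1, 1] onto pixel indices 0 to size-1 in step S22, and the clamping of out-of-range coordinates are assumptions made for this sketch; the source does not specify them.

```python
import numpy as np

def crop_with_transform(feature_map, s_x, s_y, t_x, t_y, out_h=7, out_w=7):
    """Sample an (out_h, out_w) region from a (C, H, W) feature map.

    Hypothetical helper: scale (s_x, s_y) and translation (t_x, t_y) are the
    four parameters of the transformation matrix; rotation is fixed at zero.
    """
    channels, src_h, src_w = feature_map.shape
    # S21: normalized target grid in [-1, 1] mapped to normalized source coordinates.
    y_t, x_t = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                           np.linspace(-1.0, 1.0, out_w), indexing="ij")
    x_s = s_x * x_t + t_x
    y_s = s_y * y_t + t_y
    # S22 (assumed mapping): [-1, 1] -> [0, size - 1] pixel coordinates.
    X_s = (x_s + 1.0) * 0.5 * (src_w - 1)
    Y_s = (y_s + 1.0) * 0.5 * (src_h - 1)
    # Assumption: coordinates outside the feature map are clamped to its border.
    X_s = np.clip(X_s, 0.0, src_w - 1)
    Y_s = np.clip(Y_s, 0.0, src_h - 1)
    # S23: bilinear interpolation between the four neighbouring source cells.
    x0 = np.floor(X_s).astype(int)
    y0 = np.floor(Y_s).astype(int)
    x1 = np.minimum(x0 + 1, src_w - 1)
    y1 = np.minimum(y0 + 1, src_h - 1)
    wx = X_s - x0
    wy = Y_s - y0
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With s_x = s_y = 1 and t_x = t_y = 0 the sampled grid spans the whole feature map; smaller scale values shrink the sampled window and the translation shifts its center.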

The convolutional neural network used for feature extraction is VGG (Simonyan, ICLR 2015) with pool5 and all subsequent layers removed; its parameters are initialized from a model trained for single-label classification on the 1000 classes of the large-scale ImageNet dataset.

In step S1 the original image is resized to N×N (the method uses two scales, N = 512 and N = 640), and a crop of size (N-64)×(N-64) is taken as the input. The cropped sample is additionally flipped with probability 0.5. During training the crop position and the flip are random; during testing the (N-64)×(N-64) crops are taken at the four corners and the center of the N×N image, together with their flipped versions.

Further, the classification network of step S4 consists of one fully connected layer. Its input is the hidden state of the current time step and its output is a vector of length C, where C is the number of categories in the dataset; the vector scores how strongly the current region of interest belongs to each category.

Further, the localization network of step S4 consists of one fully connected layer. Its input is the hidden state of the current time step and its output is a vector of length 4 representing s_x, t_x, s_y and t_y, that is, only the scale and translation parameters; it is the prediction of the transformation matrix for the next region of interest.

Further, the fusion in step S5 is performed per category: the final score of each category is the highest score for that category over all regions of interest. Specifically, denote the score vectors of time steps 2 to K by {s_2, s_3, ..., s_K}, where $s_k = \{s_k^1, s_k^2, \dots, s_k^C\}$, and denote the fused final score vector by s = {s^1, s^2, ..., s^C}; then

$$s^c = \max(s_2^c, s_3^c, \dots, s_K^c), \quad c = 1, 2, \dots, C.$$
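As a minimal sketch of this category-wise fusion, assuming the score vectors of time steps 2 to K are stacked row by row into one array (the function name `fuse_scores` is hypothetical):

```python
import numpy as np

def fuse_scores(region_scores):
    """Fuse per-region score vectors by taking the per-category maximum.

    region_scores: array-like of shape (K - 1, C), one row per region of interest.
    Returns a vector of length C.
    """
    scores = np.asarray(region_scores, dtype=float)
    return scores.max(axis=0)

# Two regions, three categories: each category keeps its best-scoring region.
fused = fuse_scores([[0.1, 0.9, 0.2],
                     [0.8, 0.3, 0.4]])
```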

Compared with the prior art, the technical solution of the present invention has the following beneficial effects:

The multi-label image recognition framework proposed by the present invention does not depend on region proposals; it automatically finds semantically relevant regions of different scales in the image and at the same time captures the contextual dependencies between those regions. Three constraints are also proposed for the spatial transformer network. They help to localize regions that carry more semantic information and further improve the accuracy of multi-label image recognition. The invention improves the recognition accuracy for multi-label images and, to a large extent, the efficiency of recognition.

Brief description of the drawings

Fig. 1 is a diagram of the basic training and testing framework of the model of the present invention.

Fig. 2 is a schematic diagram of constraint 1 (the anchor constraint) on the spatial transformer network of the present invention.

Detailed description

The accompanying drawings are for illustration only and shall not be construed as limiting this patent;

To better illustrate the embodiment, some parts in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of the actual product;

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and an embodiment.

Embodiment 1

As shown in Fig. 1, a method for recognizing multi-label images by recurrently discovering regions of interest comprises steps S1 to S5 as set out above, with the following details:

The transformation matrix in step S2 has the form

$$\begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix},$$

where (s_x, s_y) denotes the scale transformation, the rotation entries (r_x, r_y) are fixed at zero, and (t_x, t_y) denotes the translation, whose values lie in [-1, 1]. Because the parameters of the transformation matrix describe only scaling and translation, the spatial transformer network crops one corresponding region from every channel of the global feature map, resizes it to a fixed size and outputs it.

The spatial transformer network is implemented as follows:

S22: map the coordinate (x^s, y^s), whose values lie in [-1, 1], back to a coordinate (X^s, Y^s) of the original matrix, using the formula [formula missing in source]; (X^t, Y^t) is obtained from (x^t, y^t) in the same way, where the source matrix M^s and the target matrix M^t have sizes (H^s, W^s) and (H^t, W^t) respectively;

The classification network of step S4 consists of one fully connected layer. Its input is the hidden state of the current time step and its output is a vector of length C, where C is the number of categories in the dataset; the vector scores how strongly the current region of interest belongs to each category.

The localization network of step S4 consists of one fully connected layer. Its input is the hidden state of the current time step and its output is a vector of length 4 representing s_x, t_x, s_y and t_y, that is, only the scale and translation parameters; it is the prediction of the transformation matrix for the next region of interest.

The fusion in step S5 is performed per category: the final score of each category is the highest score for that category over all regions of interest. Specifically, denote the score vectors of time steps 2 to K by {s_2, s_3, ..., s_K}, where $s_k = \{s_k^1, s_k^2, \dots, s_k^C\}$, and denote the fused final score vector by s = {s^1, s^2, ..., s^C}; then

$$s^c = \max(s_2^c, s_3^c, \dots, s_K^c), \quad c = 1, 2, \dots, C.$$

The technical solution is further elaborated below with reference to a specific implementation.

1. Data processing:

a) Training phase: all training samples are resized to N × N (N is 512 and 640 in the invention); a block of size (N-64) × (N-64) is cropped at a random position and flipped horizontally with probability 0.5 to form the final sample input.

b) Test phase: all test samples are resized to N × N (N is 512 and 640 in the invention); blocks of size (N-64) × (N-64) are cropped at the four corners and at the center, and these blocks together with their horizontally flipped copies are used as sample inputs. Each test sample therefore yields 10 cropped blocks and hence 10 classification results, which are averaged to give the final result for that test sample.
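The test-phase cropping can be sketched as below. This is only an illustration: it assumes the image has already been resized to N × N and is stored as a (height, width, channels) array, and the names `ten_crop` and `average_predictions` are hypothetical.

```python
import numpy as np

def ten_crop(image, margin=64):
    """Return the 10 test blocks: four corners and center, each with its flip.

    image: array of shape (N, N, channels), already resized to N x N.
    Each block has size (N - margin) x (N - margin).
    """
    n = image.shape[0]
    size = n - margin
    mid = margin // 2
    # (top, left) offsets of the four corner blocks and the center block.
    offsets = [(0, 0), (0, margin), (margin, 0), (margin, margin), (mid, mid)]
    crops = []
    for top, left in offsets:
        block = image[top:top + size, left:left + size]
        crops.append(block)
        crops.append(block[:, ::-1])  # horizontally flipped copy
    return crops

def average_predictions(per_crop_scores):
    """Average the 10 per-block score vectors into one result per sample."""
    return np.mean(np.asarray(per_crop_scores, dtype=float), axis=0)
```

For N = 512 each block is 448 × 448, and for N = 640 each block is 576 × 576.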

2. Convolutional neural network: used to extract the feature representation of the sample. It consists of 13 convolutional layers interleaved with max-pooling layers and rectified linear unit (ReLU) nonlinearity layers. Its parameters are initialized from a model trained on the large-scale ImageNet dataset.

3. Spatial transformer network: the algorithm is described in detail in steps S21 to S23. Note that in this embodiment the output size of the spatial transformer network is 7×7.

4. Long short-term memory network: the network structure comprises an input gate i_t, an output gate o_t and a forget gate f_t. The algorithm is as follows:

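$$\begin{aligned}
i_t &= \sigma(w_{ix} x_t + w_{im} m_{t-1} + b_i) \\
f_t &= \sigma(w_{fx} x_t + w_{fm} m_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(w_{cx} x_t + w_{cm} m_{t-1} + b_c) \\
o_t &= \sigma(w_{ox} x_t + w_{om} m_{t-1} + b_o) \\
m_t &= o_t \odot h(c_t)
\end{aligned}$$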

Here w and b denote the weights and biases respectively, and c_t and m_t denote the memory state and the hidden state of the current time step. σ denotes the sigmoid function, and g and h are usually tanh functions. ⊙ denotes the element-wise product of two vectors.

In addition, the hidden state and the memory state are both vectors of size 2048.
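The five update equations can be written out as one NumPy step, shown here as a sketch. It assumes g and h are both tanh and that the weights are stored in a dictionary keyed by names such as `w_ix`; the dictionary layout, the function names and the initialization scale are illustrative choices, not taken from the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_size, hidden_size, rng, scale=0.01):
    """Create Gaussian-initialized weights and zero biases for the gates i, f, c, o."""
    params = {}
    for gate in "ifco":
        params[f"w_{gate}x"] = rng.normal(0.0, scale, (hidden_size, input_size))
        params[f"w_{gate}m"] = rng.normal(0.0, scale, (hidden_size, hidden_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

def lstm_step(x_t, m_prev, c_prev, p):
    """One LSTM update: returns the new hidden state m_t and memory state c_t."""
    i_t = sigmoid(p["w_ix"] @ x_t + p["w_im"] @ m_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["w_fx"] @ x_t + p["w_fm"] @ m_prev + p["b_f"])  # forget gate
    candidate = np.tanh(p["w_cx"] @ x_t + p["w_cm"] @ m_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * candidate                            # memory state
    o_t = sigmoid(p["w_ox"] @ x_t + p["w_om"] @ m_prev + p["b_o"])  # output gate
    m_t = o_t * np.tanh(c_t)                                        # hidden state
    return m_t, c_t
```

In the embodiment the hidden and memory states would have size 2048; a small `hidden_size` is convenient for experimenting with the sketch.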

5. Classification network: consists of one fully connected layer. Its input is the hidden state of the current time step and its output is a vector of size C, where C is the number of categories in the dataset. Its parameters are initialized with Gaussian random numbers.

6. Localization network: consists of one fully connected layer. Its input is the hidden state of the current time step and its output is a vector of size 4 containing s_x, t_x, s_y and t_y, that is, the scale and translation parameters of the transformation matrix. Its parameters are initialized with Gaussian random numbers.
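A possible reading of the localization network is sketched below: one fully connected layer maps the hidden state to four numbers, which are then placed into a 2×3 matrix. The placement of the four values in the matrix (zero rotation entries, translation in the last column) follows the usual affine layout and is an assumption here, as is the function name `predict_transform`.

```python
import numpy as np

def predict_transform(hidden_state, weight, bias):
    """Fully connected layer followed by assembly of the transformation matrix.

    weight: (4, hidden_size), bias: (4,). Output order: s_x, t_x, s_y, t_y.
    """
    s_x, t_x, s_y, t_y = weight @ hidden_state + bias
    # Assumed layout: scale on the diagonal, rotation fixed at zero,
    # translation in the last column.
    return np.array([[s_x, 0.0, t_x],
                     [0.0, s_y, t_y]])

# Gaussian-initialized layer for a small hidden state of size 6.
rng = np.random.default_rng(0)
matrix = predict_transform(rng.normal(size=6),
                           rng.normal(0.0, 0.01, (4, 6)),
                           np.array([0.5, 0.0, 0.5, 0.0]))
```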

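7. Fusion: the category-wise maximum over the score vectors of all regions of interest, as defined for step S5.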

8. Classification loss function: the invention uses the Euclidean distance. Assume there are N training samples and that each training sample x_i has a label vector $y_i = \{y_i^1, y_i^2, \dots, y_i^C\}$, where C is the number of categories in the dataset. If the sample is annotated with category c, then $y_i^c = 1$; otherwise $y_i^c = 0$. The label probability vector is expressed as [formula missing in source]. Given the predicted probability vector p_i, [formula missing in source].

The final classification loss function L_cls can then be expressed as: [formula missing in source]
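Because the source formulas are not available, the following is only one plausible instantiation of a Euclidean-distance classification loss, not the patented formula. It assumes that the label probability vector is the label vector normalized to sum to 1, that the predicted probability vector is a softmax over the classification scores, and that the loss is the squared Euclidean distance averaged over the samples. All three choices, and the name `classification_loss`, are assumptions.

```python
import numpy as np

def classification_loss(scores, labels):
    """Mean squared Euclidean distance between predicted and label probabilities.

    scores: (N, C) raw classification scores.
    labels: (N, C) binary label vectors with at least one 1 per row.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assumption: label probability vector = label vector normalized to sum 1.
    target = labels / labels.sum(axis=1, keepdims=True)
    # Assumption: predicted probabilities = softmax of the scores
    # (shifted by the row maximum for numerical stability).
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return float(np.mean(np.sum((probs - target) ** 2, axis=1)))
```

Under these assumptions a sample annotated with several categories is fitted best when its probability mass is spread evenly over exactly those categories.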

9. Three constraints on the spatial transformer network:

a) Anchor constraint: so that the model localizes objects distributed over as many different positions of the image as possible and reduces redundancy, the invention proposes the anchor constraint shown in Fig. 2. If the model runs for K+1 time steps in total, K regions of interest can be localized. The first region of interest (blue point) is left unconstrained, while for learning the localization of the following K-1 regions of interest the invention provides the anchors shown in the figure (red points), which give the localization learning some guidance. The constraint l_A is expressed as follows: [formula missing in source]

where [symbol missing in source] denotes the translation in the transformation matrix predicted at time step k, and [symbol missing in source] is its corresponding anchor.

b) Scale constraint: when predicting the scale parameters of the transformation matrix, the scale should not become too large (an excessive scale means the region of interest tends towards the whole image). The invention therefore proposes the scale constraint: if the predicted scale parameters s_x and s_y exceed the parameter α (α = 0.5 in this embodiment), a penalty is applied. The constraint l_S is expressed as follows: [formula missing in source]

c) Positive constraint: if a scale parameter in the transformation matrix is negative, the cropped feature map is flipped left-right or upside down. To avoid the drop in recognition performance caused by this, the invention proposes the positive constraint: if the predicted scale parameters s_x and s_y are less than the parameter β (β = 0.1 in this embodiment), a penalty is applied. It is expressed as follows:

$$l_P = \max(0, \beta - s_x) + \max(0, \beta - s_y)$$
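The positive constraint is a hinge-style penalty on each scale parameter and can be evaluated directly; the function name `positive_constraint` is illustrative.

```python
def positive_constraint(s_x, s_y, beta=0.1):
    """Penalty that is zero when both scale parameters are at least beta."""
    return max(0.0, beta - s_x) + max(0.0, beta - s_y)

no_penalty = positive_constraint(0.5, 0.5)   # both scales above beta
penalized = positive_constraint(-0.1, 0.3)   # s_x is negative, so only s_x is penalized
```

With β = 0.1, the second call gives a penalty of 0.1 - (-0.1) = 0.2 for s_x and none for s_y.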

In summary, the localization loss function can be expressed as:

$$L_{loc} = l_S + \lambda_1 l_A + \lambda_2 l_P$$

where λ₁ and λ₂ are hyperparameters (λ₁ = 0.01 and λ₂ = 0.1 in this embodiment).

10. Total loss function of the model:

$$L = L_{cls} + \gamma L_{loc}$$

where γ is a hyperparameter (γ = 0.1 in this embodiment).
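To show how the weighting plays out numerically, the sketch below combines already-computed loss terms with the hyperparameters of the embodiment. The input values are made up for the example, and the function names are hypothetical.

```python
def localization_loss(l_s, l_a, l_p, lambda_1=0.01, lambda_2=0.1):
    """L_loc = l_S + lambda_1 * l_A + lambda_2 * l_P."""
    return l_s + lambda_1 * l_a + lambda_2 * l_p

def total_loss(l_cls, l_loc, gamma=0.1):
    """L = L_cls + gamma * L_loc."""
    return l_cls + gamma * l_loc

# Illustrative values only: l_S = 1.0, l_A = 2.0, l_P = 3.0, L_cls = 0.5.
l_loc = localization_loss(1.0, 2.0, 3.0)   # 1.0 + 0.02 + 0.3
loss = total_loss(0.5, l_loc)              # 0.5 + 0.1 * 1.32
```

With these weights the scale constraint enters the localization loss unscaled, while the anchor and positive constraints contribute with factors 0.01 and 0.1 respectively.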

Identical or similar reference numerals denote identical or similar parts;

The positional relationships described in the drawings are for illustration only and shall not be construed as limiting this patent;

Obviously, the above embodiment is merely an example given to illustrate the present invention clearly and does not limit the embodiments of the present invention. Those of ordinary skill in the art may make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A method for recognizing multi-label images by recurrently discovering regions of interest, characterized by comprising the following steps:

S1: extracting a feature representation of the sample with a convolutional neural network;

S2: using the transformation matrix predicted at the previous time step, cropping the region of interest from the feature map obtained in step S1 by means of a spatial transformer network;

S3: feeding the region of interest into a long short-term memory (LSTM) unit, which generates the hidden state and memory state of the current time step from the input and from the hidden state and memory state of the previous time step;

S4: predicting the classification score vector of the region of interest from the hidden state of the current time step, and predicting the transformation matrix required by the spatial transformer network at the next time step;

S5: repeating steps S2 to S4 until time step K, and fusing the score vectors predicted at time steps 2 to K to obtain the final classification result for the image.

2. The method for recognizing multi-label images by recurrently discovering regions of interest according to claim 1, characterized in that the transformation matrix in step S2 has the form

$$\begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix},$$

where (s_x, s_y) denotes the scale transformation, the rotation entries (r_x, r_y) are fixed at zero, and (t_x, t_y) denotes the translation, whose values lie in [-1, 1]; because the parameters of the transformation matrix describe only scaling and translation, the spatial transformer network crops one corresponding region from every channel of the global feature map, resizes it to a fixed size and outputs it.

3. The method for recognizing multi-label images by recurrently discovering regions of interest according to claim 2, characterized in that the spatial transformer network is implemented as follows:

S21: for a known target-matrix coordinate (x^t, y^t), where -1 ≤ x^t ≤ 1 and -1 ≤ y^t ≤ 1, compute the corresponding coordinate (x^s, y^s) in the source matrix, where -1 ≤ x^s ≤ 1 and -1 ≤ y^s ≤ 1, using

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix}.$$

4. The method for recognizing multi-label images by recurrently discovering regions of interest according to claim 3, characterized in that the classification network of step S4 consists of one fully connected layer, whose input is the hidden state of the current time step and whose output is a vector of length C, where C is the number of categories in the dataset; the vector scores how strongly the current region of interest belongs to each category.

5. The method for recognizing multi-label images by recurrently discovering regions of interest according to claim 4, characterized in that the localization network of step S4 consists of one fully connected layer, whose input is the hidden state of the current time step and whose output is a vector of length 4 representing s_x, t_x, s_y and t_y, that is, only the scale and translation parameters; it is the prediction of the transformation matrix for the next region of interest.

6. The method for recognizing multi-label images by recurrently discovering regions of interest according to claim 5, characterized in that the fusion in step S5 is performed per category, the final score of each category being the highest score for that category over all regions of interest; specifically, denote the score vectors of time steps 2 to K by {s_2, s_3, ..., s_K}, where $s_k = \{s_k^1, s_k^2, \dots, s_k^C\}$, and denote the fused final score vector by s = {s^1, s^2, ..., s^C}; then

$$s^c = \max(s_2^c, s_3^c, \dots, s_K^c), \quad c = 1, 2, \dots, C.$$