CN108681752A - An image scene labeling method based on deep learning - Google Patents
An image scene labeling method based on deep learning
- Publication number
- CN108681752A (application CN201810525276.XA / CN201810525276A)
- Authority
- CN
- China
- Prior art keywords
- image
- scene
- region
- similarity
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an image scene labeling method based on deep learning, comprising four stages: building a scene image dataset, constructing a convolutional neural network, training the model, and labeling images. The scene image dataset is used to train and test the deep-learning scene recognition model; the construction stage builds a convolutional neural network model for scene recognition; the training stage obtains the scene recognition model by training the convolutional neural network; the labeling stage feeds an image to the trained model to obtain the scene label words for that image. The invention addresses the shortcomings of existing image scene labeling and improves its accuracy.
Description
Technical field
The present invention relates to artificial intelligence and pattern recognition, and more particularly to an image scene recognition method based on deep learning.
Background technology
Image scene recognition is an important research topic in machine vision; its goal is to use computers to automatically recognize and understand the scene information in images. With the spread of image data on the Internet, websites must process massive volumes of image data and use computers to automatically understand and classify images, and scene recognition technology plays a highly important role in such applications.
Because of its wide application prospects, scene recognition has long attracted many researchers. Internationally, Li Fei-Fei et al. proposed a mid-level semantic method that combines a visual bag-of-words with a latent Dirichlet allocation model for scene recognition; Aude Oliva emphasized the importance of global features and proposed the spatial-envelope model, which performs scene recognition with global features; Lazebnik et al. improved the traditional visual bag-of-words by adding spatial information, proposing the spatial pyramid matching method; Bolei Zhou et al. attacked scene recognition with deep learning, using a Places-CNN trained on a scene dataset, and achieved good results. Domestically, Jiang Yue et al. performed scene recognition with an improved spatial pyramid matching method; Qian Kui et al. combined scene recognition with robotics and achieved good practical results; Ren Yi et al. improved the traditional latent Dirichlet allocation model, raising the efficiency of scene recognition.
Traditional scene recognition methods generally use low-level or high-level features. These methods are simple and practical, logically clear, and match human intuition. But once the data to be processed reach a certain scale and the number of scene classes grows large, traditional low-level and high-level features can no longer represent so much scene information. Traditional methods therefore increasingly face a bottleneck, especially on large-scale datasets. Deep-learning methods, in contrast, are well suited to this problem: their rapid development has been driven precisely by the surge in data volume, since deep networks need large amounts of data for training and thereby form complex and powerful network architectures. Existing deep-learning image scene recognition techniques already achieve good accuracy, but recognition precision can still be improved.
Summary of the invention
To overcome the insufficient accuracy of existing image scene labeling technology, the present invention proposes an image scene recognition method based on deep learning. Using an image scene recognition algorithm built on a modern deep-learning network architecture, it can improve the precision of image scene label recognition.
Specifically, an image scene recognition method based on deep learning comprises the following steps:
S1. Build the scene image dataset: establish a dataset of image samples covering a rich set of scenes, in which every image sample carries an accurate scene label and each scene class contains N image samples; generate the training image set;
S2. Build the convolutional neural network model: construct a convolutional neural network model composed of a feature extraction module, a candidate region generation module, a global region scoring module, a key region selection module, and a candidate region tuning module;
S3. Train the model: initialize the parameters of the convolutional neural network model with the parameters of another pre-trained model, then fine-tune them on the training image set with the back-propagation (BP) algorithm and mini-batch gradient descent, iterating until the model parameters with the minimum test error are obtained;
S4. Label images: feed the image to be labeled into the trained model to obtain the scene label word for the image, and write the word into the image's attributes.
Preferably, step S1 comprises the following sub-steps:
S11. Preprocess the scene image samples; preprocessing comprises data type conversion, histogram equalization, normalization, geometric correction, and sharpening;
S12. Randomly select 80% of the image samples to form the training image set used to train the model; the remaining 20% of the image samples are used for model testing, to measure the model's recognition accuracy on each scene image.
Preferably, step S2 comprises the following sub-steps:
S21. The feature extraction module uses the VGG16 model as the image feature extraction network and extracts the image features, yielding the feature map of the image;
S22. The candidate region generation module segments the image into n regions using graph-based image segmentation, forming the region set R, and computes the similarity S(r_g, r_j) of every two adjacent regions in R as S(r_g, r_j) = ω1·S_color(r_g, r_j) + ω2·S_texture(r_g, r_j) + ω3·S_size(r_g, r_j) + ω4·S_fill(r_g, r_j), where r_g and r_j are regions g and j of the region set R, S_color is the color similarity, S_texture the texture similarity, S_size the size similarity, and S_fill the overlap (fill) similarity; ω1, ω2, ω3, ω4 are weights with ω1 + ω2 + ω3 + ω4 = 1. Then, according to the pairwise similarities, the two regions with the highest similarity are preferentially merged, repeatedly, until the whole image is one region; every region that appeared during merging becomes a candidate region of the image. Each image yields more than 2000 candidate regions (RoIs), whose positions and sizes are saved to a file;
S23. The global region scoring module passes the entire feature map obtained in step S21 through two fully connected layers and two ReLU activation functions to obtain the feature vector of the global region, and computes the scores with which the global region belongs to each scene class;
S24. The key region selection module selects the candidate regions whose size is at least a specified fraction β of the global region, obtains their feature vectors through two fully connected layers and two ReLU activation functions, and computes the scores with which each selected candidate region belongs to each scene class; the highest-scoring candidate regions are chosen as key regions and their scores are added to the global region's scores to obtain the image's per-scene scores; the probability that the image belongs to each scene class is then computed with the Softmax regression function, and the predicted scene is the class with the highest probability;
S25. The candidate region tuning module takes the scene class of each key region together with the region position and size obtained in step S22, obtains a feature vector through two fully connected layers and two ReLU activation functions, and feeds it through one more fully connected layer into a bounding-box regression function that adjusts the position and size of the candidate box.
Preferably, the color similarity is computed as follows: the histogram of each color channel of regions g and j is computed over 25 intervals, so the color histogram of each region has 25×3 = 75 bins; after each histogram value is divided by the region size (normalization), the color similarity of the two regions is computed with the formula S_color(r_g, r_j) = Σ_{k=1}^{m} min(c_g^k, c_j^k), where c_g^k and c_j^k are the normalized values of the k-th bin of the color histograms of regions g and j, and m = 75.
Preferably, the texture similarity is computed as follows: for each color channel of regions g and j, gradient statistics are taken in 8 directions using a Gaussian distribution with variance 1; each direction's gradient histogram is computed over 10 intervals, so the gradient histogram of each region has 8×3×10 = 240 bins; the texture similarity is then computed with the formula S_texture(r_g, r_j) = Σ_{k=1}^{l} min(t_g^k, t_j^k), where t_g^k and t_j^k are the values of the k-th bin of the gradient histograms of regions g and j, and l = 240.
Preferably, the size similarity is computed with the formula S_size(r_g, r_j) = 1 − (size(r_g) + size(r_j))/size(im), where size(r_g) and size(r_j) are the areas of regions g and j and size(im) is the area of the whole image.
Preferably, the overlap similarity is computed with the formula S_fill(r_g, r_j) = 1 − (size(B_gj) − size(r_g) − size(r_j))/size(im), where size(B_gj) is the area of the minimum bounding box of regions g and j, size(r_g) and size(r_j) are the areas of regions g and j, and size(im) is the area of the whole image.
Preferably, step S3 comprises the following sub-steps:
S31. Use the VGG-16 model parameters to initialize the parameters of every hidden layer and the output layer of the convolutional neural network model;
S32. Each batch inputs m pictures, and each layer's input and output are computed with that layer's formula. At a fully connected layer the input of the hidden layer is computed with the formula σ(W^ι a^{i,ι−1} + b^ι), where σ is the activation function, W the weight parameter, a the input vector, ι the layer index, b the bias parameter, and i the index of the i-th picture; at a convolutional layer the hidden layer's input is computed with the same formula as the fully connected layer; at a pooling layer the next layer's input is computed as pool(a^{i,ι−1}), where pool is the pooling function, until the output of the whole network is obtained;
S33. The gradient error of the whole network is computed with the loss function L(B) = −(1/M) Σ_i log P(s = L_i | I_i, r_i), where B = {L_i, I_i, r_i} denotes a batch of training data, L_i is the true label of image I_i, P(s = L_i | I_i, r_i) is the probability that the i-th candidate region r_i belongs to scene s, and M is the number of pictures in the batch;
S34. The gradient error is propagated backward layer by layer to correct the weight and bias parameters. When updating each layer's parameters, a fully connected layer uses the formulas W^ι = W^ι − (α/m) Σ_i δ^{i,ι}(a^{i,ι−1})^T and b^ι = b^ι − (α/m) Σ_i δ^{i,ι} to compute the new weights and biases, where δ is the gradient error, α is the learning rate, a is the input vector, m is the number of training images in the batch, and i indexes the i-th image; a convolutional layer uses the formulas W^ι = W^ι − (α/m) Σ_i δ^{i,ι} ∗ rot180(a^{i,ι−1}) and b^ι = b^ι − (α/m) Σ_i Σ_{u,v} (δ^{i,ι})_{u,v}, where u, v index the entries of the sub-matrices of δ^i and rot180 denotes rotating a matrix by 180 degrees; iteration stops when the adjustment falls below a threshold.
Preferably, step S4 comprises the following sub-steps:
S41. Preprocess the image to be labeled as the input of the image scene recognition model;
S42. Obtain the model's label word for the highest-scoring scene class of the input image;
S43. Write the label word into the image.
The beneficial effects of the present invention are as follows: to address the insufficient accuracy with which computers classify and label image scenes, the proposed method uses an image scene recognition algorithm built on a modern deep-learning network architecture and can markedly improve the precision of image scene label recognition.
Description of the drawings
Fig. 1 is a flow chart of the image scene recognition method based on deep learning proposed by the present invention.
Fig. 2 is a schematic flow chart of building the convolutional neural network model.
Fig. 3 is a schematic flow chart of training the convolutional neural network model.
Detailed description of embodiments
To make the technical features, objects, and effects of the present invention clearer, specific embodiments of the invention are now described with reference to the drawings.
As shown in Fig. 1, an embodiment of the proposed image scene recognition method based on deep learning comprises the following steps:
S1. Build the scene image dataset: establish a dataset of image samples covering a rich set of scenes, in which every image sample carries an accurate scene label and each scene class contains N image samples; generate the training image set;
S2. Build the convolutional neural network model: construct a convolutional neural network model composed of a feature extraction module, a candidate region generation module, a global region scoring module, a key region selection module, and a candidate region tuning module;
S3. Train the model: initialize the parameters of the convolutional neural network model with the parameters of another pre-trained model, then fine-tune them on the training image set with the BP algorithm and mini-batch gradient descent, iterating until the model parameters with the minimum test error are obtained;
S4. Label images: feed the image to be labeled into the trained model to obtain the scene label word for the image, and write the word into the image's attributes.
As a preferred embodiment, step S1 comprises the following sub-steps:
S11. Preprocess the scene image samples; preprocessing comprises data type conversion, histogram equalization, normalization, geometric correction, and sharpening. Because the quality of the scene images affects the recognition performance of the model, the images are preprocessed before the model is trained.
S12. Randomly select 80% of the image samples to form the training image set used to train the model; the remaining 20% of the image samples are used for model testing, to measure the model's recognition accuracy on each scene image.
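The random 80/20 split in S12 can be sketched with Python's standard library (the function name and fixed seed are illustrative, not part of the patent):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Randomly split labeled scene images into training and test sets (S12)."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy sample list: (filename, scene label) pairs.
samples = [("img_%d.jpg" % i, "beach" if i % 2 else "forest") for i in range(100)]
train_set, test_set = split_dataset(samples)
print(len(train_set), len(test_set))  # 80 20
```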
As a preferred embodiment, step S2 comprises the following sub-steps:
S21. The feature extraction module uses the VGG16 model as the image feature extraction network and extracts the image features, yielding the feature map of the image.
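The VGG16 extractor in S21 reduces an image to a stack of feature maps. As a dependency-free illustration of the basic operation behind such a map, here is a single valid-mode 2-D convolution in plain Python (VGG16 itself stacks many such layers with ReLU and pooling; the 3×3 kernel below is illustrative only):

```python
def conv2d_valid(image, kernel):
    """Valid-mode 2-D convolution: the basic operation behind a CNN feature map."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(H - kh + 1):
        row = []
        for x in range(W - kw + 1):
            s = sum(image[y + i][x + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 4x4 "image" and a 3x3 kernel give a 2x2 feature map.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # identity kernel: each output equals the centre pixel
fmap = conv2d_valid(img, kernel)
print(fmap)  # [[6, 7], [10, 11]]
```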
S22. The candidate region generation module segments the image into n regions using graph-based image segmentation, forming the region set R, and computes the similarity S(r_g, r_j) of every two adjacent regions in R as S(r_g, r_j) = ω1·S_color(r_g, r_j) + ω2·S_texture(r_g, r_j) + ω3·S_size(r_g, r_j) + ω4·S_fill(r_g, r_j), where r_g and r_j are regions g and j of the region set R, S_color is the color similarity, S_texture the texture similarity, S_size the size similarity, and S_fill the overlap (fill) similarity; ω1, ω2, ω3, ω4 are weights with ω1 + ω2 + ω3 + ω4 = 1. When computing the similarity S(r_i, r_j) of two adjacent regions, the color similarity S_color(r_i, r_j) is computed as follows: the histogram of each color channel of regions i and j is computed over 25 intervals, so the color histogram of each region has 25×3 = 75 bins; after each histogram value is divided by the region size (normalization), the color similarity of the two regions is S_color(r_i, r_j) = Σ_{k=1}^{m} min(c_i^k, c_j^k), where c_i^k and c_j^k are the normalized values of the k-th bin of the color histograms of regions i and j, and m = 75.
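The color-histogram intersection just described can be sketched as follows (25 bins per channel, histogram values divided by the region size; the helper names and toy pixel values are illustrative):

```python
def color_histogram(pixels, bins=25, vmax=256):
    """Per-channel histogram over `bins` intervals, normalised by region size."""
    hist = [0.0] * (bins * 3)
    for r, g, b in pixels:
        for c, value in enumerate((r, g, b)):
            hist[c * bins + value * bins // vmax] += 1.0
    n = len(pixels)
    return [v / n for v in hist]   # divide by region size (normalisation)

def s_color(hist_g, hist_j):
    """S_color(r_g, r_j) = sum_k min(c_g^k, c_j^k) over the m = 75 bins."""
    return sum(min(a, b) for a, b in zip(hist_g, hist_j))

region_g = [(10, 200, 30)] * 4                        # uniform toy region
region_j = [(10, 200, 30)] * 2 + [(250, 0, 0)] * 2    # half-matching region
hg, hj = color_histogram(region_g), color_histogram(region_j)
print(s_color(hg, hj))   # half the mass matches in each channel -> 1.5
```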
The texture similarity S_texture(r_i, r_j) is computed as follows: for each color channel of regions i and j, gradient statistics are taken in 8 directions using a Gaussian distribution with variance 1; each direction's gradient histogram is computed over 10 intervals, so the gradient histogram of each region has 8×3×10 = 240 bins; the texture similarity is then S_texture(r_i, r_j) = Σ_{k=1}^{l} min(t_i^k, t_j^k), where t_i^k and t_j^k are the values of the k-th bin of the gradient histograms of regions i and j, and l = 240.
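Producing the full descriptor needs Gaussian-derivative filtering, but the similarity itself is again a histogram intersection, this time over l = 240 gradient bins (8 directions × 3 channels × 10 intervals). A sketch of just the intersection step, with random normalised stand-in histograms in place of real gradient statistics:

```python
import random

DIRECTIONS, CHANNELS, BINS = 8, 3, 10
L = DIRECTIONS * CHANNELS * BINS   # 240 bins, as in the patent

def s_texture(hist_g, hist_j):
    """S_texture(r_g, r_j) = sum_k min(t_g^k, t_j^k) over the 240 bins."""
    return sum(min(a, b) for a, b in zip(hist_g, hist_j))

def normalised(values):
    total = sum(values)
    return [v / total for v in values]

rng = random.Random(0)
hist_g = normalised([rng.random() for _ in range(L)])
hist_j = normalised([rng.random() for _ in range(L)])
sim = s_texture(hist_g, hist_j)
assert 0.0 <= sim <= 1.0                    # intersection of unit histograms
print(s_texture(hist_g, hist_g))            # identical regions -> 1.0
```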
The size similarity S_size(r_i, r_j) is computed with the formula S_size(r_i, r_j) = 1 − (size(r_i) + size(r_j))/size(im), where size(r_i) and size(r_j) are the areas of regions i and j and size(im) is the area of the whole image.
The overlap similarity S_fill(r_i, r_j) is computed with the formula S_fill(r_i, r_j) = 1 − (size(B_ij) − size(r_i) − size(r_j))/size(im), where size(B_ij) is the area of the minimum bounding box of regions i and j, size(r_i) and size(r_j) are the areas of regions i and j, and size(im) is the area of the whole image.
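The size and fill measures above, and the weighted combination S = ω1·S_color + ω2·S_texture + ω3·S_size + ω4·S_fill, translate directly into code (areas are in pixels; the equal weights are illustrative, the patent only requires them to sum to 1):

```python
def s_size(size_g, size_j, size_im):
    """Small regions merge first: 1 - (size(r_g) + size(r_j)) / size(im)."""
    return 1.0 - (size_g + size_j) / size_im

def s_fill(size_g, size_j, size_bbox, size_im):
    """Well-fitting regions merge first:
    1 - (size(B_gj) - size(r_g) - size(r_j)) / size(im)."""
    return 1.0 - (size_bbox - size_g - size_j) / size_im

def combined(sc, st, ss, sf, w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four similarities; weights must sum to 1."""
    assert abs(sum(w) - 1.0) < 1e-9
    return w[0] * sc + w[1] * st + w[2] * ss + w[3] * sf

im = 10000                          # whole-image area
print(s_size(400, 600, im))         # 0.9
print(s_fill(400, 600, 1000, im))   # tight bounding box -> 1.0
print(combined(0.5, 0.5, 0.9, 1.0)) # 0.725
```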
Then, according to the pairwise similarities, the two regions with the highest similarity are preferentially merged, repeatedly, until the whole image is one region; every region that appeared during merging becomes a candidate region of the image. Each image yields more than 2000 candidate regions (RoIs), whose positions and sizes are saved to a file.
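The merge loop described above (repeatedly fuse the most similar pair, keeping every region ever seen as a candidate) can be sketched compactly. This toy version omits the adjacency bookkeeping the patent requires and uses an invented similarity function, so it is a sketch of the merging strategy only:

```python
def greedy_merge(regions, similarity):
    """Repeatedly merge the most similar pair until one region (the whole
    image) remains; every region that ever existed becomes a candidate (RoI)."""
    regions = [frozenset(r) for r in regions]
    candidates = list(regions)
    while len(regions) > 1:
        # pick the pair with the highest similarity
        i, j = max(((i, j) for i in range(len(regions))
                           for j in range(i + 1, len(regions))),
                   key=lambda ij: similarity(regions[ij[0]], regions[ij[1]]))
        merged = regions[i] | regions[j]
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        candidates.append(merged)
    return candidates

# Toy similarity: regions of similar size are more similar (echoes S_size).
sim = lambda a, b: -abs(len(a) - len(b))
rois = greedy_merge([{1}, {2}, {3, 4}], sim)
print(len(rois))   # 3 initial + 2 merged = 5 candidate regions
```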
S23. The global region scoring module passes the entire feature map obtained in step S21 through two fully connected layers and two ReLU activation functions to obtain the feature vector of the global region, and computes the scores with which the global region belongs to each scene class.
S24. The key region selection module selects the candidate regions whose size is at least a specified fraction β of the global region, obtains their feature vectors through two fully connected layers and two ReLU activation functions, and computes the scores with which each selected candidate region belongs to each scene class; the highest-scoring candidate regions are chosen as key regions and their scores are added to the global region's scores to obtain the image's per-scene scores; the probability that the image belongs to each scene class is then computed with the Softmax regression function, and the predicted scene is the class with the highest probability.
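Steps S23 and S24 add the scores of the selected key regions to the global-region scores and pass the result through Softmax. A toy sketch, in which β = 0.3, the choice of two key regions, and all raw scores are invented for illustration:

```python
import math

def softmax(scores):
    """Softmax regression over per-class scores -> class probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(global_scores, roi_scores, roi_areas, global_area,
             beta=0.3, top_k=2):
    """Keep RoIs covering at least beta of the global area, take the top_k
    best-scoring ones, add their scores to the global scores, then Softmax."""
    big = [s for s, a in zip(roi_scores, roi_areas) if a >= beta * global_area]
    key = sorted(big, key=max, reverse=True)[:top_k]
    combined = [g + sum(k[c] for k in key) for c, g in enumerate(global_scores)]
    return softmax(combined)

# Two scene classes; each region's scores are [class0, class1].
probs = classify(global_scores=[1.0, 2.0],
                 roi_scores=[[0.5, 1.5], [2.0, 0.1], [0.0, 0.2]],
                 roi_areas=[50, 40, 5], global_area=100)
print(probs.index(max(probs)))   # index of the predicted scene class
```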
S25. The candidate region tuning module takes the scene class of each key region together with the region position and size obtained in step S22, obtains a feature vector through two fully connected layers and two ReLU activation functions, and feeds it through one more fully connected layer into a bounding-box regression function that adjusts the position and size of the candidate box.
The flow of building the convolutional neural network model is shown schematically in Fig. 2.
As a preferred embodiment, step S3 comprises the following sub-steps:
S31. Use the VGG-16 model parameters to initialize the parameters of every hidden layer and the output layer of the convolutional neural network model;
S32. Each batch inputs m pictures, and each layer's input and output are computed with that layer's formula. At a fully connected layer the input of the hidden layer is computed with the formula σ(W^ι a^{i,ι−1} + b^ι), where σ is the activation function, W the weight parameter, a the input vector, ι the layer index, b the bias parameter, and i the index of the i-th picture; at a convolutional layer the hidden layer's input is computed with the same formula as the fully connected layer; at a pooling layer the next layer's input is computed as pool(a^{i,ι−1}), where pool is the pooling function, until the output of the whole network is obtained;
S33. The gradient error of the whole network is computed with the loss function L(B) = −(1/M) Σ_i log P(s = L_i | I_i, r_i), where B = {L_i, I_i, r_i} denotes a batch of training data, L_i is the true label of image I_i, P(s = L_i | I_i, r_i) is the probability that the i-th candidate region r_i belongs to scene s, and M is the number of pictures in the batch;
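The batch loss in S33 is an average negative log-likelihood over the M images of the batch. A minimal sketch, where probs[i] stands for the probability the network assigns to image i's true scene label:

```python
import math

def batch_loss(probs):
    """L(B) = -(1/M) * sum_i log P(s = L_i | I_i, r_i)."""
    M = len(probs)
    return -sum(math.log(p) for p in probs) / M

# A confident batch has a low loss, an unsure batch a higher one.
print(round(batch_loss([0.9, 0.8, 0.95]), 3))  # 0.127
print(round(batch_loss([0.4, 0.3, 0.5]), 3))   # 0.938
```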
S34. The gradient error is propagated backward layer by layer to correct the weight and bias parameters. When updating each layer's parameters, a fully connected layer uses the formulas W^ι = W^ι − (α/m) Σ_i δ^{i,ι}(a^{i,ι−1})^T and b^ι = b^ι − (α/m) Σ_i δ^{i,ι} to compute the new weights and biases, where δ is the gradient error, α is the learning rate, a is the input vector, m is the number of training images in the batch, and i indexes the i-th image; a convolutional layer uses the formulas W^ι = W^ι − (α/m) Σ_i δ^{i,ι} ∗ rot180(a^{i,ι−1}) and b^ι = b^ι − (α/m) Σ_i Σ_{u,v} (δ^{i,ι})_{u,v}, where u, v index the entries of the sub-matrices of δ^i and rot180 denotes rotating a matrix by 180 degrees; iteration stops when the adjustment falls below a threshold.
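Assuming the standard mini-batch rule for the fully connected case, W ← W − (α/m) Σ_i δ^i (a^i)^T and b ← b − (α/m) Σ_i δ^i (the convolutional case replaces the outer product with a rot180 convolution), the update can be sketched in plain Python with small toy matrices:

```python
def update_fc_layer(W, b, deltas, activations, alpha):
    """Mini-batch update for one fully connected layer:
    W -= (alpha/m) * sum_i delta_i * a_i^T,  b -= (alpha/m) * sum_i delta_i."""
    m = len(deltas)
    rows, cols = len(W), len(W[0])
    for i in range(m):
        d, a = deltas[i], activations[i]
        for r in range(rows):
            b[r] -= alpha / m * d[r]
            for c in range(cols):
                W[r][c] -= alpha / m * d[r] * a[c]
    return W, b

W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.0]
deltas = [[1.0, -1.0], [1.0, -1.0]]        # gradient errors for 2 images
activations = [[1.0, 0.0], [0.0, 1.0]]     # previous-layer outputs
W, b = update_fc_layer(W, b, deltas, activations, alpha=0.1)
print(W, b)
```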
The flow of training the convolutional neural network model is shown in Fig. 3.
As a preferred embodiment, step S4 comprises the following sub-steps:
S41. Preprocess the image to be labeled as the input of the image scene recognition model;
S42. Obtain the model's label word for the highest-scoring scene class of the input image;
S43. Write the label word into the image.
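The three labeling sub-steps can be sketched end to end. Writing the word into the image's attributes is shown here as a sidecar metadata dict, and the model call is a toy stand-in (real implementations might score with the trained CNN and write EXIF/XMP metadata; all names and scores below are illustrative):

```python
def preprocess(image):
    """S41: placeholder normalisation of the raw pixel values."""
    return [p / 255.0 for p in image]

def model_scores(features):
    """Stand-in for the trained scene-recognition model (S42)."""
    classes = ["beach", "forest", "city"]
    brightness = sum(features) / len(features)
    scores = [brightness, 1.0 - brightness, 0.5]   # toy per-class scores
    return classes, scores

def annotate(image, metadata):
    """S41-S43: preprocess, score, and write the best label into the metadata."""
    classes, scores = model_scores(preprocess(image))
    best = classes[scores.index(max(scores))]
    metadata["scene"] = best       # S43: written into the image's attributes
    return metadata

meta = annotate([240, 250, 230, 245], {})
print(meta["scene"])   # bright toy image -> "beach"
```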
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of action combinations; those skilled in the art should understand, however, that the present application is not limited by the described order of actions, since according to the present application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and units involved are not necessarily required by the present application.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that all or part of the flows in the above method embodiments can be implemented by a computer program instructing related hardware; the program can be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a ROM, a RAM, etc.
The above disclosure describes only preferred embodiments of the present invention and of course cannot limit the scope of the claims; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope of the invention.
Claims (9)
1. An image scene recognition method based on deep learning, characterized by comprising the following steps:
S1. Build the scene image dataset: establish a dataset of image samples covering a rich set of scenes, in which every image sample carries an accurate scene label and each scene class contains N image samples; generate the training image set;
S2. Build the convolutional neural network model: construct a convolutional neural network model composed of a feature extraction module, a candidate region generation module, a global region scoring module, a key region selection module, and a candidate region tuning module;
S3. Train the model: initialize the parameters of the convolutional neural network model with the parameters of another pre-trained model, then fine-tune them on the training image set with the BP algorithm and mini-batch gradient descent, iterating until the model parameters with the minimum test error are obtained;
S4. Label images: feed the image to be labeled into the trained model to obtain the scene label word for the image, and write the word into the image's attributes.
2. The image scene recognition method based on deep learning of claim 1, characterized in that step S1 comprises the following sub-steps:
S11. Preprocess the scene image samples; preprocessing comprises data type conversion, histogram equalization, normalization, geometric correction, and sharpening;
S12. Randomly select 80% of the image samples to form the training image set used to train the model; the remaining 20% of the image samples are used for model testing, to measure the model's recognition accuracy on each scene image.
3. The image scene recognition method based on deep learning of claim 1, characterized in that step S2 comprises the following sub-steps:
S21. The feature extraction module uses the VGG16 model as the image feature extraction network and extracts the image features, yielding the feature map of the image;
S22. The candidate region generation module segments the image into n regions using graph-based image segmentation, forming the region set R, and computes the similarity S(r_g, r_j) of every two adjacent regions in R as S(r_g, r_j) = ω1·S_color(r_g, r_j) + ω2·S_texture(r_g, r_j) + ω3·S_size(r_g, r_j) + ω4·S_fill(r_g, r_j), where r_g and r_j are regions g and j of the region set R, S_color is the color similarity, S_texture the texture similarity, S_size the size similarity, and S_fill the overlap (fill) similarity; ω1, ω2, ω3, ω4 are weights with ω1 + ω2 + ω3 + ω4 = 1. Then, according to the pairwise similarities, the two regions with the highest similarity are preferentially merged, repeatedly, until the whole image is one region; every region that appeared during merging becomes a candidate region of the image. Each image yields more than 2000 candidate regions (RoIs), whose positions and sizes are saved to a file;
S23. The global region scoring module passes the entire feature map obtained in step S21 through two fully connected layers and two ReLU activation functions to obtain the feature vector of the global region, and computes the scores with which the global region belongs to each scene class;
S24. The key region selection module selects the candidate regions whose size is at least a specified fraction β of the global region, obtains their feature vectors through two fully connected layers and two ReLU activation functions, and computes the scores with which each selected candidate region belongs to each scene class; the highest-scoring candidate regions are chosen as key regions and their scores are added to the global region's scores to obtain the image's per-scene scores; the probability that the image belongs to each scene class is then computed with the Softmax regression function, and the predicted scene is the class with the highest probability;
S25. The candidate region tuning module takes the scene class of each key region together with the region position and size obtained in step S22, obtains a feature vector through two fully connected layers and two ReLU activation functions, and feeds it through one more fully connected layer into a bounding-box regression function that adjusts the position and size of the candidate box.
4. The image scene recognition method based on deep learning of claim 3, characterized in that the color similarity is computed as follows: the histogram of each color channel of regions g and j is computed over 25 intervals, so the color histogram of each region has 25×3 = 75 bins; after each histogram value is divided by the region size (normalization), the color similarity of the two regions is computed with the formula S_color(r_g, r_j) = Σ_{k=1}^{m} min(c_g^k, c_j^k), where c_g^k and c_j^k are the normalized values of the k-th bin of the color histograms of regions g and j, and m = 75.
5. The image scene recognition method based on deep learning of claim 3, characterized in that the texture similarity is computed as follows: for each color channel of regions g and j, gradient statistics are taken in 8 directions using a Gaussian distribution with variance 1; each direction's gradient histogram is computed over 10 intervals, so the gradient histogram of each region has 8×3×10 = 240 bins; the texture similarity is then computed with the formula S_texture(r_g, r_j) = Σ_{k=1}^{l} min(t_g^k, t_j^k), where t_g^k and t_j^k are the values of the k-th bin of the gradient histograms of regions g and j, and l = 240.
6. The image scene recognition method based on deep learning of claim 3, characterized in that the size similarity is computed with the formula S_size(r_g, r_j) = 1 − (size(r_g) + size(r_j))/size(im), where size(r_g) and size(r_j) are the areas of regions g and j and size(im) is the area of the whole image.
7. The image scene recognition method based on deep learning of claim 3, characterized in that the overlap similarity is computed with the formula S_fill(r_g, r_j) = 1 − (size(B_gj) − size(r_g) − size(r_j))/size(im), where size(B_gj) is the area of the minimum bounding box of regions g and j, size(r_g) and size(r_j) are the areas of regions g and j, and size(im) is the area of the whole image.
8. The image scene recognition method based on deep learning of claim 1, characterized in that step S3 comprises the following sub-steps:
S31. Use the VGG-16 model parameters to initialize the parameters of every hidden layer and the output layer of the convolutional neural network model;
S32. Each batch inputs m pictures, and each layer's input and output are computed with that layer's formula. At a fully connected layer the input of the hidden layer is computed with the formula σ(W^ι a^{i,ι−1} + b^ι), where σ is the activation function, W the weight parameter, a the input vector, ι the layer index, b the bias parameter, and i the index of the i-th picture; at a convolutional layer the hidden layer's input is computed with the same formula as the fully connected layer; at a pooling layer the next layer's input is computed as pool(a^{i,ι−1}), where pool is the pooling function, until the output of the whole network is obtained;
S33. loss function is usedThe gradient error of whole network is calculated,
Wherein, B is { Li,Ii,riIndicate a batch of training data, LiIt is image IiTrue tag, P (s=Li|Ii,ri) indicate the
I candidate region riBelong to the probability of scene s, M indicates the quantity of the batch input picture;
S34. reversed to calculate gradient error in layer and correct weight parameter and bigoted parameter, when forward direction update is per layer parameter,
Formula is used respectively when encountering full articulamentumWithCalculate new power
Weight values and bigoted value, wherein δ are gradient error, and α is input vector, and m is the quantity of a batch training image, and i is i-th figure
Picture uses formula when encountering convolutional layer respectivelyWith
Weighted value and bigoted value are calculated, wherein u, v indicate δiSubmatrix, until adjusted value be less than stop iteration threshold, wherein
Rot180 representing matrix 180 degrees rotate.
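The forward rule of sub-step S32 and the fully connected update of sub-step S34 can be sketched in NumPy as follows. This is a minimal illustration of the stated formulas, not the patented implementation; the names `fc_forward` and `fc_update`, and the use of ReLU as the activation σ, are assumptions:

```python
import numpy as np

def fc_forward(W, b, a_prev, sigma=lambda z: np.maximum(z, 0.0)):
    """S32 forward rule for a fully connected layer: sigma(W a + b).
    ReLU stands in for the unspecified activation function sigma."""
    return sigma(W @ a_prev + b)

def fc_update(W, b, deltas, activations, alpha):
    """S34 mini-batch correction for a fully connected layer:
    W <- W - (alpha/m) * sum_i delta_i a_i^T
    b <- b - (alpha/m) * sum_i delta_i
    where alpha is the learning rate and m the batch size."""
    m = len(deltas)
    grad_W = sum(np.outer(d, a) for d, a in zip(deltas, activations)) / m
    grad_b = sum(deltas) / m
    return W - alpha * grad_W, b - alpha * grad_b
```

The convolutional update in S34 has the same structure, with the outer product replaced by a correlation of δ with the 180-degree-rotated input activations.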
9. The image scene recognition method based on deep learning as claimed in claim 1, characterized in that step S4 comprises the following sub-steps:
S41. Preprocess the image to be annotated and use it as the input to the image scene recognition model;
S42. Obtain from the model the highest-scoring scene category of the input image as the annotation word;
S43. Write the annotation word into the image.
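The annotation flow of sub-steps S41–S43 can be sketched as below. Everything here is a hedged stand-in: `preprocess` is a pass-through placeholder for the unspecified preprocessing, the lambda stands in for the trained recognition model, and the image is represented as a plain dictionary:

```python
def preprocess(image):
    # S41: hypothetical preprocessing; a real pipeline would resize
    # and normalize the pixel data to the model's expected input.
    return image["pixels"]

def annotate(image, model, scene_names):
    """S41-S43: preprocess, score every scene category, and write the
    top-scoring scene word into the image record."""
    x = preprocess(image)                        # S41: prepare model input
    scores = model(x)                            # S42: one score per scene category
    best = max(range(len(scores)), key=scores.__getitem__)
    image["label"] = scene_names[best]           # S43: write the annotation word
    return image

# Toy usage with a stand-in scoring function in place of the trained model.
img = annotate({"pixels": [0.1, 0.2], "label": None},
               lambda x: [0.2, 0.7, 0.1],
               ["beach", "forest", "street"])
```

Here `img["label"]` ends up as "forest", the name of the highest-scoring category.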
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810525276.XA CN108681752B (en) | 2018-05-28 | 2018-05-28 | Image scene labeling method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108681752A true CN108681752A (en) | 2018-10-19 |
CN108681752B CN108681752B (en) | 2023-08-15 |
Family
ID=63807069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810525276.XA Active CN108681752B (en) | 2018-05-28 | 2018-05-28 | Image scene labeling method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108681752B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104809187A (en) * | 2015-04-20 | 2015-07-29 | 南京邮电大学 | Indoor scene semantic annotation method based on RGB-D data |
CN105117712A (en) * | 2015-09-15 | 2015-12-02 | 北京天创征腾信息科技有限公司 | Single-sample human face recognition method compatible for human face aging recognition |
CN105426846A (en) * | 2015-11-20 | 2016-03-23 | 江南大学 | Method for positioning text in scene image based on image segmentation model |
US20160104056A1 (en) * | 2014-10-09 | 2016-04-14 | Microsoft Technology Licensing, Llc | Spatial pyramid pooling networks for image processing |
US20170011281A1 (en) * | 2015-07-09 | 2017-01-12 | Qualcomm Incorporated | Context-based priors for object detection in images |
CN106504233A (en) * | 2016-10-18 | 2017-03-15 | 国网山东省电力公司电力科学研究院 | Image electric power widget recognition methodss and system are patrolled and examined based on the unmanned plane of Faster R CNN |
CN106845549A (en) * | 2017-01-22 | 2017-06-13 | 珠海习悦信息技术有限公司 | A kind of method and device of the scene based on multi-task learning and target identification |
CN107194318A (en) * | 2017-04-24 | 2017-09-22 | 北京航空航天大学 | The scene recognition method of target detection auxiliary |
Non-Patent Citations (6)
Title |
---|
BOYA WANG等: "Scene text recognition algorithm based faster-RCNN", 《2017 FIRST INTERNATIONAL CONFERENCE ON ELECTRONICS INSTRUMENTATION & INFORMATION SYSTEMS》, pages 1 - 4 * |
GEORGIA GKIOXARI等: "Contextual Action Recognition with R*CNN", 《ICCV 2015》, pages 1080 - 1088 * |
J. R. R. UIJLINGS等: "Selective Search for Object Recognition", 《INTERNATIONAL JOURNAL OF COMPUTER VISION》, vol. 104, pages 154 - 171, XP035362199, DOI: 10.1007/s11263-013-0620-5 * |
CHANG Liang et al.: "Convolutional Neural Networks in Image Understanding", Acta Automatica Sinica, vol. 42, no. 09, pages 1300 - 1312 *
YAN Guoqing: "Research on SIFT-Based Scene Understanding Methods", China Masters' Theses Full-text Database (Information Science and Technology), no. 2012, pages 138 - 1858 *
CHEN Bingquan: "Research on Image Retrieval Technology Based on Multi-source Big Data Analysis", China Masters' Theses Full-text Database (Information Science and Technology), no. 2018, pages 138 - 185 *
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447092A (en) * | 2018-10-25 | 2019-03-08 | 哈尔滨工程大学 | Access extracting method between ice based on sea ice scene classification |
CN109492684A (en) * | 2018-10-31 | 2019-03-19 | 西安同瑞恒达电子科技有限公司 | Data processing method and device |
CN109452914A (en) * | 2018-11-01 | 2019-03-12 | 北京石头世纪科技有限公司 | Intelligent cleaning equipment, cleaning mode selection method, computer storage medium |
CN109657675A (en) * | 2018-12-06 | 2019-04-19 | 广州景骐科技有限公司 | Image labeling method, device, computer equipment and readable storage medium storing program for executing |
CN109784208A (en) * | 2018-12-26 | 2019-05-21 | 武汉工程大学 | A kind of pet behavioral value method based on image |
CN109784208B (en) * | 2018-12-26 | 2023-04-18 | 武汉工程大学 | Image-based pet behavior detection method |
CN109685154A (en) * | 2018-12-29 | 2019-04-26 | 天津链数科技有限公司 | A kind of method of image data mark label |
CN109726690A (en) * | 2018-12-30 | 2019-05-07 | 陕西师范大学 | Learner behavior image multizone based on DenseCap network describes method |
CN109726690B (en) * | 2018-12-30 | 2023-04-18 | 陕西师范大学 | Multi-region description method for learner behavior image based on DenseCap network |
CN109919183A (en) * | 2019-01-24 | 2019-06-21 | 北京大学 | A kind of image-recognizing method based on small sample, device, equipment and storage medium |
CN109982141A (en) * | 2019-03-22 | 2019-07-05 | 李宗明 | Utilize the method for the video image region analysis and product placement of AI technology |
CN109982141B (en) * | 2019-03-22 | 2021-04-23 | 李宗明 | Method for analyzing video image area and implanting advertisement by using AI technology |
CN110348404B (en) * | 2019-07-16 | 2023-05-02 | 湖州学院 | Visual evaluation analysis method for rural road landscape |
CN110348404A (en) * | 2019-07-16 | 2019-10-18 | 湖南人文科技学院 | A kind of road landscape visual evaluation analysis method |
CN110378953B (en) * | 2019-07-17 | 2023-05-02 | 重庆市畜牧科学院 | Method for intelligently identifying spatial distribution behaviors in swinery |
CN110378953A (en) * | 2019-07-17 | 2019-10-25 | 重庆市畜牧科学院 | A kind of method of spatial distribution behavior in intelligent recognition swinery circle |
CN110765937A (en) * | 2019-10-22 | 2020-02-07 | 新疆天业(集团)有限公司 | Coal yard spontaneous combustion detection method based on transfer learning |
CN111539407A (en) * | 2019-12-12 | 2020-08-14 | 南京启诺信息技术有限公司 | Deep learning-based circular dial plate identification method |
CN111062307A (en) * | 2019-12-12 | 2020-04-24 | 天地伟业技术有限公司 | Scene recognition and classification method based on Tiny-Darknet |
CN111062441A (en) * | 2019-12-18 | 2020-04-24 | 武汉大学 | Scene classification method and device based on self-supervision mechanism and regional suggestion network |
CN111539251A (en) * | 2020-03-16 | 2020-08-14 | 重庆特斯联智慧科技股份有限公司 | Security check article identification method and system based on deep learning |
CN111815689A (en) * | 2020-06-30 | 2020-10-23 | 杭州科度科技有限公司 | Semi-automatic labeling method, equipment, medium and device |
CN112488234A (en) * | 2020-12-10 | 2021-03-12 | 武汉大学 | End-to-end histopathology image classification method based on attention pooling |
CN112488234B (en) * | 2020-12-10 | 2022-04-29 | 武汉大学 | End-to-end histopathology image classification method based on attention pooling |
CN112990378A (en) * | 2021-05-08 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Scene recognition method and device based on artificial intelligence and electronic equipment |
CN112990378B (en) * | 2021-05-08 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Scene recognition method and device based on artificial intelligence and electronic equipment |
CN113033507A (en) * | 2021-05-20 | 2021-06-25 | 腾讯科技(深圳)有限公司 | Scene recognition method and device, computer equipment and storage medium |
CN113033507B (en) * | 2021-05-20 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Scene recognition method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108681752B (en) | 2023-08-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108681752A (en) | A kind of image scene mask method based on deep learning | |
CN108229381B (en) | Face image generation method and device, storage medium and computer equipment | |
CN108230278B (en) | Image raindrop removing method based on generation countermeasure network | |
CN111354017A (en) | Target tracking method based on twin neural network and parallel attention module | |
CN110533737A (en) | The method generated based on structure guidance Chinese character style | |
CN106651830A (en) | Image quality test method based on parallel convolutional neural network | |
CN110378334A (en) | A kind of natural scene text recognition method based on two dimensional character attention mechanism | |
CN113705371B (en) | Water visual scene segmentation method and device | |
CN111814611B (en) | Multi-scale face age estimation method and system embedded with high-order information | |
CN112115967B (en) | Image increment learning method based on data protection | |
CN107506792B (en) | Semi-supervised salient object detection method | |
CN111611972B (en) | Crop leaf type identification method based on multi-view multi-task integrated learning | |
CN106651915A (en) | Target tracking method of multi-scale expression based on convolutional neural network | |
CN116258990A (en) | Cross-modal affinity-based small sample reference video target segmentation method | |
CN112215268A (en) | Method and device for classifying disaster weather satellite cloud pictures | |
CN109948662B (en) | Face image depth clustering method based on K-means and MMD | |
CN108428234B (en) | Interactive segmentation performance optimization method based on image segmentation result evaluation | |
CN114049503A (en) | Saliency region detection method based on non-end-to-end deep learning network | |
CN111783688B (en) | Remote sensing image scene classification method based on convolutional neural network | |
CN109271989A (en) | A kind of hand-written test data automatic identifying method based on CNN and RNN model | |
CN112597979A (en) | Face recognition method for updating cosine included angle loss function parameters in real time | |
CN117011515A (en) | Interactive image segmentation model based on attention mechanism and segmentation method thereof | |
CN116523877A (en) | Brain MRI image tumor block segmentation method based on convolutional neural network | |
CN114663769B (en) | Fruit identification method based on YOLO v5 | |
CN113627240B (en) | Unmanned aerial vehicle tree species identification method based on improved SSD learning model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||