CN110633727A - Deep neural network ship target fine-grained identification method based on selective search - Google Patents

Deep neural network ship target fine-grained identification method based on selective search

Info

Publication number
CN110633727A
CN110633727A
Authority
CN
China
Prior art keywords
target
fine
grained
neural network
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910571107.4A
Other languages
Chinese (zh)
Inventor
沈同圣
刘峰
赵德鑫
黎松
罗再磊
于化鹏
孟路稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN201910571107.4A priority Critical patent/CN110633727A/en
Publication of CN110633727A publication Critical patent/CN110633727A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention relates to a deep neural network ship target fine-grained identification method based on selective search, and belongs to the field of computer vision. The method comprises the following steps. Step 1: extract target candidate regions from the input image using a selective search method. Step 2: determine the positions of the foreground and background in the image using a saliency detection prior, and cluster all candidate regions. Step 3: train the multi-scale component models with a convolutional neural network, and take the highest-scoring class as the component model for target recognition. Step 4: design a multi-scale pooling method that generates multi-scale descriptions of the target component model through deconvolution and maximum pooling. Step 5: complete fine-grained identification of the target through a convolutional neural network model. The method can be applied to the recognition of various ship targets with similar appearance and structure in complex scenes, and improves recognition accuracy by using local feature descriptions extracted from a ship data set by a convolutional neural network.

Description

Deep neural network ship target fine-grained identification method based on selective search
Technical Field
The invention belongs to the technical field of computer vision and image processing, and relates to a target fine-grained identification method, in particular to a deep neural network ship target fine-grained identification method based on selective search.
Background
Human development and exploration of the marine environment grow by the day, and the problem of identifying different classes of ship targets has broad application prospects and research value in maritime search and rescue, fishing vessel monitoring, precision weapon guidance, and other areas. In the marine environment, ship target classification is crucial to improving the capability of maritime safety systems; for a given ship image, computer vision and machine learning techniques can automatically identify the class of the ship target and locate its position.
Ship targets of different classes are similar in structure and hard to distinguish, ships of the same class can come in multiple colors, and near shore the background interference is varied; these problems make accurate identification of ship targets challenging. Target recognition based on deep learning shows broad application prospects: in particular, convolutional neural networks can extract finer target features from an image and enable end-to-end training. Moreover, a deep convolutional neural network requires no hand-designed features and can automatically learn features suited to image representation, with higher accuracy and better representational power.
With the successful application of deep learning methods, general image recognition has made great breakthroughs, for example in recognizing targets such as houses, vehicles, and ships. Fine-grained image recognition goes a step further: different subclasses within the same object class must be distinguished, and the problem to be solved is identifying the subtle differences among targets of similar classes.
Fine-grained target identification must discriminate by details and local differences among targets; when targets vary in pose, scale, or rotation the problem becomes more complex, and fine-grained image recognition remains a challenge for conventional deep learning models.
In the prior art, research methods for fine-grained identification fall mainly into three categories. The first starts from network depth, giving the model better discrimination by increasing the depth; the second attempts to eliminate the influence of object pose, camera position, and the like through image preprocessing, classifying in a more uniform setting; the third focuses on the details of object parts, classifying objects by localized parts or distinctive fine regions.
The present method follows the third approach, performing fine-grained classification of different ship types by locating the discriminative part information among targets of different classes.
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art by providing a deep neural network ship target fine-grained identification method based on selective search. The method locates candidate regions by combining selective search with saliency detection, eliminates background interference, and selects through neural network training the regions that contribute most to classification and recognition. The method proceeds layer by layer, adds deconvolution and pooling operations to a general convolutional neural network, can effectively generate multi-scale component descriptions, and extracts the key component information in the ship image that is useful for fine-grained target identification; it is an effective, automatic, and robust fine-grained ship target identification method.
The technical scheme of the invention is as follows: the deep neural network ship target fine-grained identification method based on selective search comprises the following steps:
Step 1: for an input image, segment the image with the Selective Search method, merge the generated sub-regions according to similarity criteria such as color, texture, and size, and extract target candidate regions;
Step 2: segment the image with a saliency detection method, extract the regions of interest, and determine the foreground target and background information in the image as a prior;
Step 3: combining the results of steps 1 and 2, cluster all candidate regions according to the division information, further narrowing the range of target candidate regions;
Step 4: design a model based on a deep neural network for feature extraction and classification, comprising convolution layers, pooling layers, a deconvolution layer, and fully-connected layers;
Step 5: train with an existing labeled data set, obtain the detection score of each clustered image block through supervised learning, design a cost function, iteratively update the network parameters with a gradient descent method, and take the class with the highest training score as the candidate component model for fine-grained target identification;
Step 6: design an effective multi-scale pooling method that applies deconvolution and maximum pooling to the last convolution layer of the network, generating multi-scale descriptions of the target component model and enriching the detailed description of the fine-grained target;
Step 7: complete fine-grained identification of the target by training an AlexNet neural network model on the selected multi-scale components.
Furthermore, in step 1, because the input ship target image contains a large amount of background information and targets with similar structural appearance, fine-grained target identification needs to describe the target with distinctive component information.
Further, in step 2, since the candidate regions determined by selective search contain mostly background information, a visual attention mechanism is introduced through saliency detection to preferentially locate the potential regions where targets may appear and to divide the image into foreground targets and background regions. All candidate regions are then clustered: the input is each image block generated by selective search, and the output is the clustered component models of the target candidate regions.
Further, in step 4, a convolutional neural network is designed and used, comprising 5 convolution layers and 3 pooling layers; the first layer uses an 11 × 11 filter, the second layer a 5 × 5 filter, and the third to fifth layers 3 × 3 filters, with ReLU as the activation function. The component models of the 10 categories from step 2 are fed into the network for training, the overall score of each category is obtained, and the highest-scoring cluster is selected as the component model to be identified for the target category.
Further, in step 5, for the target candidate regions determined in step 4, the size of the region of the original image corresponding to each structural unit is calculated by deconvolution according to the parameters of the network's forward propagation. With this strategy, the deconvolution pooling operation is applied to all candidate regions in the selected cluster to generate multi-scale receptive-field information, providing a more refined target description; the generated multi-scale component model is used as the description of the target.
Further, in step 7, training is performed with a designed or existing convolutional neural network model (including but not limited to AlexNet, VGG16, ResNet, and Inception), iterative optimization is performed with the stochastic gradient descent (SGD) method using a feedforward neural network model and the back-propagation algorithm, and classification of the target is completed with Softmax.
Further, the specific process in the step 1 is as follows:
For the input image, four similarity measures are designed to merge regions, and multiple color spaces are used to adapt to scene and illumination changes.
Color similarity is designed as follows: for each segmented region, a 25-bin histogram is computed per channel according to the color distribution, so the three RGB channels can be represented by a 75-dimensional vector, normalized with the L1 norm.
Texture similarity is designed as follows: region feature vectors are extracted with a SIFT-like feature descriptor; 8 gradient directions are computed for each color channel of the image, and Gaussian derivatives with variance σ = 1 are calculated.
Size similarity is designed to merge small regions as early as possible, prevent a single region from swallowing all the others, and ensure that the algorithm extracts candidate regions at different scales over the whole image.
Fill (matching) similarity is designed to measure how well two regions r_i and r_j fit together, ensuring that the bounding box of the merged region is as small as possible.
Further, the specific process in the step 2 is as follows:
Superpixel segmentation is performed on the image with the SLIC (Simple Linear Iterative Clustering) method, dividing the image into n regions R = {r_1, r_2, ..., r_n}.
The Euclidean color distance between each superpixel and every other superpixel is computed by traversal in the CIE-Lab space and used as the similarity criterion, denoted s(r_i, r_j).
Different gray levels in the saliency map indicate different degrees of saliency, with highlighted parts indicating strong saliency. The saliency map is divided with multi-level thresholds, and all candidate regions extracted by selective search are clustered, so that regions of similar saliency are grouped into one class.
Further, the specific process in step 3 is as follows:
and (3) designing or adopting a popular convolutional neural network model, training each cluster in the step (2), obtaining the integral score of one category, and selecting the cluster with the highest score as a component model for fine-grained target recognition in the text.
The network model used in this step can be shared with the network in step 5; using the same network improves training efficiency.
Further, the specific process in step 4 is as follows:
As network depth increases, each unit corresponds to a receptive field of different size in the original image. According to the parameters of the network's forward propagation, deconvolution is used to calculate the size of the region in the original image corresponding to each structural unit.
Deconvolution calculations with activation vectors of several different sizes in the convolution layer yield the receptive fields corresponding to different scales in the input image. The invention generates multi-scale receptive-field features of the image at very low computational cost.
The technical scheme adopted by the invention has the following technical advantages:
the method carries out the type identification on the ship image based on the fine-grained neural network, compared with the traditional neural network identification model, the method can pertinently extract the target candidate part model and describe the local characteristics of the target, the identification accuracy is gradually improved through different steps, and the accurate identification of different ship targets is realized.
The method is suited to fine-grained identification of targets against complex backgrounds. Candidate regions are searched and clustered with multiple criteria, which markedly improves computational efficiency; deconvolution and maximum pooling operations generate multi-scale images of the target components, so the learned features are richer and more diverse. The model's invariance to rotation, scaling, occlusion, and similar factors is thereby improved to some extent, raising the accuracy on the objects to be identified.
The method implements an end-to-end fine-grained recognition task: the target category of an input raw image is judged automatically. When a new category needs to be recognized, only a corresponding target image database needs to be constructed and the fine-grained neural network model retrained; the method can therefore also be used for fine-grained recognition of targets of other categories.
Drawings
FIG. 1 is an overall flow chart of the deep neural network ship target fine-grained identification method based on selective search.
FIG. 2 is a flow chart illustrating the steps of the present invention.
Fig. 3 is a schematic diagram of the network structure of each layer of the convolutional neural network model adopted in the present invention.
FIG. 4 is a schematic diagram of a multi-scale description of a target part model generated by the deconvolution pooling operation of the present invention.
Detailed Description
Specific embodiments of the present invention will be described below with reference to the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Fig. 1 is a schematic diagram showing the steps of the implementation method of the present invention. The method comprises the following steps:
For the input ship image data set, each picture contains a ship target of one type, and the corresponding ship type label is input at the same time. The data used in the invention is the MARVEL database released by the Aselsan Research Center; each class of images contains targets under varying angles, illumination, backgrounds, and other conditions, and because of changes in target distance and shooting angle, and in the cargo carried by the ships, the types of some ship targets are difficult to judge by eye, so the data meets the requirements of a fine-grained target identification task. The invention selects 10 classes of targets and 20000 pictures as training and test data, divided into training, validation, and test sets in a 7:2:1 ratio.
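As a rough illustration, the 7:2:1 split over 20000 labeled images can be sketched as follows; the file-name pattern and label layout here are assumptions for illustration, not the actual MARVEL organization.

```python
# Sketch of the 7:2:1 train/validation/test split described above.
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split (path, label) pairs into train/val/test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# e.g. 20000 images over 10 classes -> 14000 / 4000 / 2000
train, val, test = split_dataset([(f"img_{i}.jpg", i % 10) for i in range(20000)])
```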
In step 1, target candidate regions are extracted with the selective search method, and multiple color spaces are used to adapt to scene and illumination changes. For each segmented region, four similarity measures are designed to merge regions: color similarity, texture similarity, size similarity, and fill (matching) similarity.
Color similarity: for each segmented region, a 25-bin histogram is computed per channel according to the color distribution, so the three RGB channels can be represented by a 75-dimensional vector, normalized with the L1 norm. Region r_i can then be expressed as the color histogram C_i = {c_i^1, ..., c_i^n} with n = 75. The color similarity between different regions is:

s_colour(r_i, r_j) = Σ_{k=1}^{n} min(c_i^k, c_j^k)
texture similarity: region feature vectors are extracted using the Sift-Like as a feature descriptor, 8 different gradient directions are divided for each color space of the image, and Gaussian Derivative (Gaussian Derivative) with variance σ of 1 is calculated. Then, a 10bins histogram is obtained for each color space, and the texture feature of the region can be represented by a feature vector with dimensions 3 × 8 × 10 ═ 240. The texture similarity between the regions is calculated by the formula
Figure BSA0000185186280000053
Size similarity: this similarity is set in order to merge small regions as early as possible, prevent a single region from swallowing all the others, and ensure that the algorithm extracts candidate regions at different scales over the whole image; size(im) denotes the size of the image in pixels. The similarity is calculated as:

s_size(r_i, r_j) = 1 − (size(r_i) + size(r_j)) / size(im)
Fill (matching) similarity: this similarity measures how well two regions r_i and r_j fit together, preferring merges whose bounding box is as small as possible. BB_ij is the minimum bounding box of r_i and r_j, and s_fill(r_i, r_j) depends on the part of BB_ij not covered by r_i and r_j:

s_fill(r_i, r_j) = 1 − (size(BB_ij) − size(r_i) − size(r_j)) / size(im)
Combining the above four similarity measures gives the region merging strategy of the selective search method, with a_i ∈ {0, 1}:

s(r_i, r_j) = a_1 s_colour(r_i, r_j) + a_2 s_texture(r_i, r_j) + a_3 s_size(r_i, r_j) + a_4 s_fill(r_i, r_j)
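A minimal Python sketch of this merging score, assuming each region is a dict carrying a normalized 75-dimensional color histogram ("hist"), a normalized 240-dimensional texture histogram ("tex"), a pixel count ("size"), and a bounding box ("bbox"); these field names are illustrative assumptions.

```python
import numpy as np

def s_colour(ri, rj):
    return np.minimum(ri["hist"], rj["hist"]).sum()

def s_texture(ri, rj):
    return np.minimum(ri["tex"], rj["tex"]).sum()

def s_size(ri, rj, im_size):
    return 1.0 - (ri["size"] + rj["size"]) / im_size

def s_fill(ri, rj, im_size):
    # area of the minimum bounding box of both regions, boxes as (x0, y0, x1, y1)
    (x0a, y0a, x1a, y1a), (x0b, y0b, x1b, y1b) = ri["bbox"], rj["bbox"]
    bb = (max(x1a, x1b) - min(x0a, x0b)) * (max(y1a, y1b) - min(y0a, y0b))
    return 1.0 - (bb - ri["size"] - rj["size"]) / im_size

def similarity(ri, rj, im_size, a=(1, 1, 1, 1)):
    """s(r_i, r_j) with a_i in {0, 1} switching each term on or off."""
    return (a[0] * s_colour(ri, rj) + a[1] * s_texture(ri, rj)
            + a[2] * s_size(ri, rj, im_size) + a[3] * s_fill(ri, rj, im_size))
```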
The test images are 256 × 256 in size, and the number of candidate regions for the selective search method is set to k = 300; the candidate regions extracted by selective search cover component models at all scales of the target, but also contain a large amount of background.
The saliency detection and clustering method of step 2 clusters all the candidate regions extracted by selective search, so that regions of similar saliency are grouped into one class.
The saliency detection and region clustering procedure is as follows.
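A minimal Python sketch of this procedure, assuming scikit-image for SLIC superpixels and CIE-Lab conversion; the threshold levels and the clustering rule are illustrative assumptions rather than the patented pseudo-code.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import slic

def saliency_by_color_contrast(image, n_segments=200):
    """Per-superpixel saliency: summed CIE-Lab distance to all other superpixels."""
    labels = slic(image, n_segments=n_segments, start_label=0)
    lab = rgb2lab(image)
    means = np.array([lab[labels == i].mean(axis=0)
                      for i in range(labels.max() + 1)])
    dist = np.linalg.norm(means[:, None] - means[None, :], axis=-1)
    sal = dist.sum(axis=1)
    sal = (sal - sal.min()) / (np.ptp(sal) + 1e-8)   # normalize to [0, 1]
    return sal[labels]                               # pixel-wise saliency map

def cluster_candidates(saliency, boxes, levels=(0.25, 0.5, 0.75)):
    """Assign each candidate box to a saliency level (cluster index)."""
    clusters = {}
    for box in boxes:
        x0, y0, x1, y1 = box
        s = saliency[y0:y1, x0:x1].mean()
        idx = int(np.searchsorted(levels, s))        # multi-level threshold
        clusters.setdefault(idx, []).append(box)
    return clusters
```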
The saliency map is divided according to different thresholds, and the candidate regions extracted by selective search are clustered according to their coordinate positions, yielding R categories of image-block data sets, with R = 10.
Fig. 3 shows the convolutional network model adopted in the invention; the images of the R clusters are fed into the convolutional neural network for training, and the cluster with the highest target recognition score is selected.
The convolutional neural network model adopted by the invention comprises 5 convolution layers, 3 pooling layers, and 3 fully-connected layers.
Training each cluster, outputting probability scores by a softmax layer of the last layer of the network, and calculating the total score of each cluster according to the formula:
Figure BSA0000185186280000071
wherein the content of the first and second substances,
all the categories are sent into a network for training, the overall score of one category is obtained, and a cluster with the highest score is selected to serve as a component model for fine-grained target recognition.
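A sketch of this selection in Python, assuming `model.predict` returns softmax probabilities of shape (n, num_classes); the mean-score rule mirrors the formula above, and the interface is an assumption.

```python
import numpy as np

def cluster_score(model, blocks, target_class):
    """Mean softmax probability of the target class over a cluster's blocks."""
    probs = model.predict(np.stack(blocks))        # (n, num_classes)
    return probs[:, target_class].mean()

def best_cluster(model, clusters, target_class):
    """Pick the cluster used as the component model for recognition."""
    scores = {r: cluster_score(model, blocks, target_class)
              for r, blocks in clusters.items()}
    return max(scores, key=scores.get)
```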
Fig. 4 shows the model structure of the deconvolution pooling operation in the invention; the specific operation steps are as follows:
A target component-model image block of size (224, 224, 3) is input;
The input passes through the first convolution layer with kernel size (7, 7), 96 channels, and stride 2, followed by a (3, 3) pooling layer with stride 2 and local response normalization of size (5, 5); the feature map becomes (55, 55, 96);
The feature map passes through the second convolution layer with kernel size (5, 5), 256 channels, and stride 2, followed by a (3, 3) pooling layer with stride 2 and local response normalization of size (3, 3); the feature map becomes (13, 13, 256);
The feature map passes through the third convolution layer with kernel size (3, 3), padding 1, 384 channels, and stride 1; the feature map becomes (13, 13, 384);
The feature map passes through the fourth convolution layer with kernel size (3, 3), padding 1, 384 channels, and stride 1; the feature map remains (13, 13, 384);
The feature map passes through the fifth convolution layer with kernel size (3, 3), padding 1, 256 channels, and stride 1; the feature map becomes (13, 13, 256);
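The stack above can be sketched in tf.keras as follows; since tf.keras has no built-in local response normalization layer, BatchNormalization stands in for the LRN steps, "same" convolution padding is assumed to reproduce the stated shapes, and the fully-connected head is an assumption following AlexNet practice.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_part_model_cnn(num_classes=10):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(224, 224, 3)),
        layers.Conv2D(96, 7, strides=2, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),           # -> (55, 55, 96)
        layers.BatchNormalization(),                 # stands in for LRN(5)
        layers.Conv2D(256, 5, strides=2, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),           # -> (13, 13, 256)
        layers.BatchNormalization(),                 # stands in for LRN(3)
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(256, 3, padding="same", activation="relu"),  # Conv5: (13, 13, 256)
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dense(4096, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```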
As network depth increases, each unit corresponds to a receptive field of different size in the original image. The image-block model after the fifth convolution layer is extracted, and according to the parameters of the network's forward propagation, deconvolution is used to calculate the size of the region of the original image corresponding to each structural unit.
For a convolution layer with kernel size a and stride s, an output patch of size T corresponds to an input region of size [s(T − 1) + a] × [s(T − 1) + a]. For example, in the 13 × 13 Conv5 layer, each single unit covers a 139 × 139 region of the original image by deconvolution calculation.
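The size formula can be applied recursively through the stack to check these numbers; a small Python sketch under the layer parameters listed above:

```python
# Map an output patch back through each layer: t -> s * (t - 1) + a.
layers = [  # (name, kernel a, stride s) for the conv/pool stack above
    ("conv1", 7, 2), ("pool1", 3, 2),
    ("conv2", 5, 2), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
]

def receptive_field(t):
    """Size of the input region covered by a t x t patch of Conv5."""
    for _, a, s in reversed(layers):
        t = s * (t - 1) + a
    return t

for t in (1, 2, 3):  # the multi-scale unit sizes used in the method
    print(t, receptive_field(t))  # 1 -> 139 (as in the text), 2 -> 155, 3 -> 171
```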
Deconvolution calculations with activation vectors of several different sizes in the convolution layer yield the receptive fields corresponding to different scales in the input image. The method thus generates multi-scale receptive-field features of the image at very low computational cost.
As shown in fig. 4, for an N × N structural unit in the convolution layer, multi-scale pooling combines neighborhood information over an M × M range, with M ∈ [1, N]; as M takes different values, the corresponding network structure covers receptive fields of different extents in the original image, providing more comprehensive information.
The method selects activation units of sizes 1, 2, and 3 in the Conv5 layer to generate multi-scale target representations, producing a deconvolution-pooled multi-scale component model; describing the target with this component information improves the accuracy of fine-grained target identification.
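A TensorFlow sketch of multi-scale pooling over the Conv5 activations; concatenating the flattened scales into one descriptor is an illustrative choice, not necessarily how the invention combines them.

```python
import tensorflow as tf

def multi_scale_descriptor(conv5, scales=(1, 2, 3)):
    """conv5: Conv5 activations of shape (batch, 13, 13, 256)."""
    parts = []
    for m in scales:
        # max-pool over M x M neighborhoods; M = 1 keeps single units
        pooled = tf.nn.max_pool2d(conv5, ksize=m, strides=m, padding="VALID")
        parts.append(tf.reshape(pooled, (tf.shape(conv5)[0], -1)))
    return tf.concat(parts, axis=1)  # one multi-scale component descriptor

# e.g. descriptor = multi_scale_descriptor(tf.random.normal((4, 13, 13, 256)))
```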
The invention implements the algorithm with Python and TensorFlow, and the algorithm parameters can be reasonably designed and adjusted according to the user's requirements.
The parameters are set as follows: initial learning rate 0.01, learning-rate decay factor 0.1, momentum 0.9, and weight decay 0.0005. The learning rate is decayed once every 20 iteration epochs, for 80 epochs in total; the batch size is set to 64, and a stochastic gradient descent algorithm is used for optimization.
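Under the stated hyperparameters, the optimizer setup can be sketched in tf.keras as follows; the steps-per-epoch value is an assumed example, and applying the 0.0005 weight decay as L2 kernel regularization is one common reading of that setting.

```python
import tensorflow as tf

steps_per_epoch = 14000 // 64            # e.g. 14000 training images, batch 64
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[20 * steps_per_epoch, 40 * steps_per_epoch, 60 * steps_per_epoch],
    values=[0.01, 0.001, 0.0001, 0.00001])   # decay by 0.1 every 20 epochs
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
# The 0.0005 weight decay can be applied per layer as L2 regularization,
# e.g. kernel_regularizer=tf.keras.regularizers.l2(5e-4).
```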
The neural network model, gradient descent algorithm, parameter selection method, and the like in this embodiment can be reasonably chosen according to actual requirements. Those skilled in the art can make reasonable modifications to implementation details without departing from the scope of the invention.
The embodiments of the present invention are not limited to the above-mentioned embodiments, and any other modifications, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents and are included in the scope of the present invention.

Claims (6)

1. The deep neural network ship target fine-grained identification method based on selective search is characterized by comprising the following steps:
Step 1: for an input image, segment the image with the Selective Search method, merge the generated sub-regions according to similarity criteria such as color, texture, and size, and extract target candidate regions;
Step 2: segment the image with a saliency detection method, extract the regions of interest, and determine the foreground target and background information in the image as a prior;
Step 3: combining the results of steps 1 and 2, cluster all candidate regions according to the division information, further narrowing the range of target candidate regions;
Step 4: design a model based on a deep neural network for feature extraction and classification, comprising convolution layers, pooling layers, a deconvolution layer, and fully-connected layers;
Step 5: train with an existing labeled data set, obtain the detection score of each clustered image block through supervised learning, design a cost function, iteratively update the network parameters with a gradient descent method, and take the class with the highest training score as the candidate component model for fine-grained target identification;
Step 6: design an effective multi-scale pooling method that applies deconvolution and maximum pooling to the last convolution layer of the network, generating multi-scale descriptions of the target component model and enriching the detailed description of the fine-grained target;
Step 7: complete fine-grained identification of the target by training an AlexNet neural network model on the selected multi-scale components.
2. The fine-grained ship target identification method based on selective search and multi-scale pooling of claim 1, wherein: in step 1, because the input ship target image contains a large amount of background information and targets with similar structural appearance, fine-grained target identification needs to describe the target with distinctive component information.
3. The fine-grained ship target identification method based on selective search and multi-scale pooling of claim 1, wherein: in step 2, since the candidate regions determined by selective search contain mostly background information, a visual attention mechanism is introduced through saliency detection to preferentially locate the potential regions where targets may appear and to divide the image into foreground targets and background regions; all candidate regions are then clustered, the input being each image block generated by selective search and the output being the clustered component models of the target candidate regions.
4. The fine-grained ship target identification method based on selective search and multi-scale pooling of claim 1, wherein: in step 4, a convolutional neural network is designed and used, comprising 5 convolution layers and 3 pooling layers; the first layer uses an 11 × 11 filter, the second layer a 5 × 5 filter, and the third to fifth layers 3 × 3 filters, with ReLU as the activation function; all the clustered component models from step 3 are fed into the network for training, the overall score of each cluster is obtained, and the highest-scoring class is selected as the fine-grained component model of the target in the image, further determining the target's fine-grained components on the basis of step 3.
5. The fine-grained ship target identification method based on selective search and multi-scale pooling of claim 1, wherein: in step 5, for the target candidate regions determined in step 4, according to the parameters of the network's forward propagation, deconvolution pooling is used at the last convolution layer to calculate the size of the region of the original image corresponding to each structural unit, generating multi-scale receptive-field information that provides a more refined target description; the generated multi-scale component model serves as the final information of the target to be identified.
6. The fine-grained ship target identification method based on selective search and multi-scale pooling of claim 1, wherein: in step 7, training is performed with a designed or existing convolutional neural network model (including but not limited to AlexNet, VGG16, ResNet, and Inception), iterative optimization is performed with the stochastic gradient descent (SGD) method using a feedforward neural network model and the back-propagation algorithm, and classification of the target is completed with Softmax.
CN201910571107.4A 2019-06-28 2019-06-28 Deep neural network ship target fine-grained identification method based on selective search Pending CN110633727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910571107.4A CN110633727A (en) 2019-06-28 2019-06-28 Deep neural network ship target fine-grained identification method based on selective search


Publications (1)

Publication Number Publication Date
CN110633727A true CN110633727A (en) 2019-12-31

Family

ID=68969182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910571107.4A Pending CN110633727A (en) 2019-06-28 2019-06-28 Deep neural network ship target fine-grained identification method based on selective search

Country Status (1)

Country Link
CN (1) CN110633727A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9665802B2 (en) * 2014-11-13 2017-05-30 Nec Corporation Object-centric fine-grained image classification
CN105868774A (en) * 2016-03-24 2016-08-17 西安电子科技大学 Selective search and convolutional neural network based vehicle logo recognition method
CN107609601A (en) * 2017-09-28 2018-01-19 北京计算机技术及应用研究所 A kind of ship seakeeping method based on multilayer convolutional neural networks
CN109241902A (en) * 2018-08-30 2019-01-18 北京航空航天大学 A kind of landslide detection method based on multi-scale feature fusion
CN109165636A (en) * 2018-09-28 2019-01-08 南京邮电大学 A kind of sparse recognition methods of Rare Birds based on component-level multiple features fusion

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ANELIA ANGELOVA ET AL.: "Feature Combination with Multi-Kernel Learning for Fine-Grained Visual Classification", IEEE Winter Conference on Applications of Computer Vision *
LIU FENG ET AL.: "Ship target recognition based on multi-band deep neural networks" (in Chinese), Optics and Precision Engineering *
LIU FENG ET AL.: "Multi-band ship target recognition method based on feature-level fusion" (in Chinese), Spectroscopy and Spectral Analysis *
LIU FENG ET AL.: "Multi-band ship target recognition with a feature-fusion convolutional neural network" (in Chinese), Acta Optica Sinica *
WANG MINGMING: "Fine-grained image recognition based on object part models" (in Chinese), China Masters' Theses Full-text Database, Information Science and Technology *
YUAN QIANDING: "Research on fine-grained classification based on multi-scale convolutional feature matching" (in Chinese), China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021110565A (en) * 2020-01-07 2021-08-02 株式会社東芝 Classification device and classification method
CN112560799A (en) * 2021-01-05 2021-03-26 北京航空航天大学 Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN112560799B (en) * 2021-01-05 2022-08-05 北京航空航天大学 Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN113920315A (en) * 2021-10-14 2022-01-11 江南大学 Garment attribute identification method based on convolutional neural network
CN117409193A (en) * 2023-12-14 2024-01-16 南京深业智能化系统工程有限公司 Image recognition method, device and storage medium under smoke scene
CN117409193B (en) * 2023-12-14 2024-03-12 南京深业智能化系统工程有限公司 Image recognition method, device and storage medium under smoke scene

Similar Documents

Publication Publication Date Title
CN110619369B (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN110472627B (en) End-to-end SAR image recognition method, device and storage medium
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN107657279B (en) Remote sensing target detection method based on small amount of samples
CN106599854B (en) Automatic facial expression recognition method based on multi-feature fusion
CN109740460B (en) Optical remote sensing image ship detection method based on depth residual error dense network
CN111368683B (en) Face image feature extraction method and face recognition method based on modular constraint CenterFace
CN106909902B (en) Remote sensing target detection method based on improved hierarchical significant model
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN111445488B (en) Method for automatically identifying and dividing salt body by weak supervision learning
CN110119728A (en) Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network
CN111652321A (en) Offshore ship detection method based on improved YOLOV3 algorithm
CN111652317B (en) Super-parameter image segmentation method based on Bayes deep learning
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN106815323B (en) Cross-domain visual retrieval method based on significance detection
CN105574063A (en) Image retrieval method based on visual saliency
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN109903339B (en) Video group figure positioning detection method based on multi-dimensional fusion features
CN104657980A (en) Improved multi-channel image partitioning algorithm based on Meanshift
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN109087330A (en) It is a kind of based on by slightly to the moving target detecting method of smart image segmentation
CN111626993A (en) Image automatic detection counting method and system based on embedded FEFnet network
CN110443257B (en) Significance detection method based on active learning
CN110020669A (en) A kind of license plate classification method, system, terminal device and computer program
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191231