CN109376591B - Ship target detection method for deep learning feature and visual feature combined training - Google Patents

Ship target detection method for deep learning feature and visual feature combined training

Info

Publication number
CN109376591B
CN109376591B (application number CN201811050911.XA)
Authority
CN
China
Prior art keywords
feature
layer
multiplied
size
traditional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811050911.XA
Other languages
Chinese (zh)
Other versions
CN109376591A (en
Inventor
邵振峰
吴文静
张瑞倩
王岭钢
李成源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201811050911.XA priority Critical patent/CN109376591B/en
Publication of CN109376591A publication Critical patent/CN109376591A/en
Application granted granted Critical
Publication of CN109376591B publication Critical patent/CN109376591B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a ship target detection method with joint training of deep learning features and visual features, comprising the following steps: collecting sample data, extracting CNN features, extracting traditional invariant moment features and LOMO features, reducing feature dimensionality, and constructing a feature fusion network (FCNN); finally, the network is trained with the sample data and the model is tested with test data. Compared with the prior art, the visual feature extraction comprehensively considers ship shape, color and texture, so the detection process is interpretable, while features complementary to the traditional ones can still be learned during CNN back propagation. The method is fast, efficient and accurate, gives good detection results in complex scenes such as cloud and fog, overcast days and rain, and is highly robust. It can extract features complementary to the traditional features, is extremely fast, and can achieve real-time monitoring.

Description

Ship target detection method for deep learning feature and visual feature combined training
Technical Field
The invention belongs to the field of ship detection computer vision, and particularly relates to a ship target detection method for deep learning feature and visual feature combined training.
Background
China has long coastlines, vast sea areas and abundant ocean resources. With continued economic development the number of ships at sea keeps growing, so ship detection has urgent practical demand. Ship target detection uses computer vision and image processing technologies to detect ship targets of interest in an image and to further extract a large amount of useful information, and it has wide application prospects in both military and civil fields. In the civil field, for example, by acquiring information such as a ship's position, size, heading and speed, a specific sea area or bay port can be monitored, and marine transportation, illegal fishing, smuggling and illegal dumping of oil pollution can be supervised, which is of great significance for economic development, environmental protection, sea-area use management and the safeguarding of maritime rights and interests.
In modern society, video surveillance cameras are ubiquitous, and a monitoring-center video wall can display many camera feeds at the same time; if these feeds are watched only by human eyes, abnormal events in the video are easily missed. With the rapid development of computer networks, people increasingly prefer to analyze the video images obtained by sensors with computer vision rather than with the human eye in order to obtain target information from the images. Image target detection generally consists of two steps: feature extraction, and classification and localization by a classifier. Two main categories of features are used for ship detection: visual features and features extracted by a convolutional neural network (CNN).
(I) Visual features. Commonly used visual features include color, shape and texture.
(1) Color features. Since color is usually strongly related to the objects or scene in an image, color is the most widely used visual feature. In addition, color features depend little on the size, orientation and viewing angle of the image and are therefore highly robust. Commonly used color features are the color histogram and information entropy.
(2) Shape features. Shape features describe local properties of the target, and the shape information they reflect is not entirely consistent with human visual perception. Commonly used shape features are area, aspect ratio and invariant moments. An invariant moment is a moment quantity that remains unchanged after the target is translated, rotated or scaled; the seven geometric invariant moments (Hu moments) can be selected to represent the shape characteristics of the target region.
(3) Texture features. Texture features describe the surface properties of the object corresponding to an image or image region. As a statistical feature, texture has rotation invariance and strong resistance to noise. However, when the resolution of the image changes, the computed texture may deviate considerably. In addition, the texture reflected in a 2-D image is not necessarily the true texture of a 3-D object surface, since it may be affected by illumination and reflections. The gray-level co-occurrence matrix is the most commonly used texture feature and has strong adaptability and robustness.
(II) CNN characteristics
Natural images have an inherent property: the statistics of one part of an image are the same as those of other parts. This means that features learned on one part can also be used on another part, so the same learned features can be used at all positions of the image. In other words, for the recognition problem of a large image of size r × c (r rows, c columns), a small region of size a × b (a rows, b columns) is randomly selected from the image as a training sample, some features are learned from this small sample, and these features are then used as filters convolved with the whole original image, yielding a convolved feature map for any position of the original image. This approach can automatically learn the features of various targets and obtain high-dimensional features of ships, and compared with traditional methods the accuracy of the detection results is greatly improved.
However, applying traditional features and CNN features to ship detection has the following limitations:
(1) Traditional features have excellent interpretability and controllability, and the detection results on a calm sea surface are good. However, in the presence of interference such as cloud shadows and sea waves, the false-detection rate is high. Moreover, manual feature selection is slow, which is unfavorable for practical application.
(2) A convolutional neural network can automatically learn high-dimensional ship features and the detection speed is fast. However, such black-box features are poorly interpretable, and ships of different sizes retain their features to different degrees after convolution, which also leads to inconsistent detection performance across different ships.
Disclosure of Invention
The technical problem solved by the invention is to overcome the defects of the prior art and provide a ship target detection method with joint training of deep learning features and visual features.
The technical scheme of the invention provides a ship target detection method with joint training of deep learning features and traditional features, comprising the following steps:
step one, sample data collection: collecting surveillance video frame data of coastal areas under visible light, extracting images, and labeling the images that contain ship targets;
step two, CNN feature extraction: inputting the obtained samples into a convolutional neural network for training to obtain a trained model of the ship target, the convolutional neural network outputting the CNN features;
step three, traditional feature extraction: extracting the invariant moment features and LOMO features of the ship target region;
step four, feature dimension reduction: concatenating the invariant moment features with the LOMO features, and reducing the dimensionality of the concatenated traditional features with a principal component analysis algorithm;
step five, constructing a feature fusion network FCNN to map the CNN features and the traditional features into a unified feature space;
step six, training the feature fusion network FCNN with the sample data, and verifying and testing the trained FCNN with test data.
In step one, the images containing ship targets are labeled according to the PASCAL VOC data set standard; the generated annotation files record, for each image, the coordinates of the four vertices of the minimum enclosing rectangle of every ship target together with the corresponding image, thereby constructing a ship image sample library.
In step two, a region-based convolutional neural network is adopted; it consists of several alternating convolutional layers, pooling layers and fully connected layers and is updated with a back-propagation algorithm.
In step two, the structure of the adopted region-based convolutional neural network is as follows:
1) first layer: convolution kernel size 11 × 11, max pooling size 2 × 2, followed by a BN layer; output feature map size 55 × 55;
2) second layer: convolution kernel size 5 × 5, max pooling size 2 × 2, followed by a BN layer; output feature map size 27 × 27;
3) third layer: convolution kernel size 3 × 3, max pooling size 2 × 2, followed by a BN layer; output feature map size 13 × 13;
4) fourth layer: convolution kernel size 3 × 3; output feature map size 13 × 13;
5) fifth layer: convolution kernel size 3 × 3; output feature map size 13 × 13;
6) two fully connected layers, FC7 and FC8.
In step three, the LOMO feature comprehensively considers the influence of illumination and viewing-angle changes on the image: first, the Retinex algorithm is used to preprocess the input image, reducing the influence of illumination; second, for the image preprocessed by the Retinex algorithm, color features are extracted with an HSV color histogram; in addition, SILTP descriptors are applied to extract illumination-invariant texture features of the image.
In step five, the feature fusion network FCNN contains a fusion layer and a regression layer; the input of the fusion layer is the CNN features and the traditional features; if the number of ship classes to be detected is T, the output of the regression layer is a T × 1 vector whose entries range from 0 to 1 and represent the probability that the sample belongs to each class.
Compared with the prior art, the invention has the following advantages and positive effects:
the characteristics of ship shape, color and texture are comprehensively considered in the traditional characteristic extraction process, so that the detection process has interpretability, and other characteristics except the traditional characteristics can be learnt in the CNN back propagation process. In addition, the Hu invariant moment features are only 7, and color histogram features HSV and scale invariant feature patterns (SILTP) used in Local maximum triggering (LOMO) features are also simpler to calculate, so that the overall calculation speed is not slowed down.
The CNN feature extraction part adopts a convolution neural network based on a region, and the method is rapid, efficient and high in accuracy. The method still has a good detection result for complex scenes such as cloud and fog, cloudy days, raining and the like, and has high robustness. The method can extract the characteristics complementary with the traditional characteristics, has extremely high speed and can achieve the effect of real-time monitoring.
The deep learning characteristic and the traditional characteristic are trained in a combined manner, so that on one hand, a classical ship detection operator can be utilized, the detection process is simplified, and the understanding is facilitated; on the other hand, joint training and feature complementation can fully automate the detection process, and the method does not need human-computer interaction and utilizes practical application.
Drawings
FIG. 1 is a general flow diagram of an embodiment of the present invention.
FIG. 2 is a flow chart of Hu invariant moment extraction in step three (a) of an embodiment of the present invention.
FIG. 3 is a flow chart of LOMO feature extraction in step three (b) of an embodiment of the present invention.
FIG. 4 is a structural diagram of the fusion network in step five of an embodiment of the present invention.
Detailed Description
For better understanding of the technical solutions of the present invention, the following detailed description of the present invention is made with reference to the accompanying drawings and examples.
Referring to fig. 1, a method provided by an embodiment of the invention includes the following steps:
Step one: sample data collection.
The data to be collected are mainly coastal-area surveillance video frames under visible light. In a specific implementation, each frame image, 1920 × 1080 pixels in size, can be obtained from the collected video data by decoding and frame extraction. The images containing ship targets are labeled according to the Pascal data set (PASCAL VOC) standard, and the generated annotation files record the coordinates of the four vertices of the minimum enclosing rectangle of each ship target on every picture together with the corresponding image, thereby constructing a ship image sample library.
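For illustration only, the following minimal Python sketch writes one ship's minimum enclosing rectangle into a PASCAL-VOC-style XML file. The exact field layout (filename/size/object/bndbox) follows the usual VOC convention and is an assumption here; the description above states only that the four vertex coordinates of the rectangle and the corresponding image are recorded.

import xml.etree.ElementTree as ET

def write_voc_annotation(xml_path, image_name, width, height, boxes):
    # boxes: list of (xmin, ymin, xmax, ymax) minimum enclosing rectangles
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = image_name
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    for (xmin, ymin, xmax, ymax) in boxes:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = "ship"
        bnd = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(val)
    ET.ElementTree(ann).write(xml_path)

# usage example with a hypothetical frame and one ship rectangle
write_voc_annotation("frame_0001.xml", "frame_0001.jpg", 1920, 1080, [(400, 300, 900, 520)])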
Step two: CNN feature extraction.
The samples obtained in step one are resized uniformly to 224 × 224 and then input into a convolutional neural network for training to obtain a trained model of the ship target. The region-based convolutional neural network used in the embodiment of the invention has the following layer structure:
1) first layer: convolution kernel size 11 × 11, max pooling size 2 × 2, followed by a BN layer; output feature map size 55 × 55;
2) second layer: convolution kernel size 5 × 5, max pooling size 2 × 2, followed by a BN layer; output feature map size 27 × 27;
3) third layer: convolution kernel size 3 × 3, max pooling size 2 × 2, followed by a BN layer; output feature map size 13 × 13;
4) fourth layer: convolution kernel size 3 × 3; output feature map size 13 × 13;
5) fifth layer: convolution kernel size 3 × 3; output feature map size 13 × 13;
6) two fully connected layers, FC7 and FC8.
In total there are 5 convolutional layers, 3 max-pooling layers, 3 normalization (BN) layers and 2 fully connected layers; the output of the last fully connected layer, FC8, is a 4096-dimensional vector, which is the CNN feature.
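As an illustration, the PyTorch sketch below mirrors the layer structure listed above. The strides, paddings and channel counts are assumptions (the text specifies only the kernel sizes, 2 × 2 max pooling, BN placement and the 4096-dimensional FC outputs), so the feature-map sizes produced by this sketch differ slightly from those listed.

import torch
import torch.nn as nn

class RegionCNN(nn.Module):
    """Illustrative sketch of the 5-conv / 2-FC backbone described above.
    Only kernel sizes, 2x2 pooling, BN placement and 4096-D FC outputs come
    from the text; strides, paddings and channels are assumptions, so the
    spatial sizes here are 55 -> 27 -> 13 -> 6 rather than the listed ones."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # layer 1: 11x11 conv (stride 4) -> BN -> ReLU -> 2x2 max pool: 224 -> 55 -> 27
            nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.BatchNorm2d(96),
            nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
            # layer 2: 5x5 conv -> BN -> ReLU -> 2x2 max pool: 27 -> 13
            nn.Conv2d(96, 256, 5, padding=2), nn.BatchNorm2d(256),
            nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
            # layer 3: 3x3 conv -> BN -> ReLU -> 2x2 max pool: 13 -> 6
            nn.Conv2d(256, 384, 3, padding=1), nn.BatchNorm2d(384),
            nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
            # layers 4 and 5: 3x3 convolutions, spatial size stays 6x6
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # two fully connected layers FC7 and FC8; FC8 yields the 4096-D CNN feature
        self.fc7 = nn.Linear(256 * 6 * 6, 4096)
        self.fc8 = nn.Linear(4096, 4096)

    def forward(self, x):                        # x: (batch, 3, 224, 224)
        x = self.features(x).flatten(1)
        return self.fc8(torch.relu(self.fc7(x)))

feat = RegionCNN()(torch.randn(1, 3, 224, 224))
print(feat.shape)                                # torch.Size([1, 4096])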
In a specific implementation, the deep learning network consists of several alternating convolutional layers, pooling layers and fully connected layers, i.e., an input layer, several hidden layers and an output layer, and the network parameters are updated mainly with the back-propagation (BP) algorithm. The layers are connected by different convolution modes. For an ordinary convolutional layer, the feature maps of the previous layer are convolved with learnable convolution kernels and passed through an activation function to obtain the output feature maps. Each output map may combine the convolutions of several input maps:

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

where M_j denotes the set of selected input maps, i is the index of an input-layer unit, j is the index of an output-layer unit, k_{ij}^{l} is the weight between the input layer and the output layer (i.e., the value at each position of the convolution kernel), b_j^{l} is the additive bias between the layers, f(·) is the activation function of the output layer, x_j^{l} is the j-th output map of layer l, x_i^{l-1} is the i-th input map of layer l-1, the superscript l identifies the l-th convolutional layer, and * denotes convolution.
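Purely as an illustration of this formula, the following NumPy/SciPy sketch convolves a set of input maps with per-pair kernels, adds the bias and applies an activation; the map sizes, kernel values and choice of activation are arbitrary.

import numpy as np
from scipy.signal import convolve2d

def conv_layer(inputs, kernels, biases, act=np.tanh):
    """Toy illustration of x_j^l = f( sum_{i in M_j} x_i^{l-1} * k_ij^l + b_j^l ).
    inputs : list of 2-D input maps x_i^{l-1}
    kernels: kernels[j][i] is the kernel k_ij^l linking input i to output j
    biases : biases[j] is the additive bias b_j^l"""
    outputs = []
    for j, b in enumerate(biases):
        s = sum(convolve2d(x, kernels[j][i], mode="valid")
                for i, x in enumerate(inputs))
        outputs.append(act(s + b))          # activation f applied element-wise
    return outputs

# tiny usage example with two random 8x8 maps, three outputs and 3x3 kernels
maps = [np.random.rand(8, 8) for _ in range(2)]
ks = [[np.random.rand(3, 3) for _ in range(2)] for _ in range(3)]
outs = conv_layer(maps, ks, biases=[0.1, 0.0, -0.1])
print([o.shape for o in outs])              # three 6x6 output maps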
For the pooling layer, there are N input layers and N output layers, except that each output layer is smaller.
$$x_j^{l} = f\big(\beta_j^{l}\,\mathrm{down}(x_j^{l-1}) + b_j^{l}\big)$$

where down(·) denotes a down-sampling function. Typically all pixels in each distinct n × n region of the input image are summed, so the output image is reduced by a factor of n in both dimensions; the value of n can be preset by the user in a specific implementation. Each output map has its own multiplicative bias β and additive bias b: β_j^{l} is the multiplicative bias of the j-th output map of layer l, b_j^{l} is the additive bias of the j-th output map of layer l, x_j^{l} is the j-th output map of layer l, and x_j^{l-1} is the j-th input map of layer l-1.
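A minimal NumPy illustration of the pooling formula follows: down() is realized as a sum over non-overlapping n × n blocks, followed by the multiplicative and additive biases and the activation; the concrete values are arbitrary.

import numpy as np

def pooling_layer(x_prev, n, beta, b, act=np.tanh):
    """Toy illustration of x_j^l = f( beta_j^l * down(x_j^{l-1}) + b_j^l ):
    down() sums all pixels in each non-overlapping n x n region, so the
    output map is n times smaller in each dimension."""
    h, w = x_prev.shape
    down = x_prev[:h - h % n, :w - w % n].reshape(h // n, n, w // n, n).sum(axis=(1, 3))
    return act(beta * down + b)

x = np.random.rand(12, 12)
print(pooling_layer(x, n=2, beta=0.5, b=0.0).shape)   # (6, 6)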
For the output fully connected layer, it is often better to convolve the input feature maps and sum the convolved values to obtain an output map. Let α_ij denote the weight, or contribution, of the i-th input map in obtaining the j-th output feature map. The j-th output map can then be expressed as:

$$x_j^{l} = f\Big(\sum_{i=1}^{N_{in}} \alpha_{ij}\,\big(x_i^{l-1} * k_i^{l}\big) + b_j^{l}\Big), \qquad \sum_{i} \alpha_{ij} = 1, \quad 0 \le \alpha_{ij} \le 1$$

where k_i^{l} is the weight between the input layer and the output layer (i.e., the value at each position of the convolution kernel), b_j^{l} is the activation bias between the layers, x_j^{l} is the j-th output map of layer l, x_i^{l-1} is the i-th input map of layer l-1, and N_{in} indicates that the j-th output map is obtained from N_{in} input maps.
Step three: traditional feature extraction.
Traditional features are extracted from the ship target region obtained in step one. The visual features used by the invention are the Hu invariant moment feature and the LOMO feature. The embodiment is implemented as follows:
(a) Hu invariant moments. Invariant moments are shape features: numerical characteristics of an image that are invariant to translation, scaling and rotation. Fig. 2 is the flow chart of Hu invariant moment extraction. The input image is first preprocessed by median-filter smoothing and binarization, regions are then segmented with the Simple Linear Iterative Clustering (SLIC) segmentation algorithm, and finally the 7 Hu invariant moment features of each ship region are calculated. Smoothing, binarization and segmentation are prior art and are not repeated here. Assume that in the preprocessing stage the input image is discretized into a digital image f(x, y) of size M × N, where (x, y) are the coordinates of a pixel on the image; its geometric moments are defined as:
$$m_{pq} = \sum_{x=1}^{M}\sum_{y=1}^{N} x^{p} y^{q} f(x,y)$$

where p is the order in the x direction and q is the order in the y direction. The set {m_pq} is uniquely determined by f(x, y), and conversely f(x, y) is uniquely determined by {m_pq}.
The central moment u_pq of the image f(x, y) is defined as:

$$u_{pq} = \sum_{x=1}^{M}\sum_{y=1}^{N} (x - x_0)^{p} (y - y_0)^{q} f(x,y)$$

where x_0, y_0 are the coordinates of the image centroid, computed as:

$$x_0 = \frac{m_{10}}{m_{00}}, \qquad y_0 = \frac{m_{01}}{m_{00}}$$

where m_10 and m_01 are the 1st-order geometric moments of the image and m_00 is the 0th-order geometric moment. The central moments of the image of order no more than 3 can thus be obtained: u_00, u_01, u_10, u_11, u_20, u_02, u_12, u_21, u_30, u_03.
For a general gray scale image, the central moment has the following law:
1) u_20 and u_02 are the moments of inertia of the region's gray values about the vertical and horizontal axes through the gray centroid, respectively. If u_20 > u_02, the image is elongated in the horizontal direction; otherwise it is elongated in the vertical direction.
2) u_30 and u_03 can be used to measure the symmetry of the object about the vertical and horizontal axes, respectively. If u_30 = 0, the object is symmetric about the vertical axis; if u_03 = 0, the object is symmetric about the horizontal axis. Because the central moments remain sensitive to rotation and scale, scale invariance can be obtained by normalization; the normalized central moment η_pq is defined as:

$$\eta_{pq} = \frac{u_{pq}}{u_{00}^{r}}, \qquad r = \frac{p+q}{2} + 1$$

where r is an intermediate variable, p ≥ 0, q ≥ 0 and p + q ≥ 2.
Using the 2nd- and 3rd-order central moments, seven feature quantities Φ1 to Φ7 that are invariant to translation, scaling and rotation can be derived:

$$\Phi_1 = \eta_{20} + \eta_{02}$$
$$\Phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$
$$\Phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$
$$\Phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$
$$\Phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\big[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\big] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\big[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\big]$$
$$\Phi_6 = (\eta_{20} - \eta_{02})\big[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\big] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$$
$$\Phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\big[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\big] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\big[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\big]$$
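In practice the seven invariants can be computed directly with OpenCV, as sketched below; the binary rectangle stands in for a segmented ship region, and the median filtering and SLIC segmentation of the described flow are omitted. The log transform at the end is a common convention for compressing the dynamic range and is not part of the description above.

import cv2
import numpy as np

# stand-in binary mask for one segmented ship region
region = np.zeros((480, 1280), dtype=np.uint8)
cv2.rectangle(region, (400, 200), (900, 320), 255, -1)

m = cv2.moments(region, binaryImage=True)   # geometric and central moments m_pq, mu_pq
hu = cv2.HuMoments(m).flatten()             # the seven invariants Phi_1 .. Phi_7
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # optional log compression
print(hu_log)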
(b) LOMO feature. The Local Maximal Occurrence (LOMO) feature is a combination of color and texture features that describes the ship in the picture from both the color and the camera-viewpoint perspectives.
Fig. 3 is the flow chart of LOMO feature extraction. First, the image-enhancing Retinex algorithm is used to preprocess the input image, reducing the influence of illumination. The Retinex algorithm takes the color information of the picture into account; it aims to output a color image that is close to human perception and rich in color, and in particular it can enhance the detail of shadow regions.
The preprocessed image is then split equally into 5 vertical stripes, and within each vertical stripe a 20 × 20 sub-window is used to localize local blocks of the ship region, sliding with a stride of 10 pixels (an overlap of 10 pixels). Taking a ship image of size 1280 × 480 as an example, each vertical stripe is 256 × 480, the number of 20 × 20 sub-windows in each stripe is n = 25 × 47 = 1175, and the total number of sub-windows is 1175 × 5 = 5875; the exact number depends on the size of the ship target.
Within each sub-window, two SILTP histograms (SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, each with 3^4 bins) and an 8 × 8 × 8 joint HSV histogram are extracted; each histogram bin represents the probability of occurrence of a pattern within the sub-window. SILTP improves the LBP descriptor by introducing a scale-invariant local comparison tolerance, achieving invariance to intensity scale changes and robustness to noise. Let (x_c, y_c) be the position of the center pixel of a sub-window; SILTP is computed as:

$$SILTP^{\tau}_{Q,R}(x_c, y_c) = \bigoplus_{q=0}^{Q-1} s_{\tau}(I_c, I_q)$$

$$s_{\tau}(I_c, I_q) = \begin{cases} 01, & I_q > (1+\tau)\,I_c \\ 10, & I_q < (1-\tau)\,I_c \\ 00, & \text{otherwise} \end{cases}$$

where I_c is the gray value of the center pixel of the sub-window, I_q are the gray values of the Q neighborhood pixels at radius R, ⊕ concatenates the binary values of all neighbors into a string, τ is the threshold (tolerance) range, and s_τ(I_c, I_q) is the binary value at a given pixel position. Referring to FIG. 3, SILTP^{0.3}_{4,3} indicates that texture features are extracted within a 4-neighborhood of radius 3 with a threshold of 0.3; similarly, SILTP^{0.3}_{4,5} extracts texture features within a 4-neighborhood of radius 5 with a threshold of 0.3.
All sub-windows at the same vertical position are then compared, and the maximum value of each type of histogram over these sub-windows is taken as the final histogram. The resulting histogram is invariant to viewpoint changes while still capturing the local characteristics of the ship target region.
In the embodiment, the specific implementation is as follows:
1) color is an important feature that describes visible light images. However, since the illumination condition of the video camera installed in the coastal area is not controllable, the camera is differently set. Thus, the color between pictures may differ in different camera views. The invention comprises the following steps:
firstly, the Retinex algorithm is adopted to preprocess the input image, so that the influence caused by illumination is reduced. The Retinex algorithm considers the color information of the picture, aims to output a color image which is close to human perception and rich in color, and particularly can enhance the detail information of a shadow area.
Second, for the picture preprocessed by the Retinex algorithm, color features are extracted with an HSV color histogram. In addition, SILTP (Scale Invariant Local Ternary Pattern) descriptors are applied to extract illumination-invariant texture features of the picture. SILTP improves the LBP descriptor by introducing a scale-invariant local comparison tolerance, achieving invariance to intensity scale changes and robustness to noise.
2) Ships seen by different cameras usually appear at different viewing angles, which also makes ship detection difficult. The invention therefore uses a sliding window to describe the local details of the ship region. Specifically:
Local blocks of the ship region are first localized with a 20 × 20 sub-window and an overlap of 10 pixels. Within each sub-window, two SILTP histograms (3^4 bins each) and an 8 × 8 × 8 joint HSV histogram are extracted, each histogram representing the probability of occurrence of a pattern within the sub-window.
All sub-windows at the same vertical position are then compared, and the maximum value of each type of histogram over these sub-windows is taken as the final histogram. The resulting histogram is invariant to viewpoint changes while still capturing the local characteristics of the ship target region.
The invention takes a ship target size of 1280 × 480 as an example; after scaling, targets of 640 × 240 and 320 × 120 are also obtained. By concatenating all features, the resulting final feature has (8 × 8 × 8 color-histogram bins + 3^4 × 2 SILTP bins) × (127 + 63 + 31) vertical-stripe groups, i.e., 694 × 221 = 153,374 dimensions.
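As an illustration of the color part of the LOMO computation described above, the sketch below splits the image into 5 vertical stripes, slides a 20 × 20 sub-window with a stride of 10, builds an 8 × 8 × 8 joint HSV histogram per sub-window, and keeps the bin-wise maximum over sub-windows at the same vertical position. The exact grouping of sub-windows and the HSV bin ranges are assumptions; the SILTP histograms and multi-scale concatenation are omitted for brevity.

import cv2
import numpy as np

def lomo_color(img_bgr, win=20, stride=10, n_stripes=5):
    """Sketch of the color part of LOMO: per-stripe sliding sub-windows,
    8x8x8 joint HSV histograms, and a bin-wise maximum over sub-windows
    sharing the same vertical position."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    stripe_w = w // n_stripes
    feats = []
    for s in range(n_stripes):
        stripe = hsv[:, s * stripe_w:(s + 1) * stripe_w]
        for y in range(0, h - win + 1, stride):            # one group per vertical position
            best = None
            for x in range(0, stripe_w - win + 1, stride):
                patch = stripe[y:y + win, x:x + win]
                hist = cv2.calcHist([patch], [0, 1, 2], None, [8, 8, 8],
                                    [0, 180, 0, 256, 0, 256]).ravel()
                hist /= hist.sum() + 1e-12                  # occurrence probabilities
                best = hist if best is None else np.maximum(best, hist)
            feats.append(best)
    return np.concatenate(feats)

feat = lomo_color(np.random.randint(0, 255, (480, 1280, 3), dtype=np.uint8))
print(feat.shape)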
Step four: feature dimension reduction.
The invariant moment features and the LOMO features obtained in step three are concatenated; the resulting dimensionality is very large, so the embodiment of the invention uses a principal component analysis (PCA) algorithm to reduce the concatenated traditional features to 4096 dimensions. The principal component analysis algorithm is prior art and is not described in detail here.
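A brief scikit-learn sketch of this dimension-reduction step follows; random data stands in for the concatenated Hu + LOMO vectors, and the number of retained components is capped by the sample count because PCA cannot return more components than samples.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10000)                 # placeholder: n_samples x original_dim
pca = PCA(n_components=min(4096, X.shape[0], X.shape[1]))
X_reduced = pca.fit_transform(X)               # reduced traditional features
print(X_reduced.shape, pca.explained_variance_ratio_.sum())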
Step five: constructing the feature fusion network.
In order to map the CNN features and the traditional features into a unified feature space, the invention proposes a feature fusion network (FCNN). FIG. 4 shows the structure of the fusion network, in which the deep-learning hyper-parameters are updated during back propagation under the constraint of the traditional features. The fused features are more discriminative than either the CNN features or the traditional features alone.
The embodiment is specifically implemented as follows.
the FC7 and FC8 layers are output layers of the convolutional neural network, the output of the conventional features is also 4096-dimensional feature vectors, and the input of the fusion layer (i.e. FC9 layer) is CNN features and the conventional features:
x=[LOMO+Gu,CNNfeatures]
where x is the input to the fusion layer, LOMO is the local maximization feature, Hu is the invariant moment feature, and CNNFETURES are the convolutional neural network features. Output of fused layer (4096-dimensional) ZFusion(x) Comprises the following steps:
Figure GDA0002939992890000091
where h () represents the activation function, with the modified linear unit ReLU,
Figure GDA0002939992890000092
is a weight, bFusionIs an offset.
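The fusion-layer forward pass can be illustrated with plain NumPy as below; the weight and bias shapes are assumptions consistent with a 4096-dimensional output, and the random values are placeholders.

import numpy as np

def fusion_layer(cnn_feat, trad_feat, W, b):
    """Sketch of the FC9 fusion layer: concatenate the 4096-D traditional feature
    with the 4096-D CNN feature and apply Z = ReLU(W^T x + b)."""
    x = np.concatenate([trad_feat, cnn_feat])        # 8192-D fused input
    z = W.T @ x + b                                  # W: (8192, 4096), b: (4096,)
    return np.maximum(z, 0.0)                        # ReLU activation h(.)

rng = np.random.default_rng(0)
z = fusion_layer(rng.standard_normal(4096), rng.standard_normal(4096),
                 rng.standard_normal((8192, 4096)) * 0.01, np.zeros(4096))
print(z.shape)                                       # (4096,)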
Assume that the number of ship classes to be detected is T. The output of the FC9 layer is a 4096 × 1 vector, and the output of the softmax (regression) layer is a T × 1 vector whose entries range from 0 to 1 and represent the probability that the sample belongs to each class. The computation from the FC9 layer to the softmax layer is the network training: an optimal T × 4096 matrix is sought so that the loss of the softmax layer is minimal. The computation follows the BP algorithm; after an iteration the parameters of layer l are:

$$W^{(l)} = W^{(l)} - \alpha\Big[\frac{1}{m}\,\Delta W^{(l)} + \lambda W^{(l)}\Big]$$

$$b^{(l)} = b^{(l)} - \alpha\Big[\frac{1}{m}\,\Delta b^{(l)}\Big]$$

where the left-hand sides are the weight and bias of layer l after the iteration; W^{(l)} and b^{(l)} on the right-hand sides are the weight and bias of layer l before the iteration; ΔW^{(l)} and Δb^{(l)} are the accumulated weight and bias gradients of layer l in the iteration; α is the learning rate, λ is the regularization (weight-decay) coefficient, and m is the number of samples.
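The per-layer update written above corresponds to ordinary mini-batch gradient descent with weight decay, as the small NumPy sketch below shows; the learning rate, decay coefficient and batch size are arbitrary example values.

import numpy as np

def update_layer(W, b, dW, db, alpha, lam, m):
    """W <- W - alpha * ((1/m) * dW + lam * W);  b <- b - alpha * (1/m) * db."""
    W_new = W - alpha * (dW / m + lam * W)
    b_new = b - alpha * (db / m)
    return W_new, b_new

W, b = np.ones((4096, 10)), np.zeros(10)
dW, db = 0.1 * np.ones_like(W), 0.1 * np.ones_like(b)
W, b = update_layer(W, b, dW, db, alpha=0.01, lam=1e-4, m=32)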
The class probability p(y = j | x; θ) used in the loss function is calculated as:

$$p\big(y^{(i)} = j \mid x^{(i)};\, \theta\big) = \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{k=1}^{n} e^{\theta_k^{T} x^{(i)}}}$$

where y denotes an output node of the network; j denotes the output value, i.e., the class number; x denotes the input vector; θ denotes all model parameters, of size k × (n + 1); e denotes the base of the natural logarithm; θ_j denotes the model parameters of the j-th class; θ_k denotes the model parameters of the k-th class; n denotes the total number of classes; and k denotes the k-th class.
The last layer of the network uses the cross-entropy loss:

$$J = -\sum_{k=1}^{T} y_k \log P_k$$

where P_k is the probability output of the last layer for class k, y_k is the corresponding ground-truth indicator, and J is the cross-entropy loss obtained by operating on the probability output of each class.
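For illustration, the softmax probability and the cross-entropy loss of the regression layer can be computed as in the following NumPy sketch; the number of classes T = 5, the 4096-dimensional fused feature and the parameter values are arbitrary placeholders.

import numpy as np

def softmax(logits):
    """p(y = j | x) = exp(theta_j^T x) / sum_k exp(theta_k^T x), computed stably."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_class):
    """J = -sum_k y_k * log(P_k) with a one-hot ground truth y."""
    return -np.log(probs[true_class] + 1e-12)

theta = np.random.randn(5, 4096) * 0.01       # T = 5 classes, 4096-D fused feature
x = np.random.randn(4096)
p = softmax(theta @ x)                        # T x 1 probabilities in [0, 1]
print(p, cross_entropy(p, true_class=2))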
Step six: training the feature fusion network FCNN.
The feature fusion network FCNN is trained with the sample data, and the trained FCNN is verified and tested with the test data.
In the embodiment, 3500 training pictures and 3500 test pictures are used; after the fusion network has been trained with the training pictures, testing is carried out with the test pictures. A detection picture is input into the trained model to obtain the result.
This completes the description of the specific implementation of the ship target detection method with joint training of deep learning features and traditional features. In a specific implementation, the procedure provided by the technical scheme of the invention can be run automatically by a person skilled in the art using computer software technology.
The specific examples described herein are merely illustrative of the invention. Those skilled in the art can make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or exceeding the scope defined in the appended claims.

Claims (6)

1. A ship target detection method with joint training of deep learning features and traditional features, characterized by comprising the following steps:
step one, sample data collection: collecting surveillance video frame data of coastal areas under visible light, extracting images, and labeling the images that contain ship targets;
step two, CNN feature extraction: inputting the obtained samples into a convolutional neural network for training to obtain a trained model of the ship target, the convolutional neural network outputting the CNN features;
step three, traditional feature extraction: extracting the invariant moment features and LOMO features of the ship target region;
step four, feature dimension reduction: concatenating the invariant moment features with the LOMO features, and reducing the dimensionality of the concatenated traditional features with a principal component analysis algorithm;
step five, constructing a feature fusion network FCNN to map the CNN features and the traditional features into a unified feature space;
step six, training the feature fusion network FCNN with the sample data, and verifying and testing the trained FCNN with test data.
2. The ship target detection method with joint training of deep learning features and traditional features according to claim 1, characterized in that: in step one, the images containing ship targets are labeled according to the PASCAL VOC data set standard, and the generated annotation files record the coordinates of the four vertices of the minimum enclosing rectangle of each ship target on every image together with the corresponding image, thereby constructing a ship image sample library.
3. The ship target detection method with joint training of deep learning features and traditional features according to claim 1, characterized in that: in step two, a region-based convolutional neural network is adopted, consisting of several alternating convolutional layers, pooling layers and fully connected layers and updated with a back-propagation algorithm.
4. The ship target detection method with joint training of deep learning features and traditional features according to claim 3, characterized in that: in step two, the structure of the adopted region-based convolutional neural network is as follows:
1) first layer: convolution kernel size 11 × 11, max pooling size 2 × 2, followed by a BN layer; output feature map size 55 × 55;
2) second layer: convolution kernel size 5 × 5, max pooling size 2 × 2, followed by a BN layer; output feature map size 27 × 27;
3) third layer: convolution kernel size 3 × 3, max pooling size 2 × 2, followed by a BN layer; output feature map size 13 × 13;
4) fourth layer: convolution kernel size 3 × 3; output feature map size 13 × 13;
5) fifth layer: convolution kernel size 3 × 3; output feature map size 13 × 13;
6) two fully connected layers, FC7 and FC8.
5. The ship target detection method with joint training of deep learning features and traditional features according to claim 1, characterized in that: in step three, the LOMO feature comprehensively considers the influence of illumination and viewing-angle changes on the image; first, the Retinex algorithm is used to preprocess the input image, reducing the influence of illumination; second, for the image preprocessed by the Retinex algorithm, color features are extracted with an HSV color histogram; in addition, SILTP descriptors are applied to extract illumination-invariant texture features of the image.
6. The ship target detection method with joint training of deep learning features and traditional features according to any one of claims 1 to 5, characterized in that: in step five, the feature fusion network FCNN contains a fusion layer and a regression layer; the input of the fusion layer is the CNN features and the traditional features; if the number of ship classes to be detected is T, the output of the regression layer is a T × 1 vector whose entries range from 0 to 1 and represent the probability that the sample belongs to each class.
CN201811050911.XA 2018-09-10 2018-09-10 Ship target detection method for deep learning feature and visual feature combined training Active CN109376591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811050911.XA CN109376591B (en) 2018-09-10 2018-09-10 Ship target detection method for deep learning feature and visual feature combined training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811050911.XA CN109376591B (en) 2018-09-10 2018-09-10 Ship target detection method for deep learning feature and visual feature combined training

Publications (2)

Publication Number Publication Date
CN109376591A CN109376591A (en) 2019-02-22
CN109376591B true CN109376591B (en) 2021-04-16

Family

ID=65405386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811050911.XA Active CN109376591B (en) 2018-09-10 2018-09-10 Ship target detection method for deep learning feature and visual feature combined training

Country Status (1)

Country Link
CN (1) CN109376591B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298271A (en) * 2019-06-17 2019-10-01 上海大学 Seawater method for detecting area based on critical point detection network and space constraint mixed model
CN110555465B (en) * 2019-08-13 2022-03-11 成都信息工程大学 Weather image identification method based on CNN and multi-feature fusion
CN111639513A (en) * 2019-12-10 2020-09-08 珠海大横琴科技发展有限公司 Ship shielding identification method and device and electronic equipment
CN111178165B (en) * 2019-12-12 2023-07-18 河南省润通路空一体交通发展有限公司 Automatic extraction method for air-to-ground target information based on small sample training video
CN111612028A (en) * 2019-12-13 2020-09-01 珠海大横琴科技发展有限公司 Ship feature optimization method and device based on deep learning and electronic equipment
CN111368690B (en) * 2020-02-28 2021-03-02 珠海大横琴科技发展有限公司 Deep learning-based video image ship detection method and system under influence of sea waves
CN112491854B (en) * 2020-11-19 2022-12-09 郑州迪维勒普科技有限公司 Multi-azimuth security intrusion detection method and system based on FCNN
CN113691940B (en) * 2021-08-13 2022-09-27 天津大学 Incremental intelligent indoor positioning method based on CSI image
TWI771250B (en) * 2021-12-16 2022-07-11 國立陽明交通大學 Device and method for reducing data dimension, and operating method of device for converting data dimension

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292259A (en) * 2017-06-15 2017-10-24 国家新闻出版广电总局广播科学研究院 The integrated approach of depth characteristic and traditional characteristic based on AdaRank
CN107563303A (en) * 2017-08-09 2018-01-09 中国科学院大学 A kind of robustness Ship Target Detection method based on deep learning
WO2018067080A1 (en) * 2016-10-07 2018-04-12 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi A marine vessel identification method
CN108388904A (en) * 2018-03-13 2018-08-10 中国海洋大学 A kind of dimension reduction method based on convolutional neural networks and covariance tensor matrix

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018067080A1 (en) * 2016-10-07 2018-04-12 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi A marine vessel identification method
CN107292259A (en) * 2017-06-15 2017-10-24 国家新闻出版广电总局广播科学研究院 The integrated approach of depth characteristic and traditional characteristic based on AdaRank
CN107563303A (en) * 2017-08-09 2018-01-09 中国科学院大学 A kind of robustness Ship Target Detection method based on deep learning
CN108388904A (en) * 2018-03-13 2018-08-10 中国海洋大学 A kind of dimension reduction method based on convolutional neural networks and covariance tensor matrix

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ruiqian Zhang et al., "S-CNN-based ship detection from high-resolution remote sensing images", The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2016-12-31, vol. XLI-B7, pp. 423-430 *
Li Shuang (李爽), "Image classification algorithm based on multi-feature fusion and deep learning" (基于多特征融合和深度学习的图像分类算法), 2018-08-31, vol. 46, no. 4, pp. 50-56 *
Zhang Jianhu (张建虎), "Research and implementation of multi-feature fusion for target recognition" (面向目标识别的多特征融合研究与实现), China Master's Theses Full-text Database, Information Science and Technology, 2018-06-15, no. 6, p. I138-1311 *

Also Published As

Publication number Publication date
CN109376591A (en) 2019-02-22

Similar Documents

Publication Publication Date Title
CN109376591B (en) Ship target detection method for deep learning feature and visual feature combined training
He et al. A fully convolutional neural network for wood defect location and identification
CN111553929B (en) Mobile phone screen defect segmentation method, device and equipment based on converged network
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
Li et al. SAR image change detection using PCANet guided by saliency detection
Yin et al. Hot region selection based on selective search and modified fuzzy C-means in remote sensing images
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN109871902B (en) SAR small sample identification method based on super-resolution countermeasure generation cascade network
CN109740665A (en) Shielded image ship object detection method and system based on expertise constraint
CN111626993A (en) Image automatic detection counting method and system based on embedded FEFnet network
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN109919223B (en) Target detection method and device based on deep neural network
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN111598098A (en) Water gauge water line detection and effectiveness identification method based on full convolution neural network
CN111368742B (en) Reconstruction and identification method and system of double yellow traffic marking lines based on video analysis
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN109726660A (en) A kind of remote sensing images ship identification method
CN112884795A (en) Power transmission line inspection foreground and background segmentation method based on multi-feature significance fusion
Li et al. Evaluation the performance of fully convolutional networks for building extraction compared with shallow models
Wang et al. Scattering Information Fusion Network for Oriented Ship Detection in SAR Images
CN115700737A (en) Oil spill detection method based on video monitoring
CN113011359A (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN113011438A (en) Node classification and sparse graph learning-based bimodal image saliency detection method
CN110910497B (en) Method and system for realizing augmented reality map

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant