CN111368850B - Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal - Google Patents

Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal

Info

Publication number
CN111368850B
CN111368850B (application CN201811589348.3A)
Authority
CN
China
Prior art keywords
convolution
feature
module
feature mapping
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811589348.3A
Other languages
Chinese (zh)
Other versions
CN111368850A (en)
Inventor
刘阳
罗小伟
林福辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Tianjin Co Ltd
Original Assignee
Spreadtrum Communications Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Tianjin Co Ltd
Priority to CN201811589348.3A
Publication of CN111368850A
Application granted
Publication of CN111368850B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image feature extraction method, an image target detection method, an image feature extraction device, an image target detection device, a convolution device, a CNN network device and a terminal are provided. The convolution device comprises: a channel expansion module, configured to perform a convolution operation on a feature mapping of image data and expand the number of channels of the feature mapping obtained by the convolution to obtain a first feature mapping; a depth separation convolution module, configured to perform depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and a channel compression module, configured to receive the second feature mapping output by the depth separation convolution module, perform a convolution operation on the second feature mapping, and compress the number of channels of the data after the convolution operation to obtain a third feature mapping, where the number of channels of the third feature mapping is smaller than that of the first feature mapping. With the technical scheme provided by the invention, the complexity of image convolution calculation can be reduced, calculation efficiency can be improved, and the difficulty of feature extraction can be reduced.

Description

Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal
Technical Field
The invention relates to the technical field of target detection, in particular to a method and a device for extracting image features and detecting a target, a convolution device, a CNN network device and a terminal.
Background
Target detection is a core problem in the field of computer vision. Its main aim is to analyze image or video information and judge whether certain objects (such as human faces, pedestrians or automobiles) are present; if so, the specific location of each object is determined. Target detection technology can be widely applied to fields such as security monitoring, autonomous driving and human-computer interaction, and is a prerequisite for higher-order tasks such as behavior analysis and semantic analysis.
There are many target detection methods. Among traditional methods, the most influential are the component-based Deformable Part Model (DPM) and the AdaBoost cascaded model. The former is mainly applied to pedestrian detection, and the latter is mainly applied to face detection. However, the detection accuracy and adaptability of both have been surpassed by deep learning methods based on the Convolutional Neural Network (CNN), which are now widely applied in the field of target detection. CNN-based target detection methods can be divided into two categories: one is based on target candidate windows, typically represented by the Faster Regions with CNN (Faster R-CNN) detection method; the other is independent of candidate windows (Proposal Free), typically represented by the Single Shot MultiBox Detector (SSD) detection method and the You Only Look Once (YOLO) detection method.
However, the target detection accuracy greatly depends on the feature extraction of the image data. The feature extraction method of image data relies on image convolution to extract salient features. The existing image convolution method has high complexity of extracting image features and long time consumption.
Disclosure of Invention
The technical problem solved by the invention is how to optimize a convolution device so as to reduce the complexity of convolution calculation and improve calculation efficiency, thereby helping to maintain high feature extraction precision while reducing the complexity of feature extraction.
To solve the above technical problem, an embodiment of the present invention provides an image convolution apparatus, including: the channel expansion module is used for performing convolution operation on the feature mapping of the image data and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping; the depth separation convolution module is used for performing depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and the channel compression module is used for receiving the second feature mapping output by the depth separation convolution module, carrying out convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, wherein the channel number of the third feature mapping is smaller than that of the first feature mapping.
Optionally, the channel expansion module includes: a first convolution layer submodule, configured to determine (e.Min) as the number of output channels of the first convolution layer submodule and to perform (e.Min) M × M convolutions on the feature mapping input to the channel expansion module, where M and Min are positive integers, e represents a preset expansion coefficient, e is greater than 1 and e is a positive integer, and Min represents the number of channels of the feature mapping; a first batch normalization layer submodule, configured to perform batch normalization on the output result of the first convolution layer submodule; and a first restricted linear unit layer submodule, configured to perform restricted linear processing on the data output by the first batch normalization layer submodule to obtain the first feature mapping.
Optionally, the depth separation convolution module includes: a depth separation convolution layer submodule, configured to perform N × N depth separation convolution on the first feature mapping, where N > M and N is a positive integer; a second batch normalization layer submodule, configured to perform batch normalization on the convolution result obtained by the depth separation convolution layer submodule; and a second restricted linear unit layer submodule, configured to perform restricted linear processing on the data obtained by the second batch normalization layer submodule to obtain the second feature mapping.
Optionally, the channel compression module includes: a second convolution layer submodule, configured to determine (e · Min) as the number of input channels of the second convolution layer submodule and to perform Mout M × M convolutions on the second feature mapping, where Mout is a positive integer and represents the number of output channels of the channel compression module; and a third batch normalization layer submodule, configured to perform batch normalization on the convolution result output by the second convolution layer submodule.
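For illustration, the following is a minimal sketch of such a convolution device in PyTorch-style Python. It is an assumption-laden reading of the modules described above, not the patent's reference implementation: the class and argument names (ConvDevice, min_ch, mout_ch, e, M, N), the use of ReLU6 for the restricted linear unit layers, and the padding choices are all illustrative.

```python
import torch
import torch.nn as nn

class ConvDevice(nn.Module):
    """Sketch of the convolution device: channel expansion (M x M convolution,
    batch normalization, restricted linear unit), depth separation convolution
    (N x N depthwise convolution, batch normalization, restricted linear unit)
    and channel compression (M x M convolution, batch normalization), with an
    optional residual connection when input and output channel counts match."""

    def __init__(self, min_ch, mout_ch, e=6, M=1, N=3):
        super().__init__()
        mid = e * min_ch                       # expanded channel count (e * Min)
        # Channel expansion module: expand Min -> e * Min channels
        self.expand = nn.Sequential(
            nn.Conv2d(min_ch, mid, kernel_size=M, padding=M // 2, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU6(inplace=True),            # ReLU6 is an assumption for the restricted linear unit
        )
        # Depth separation convolution module: N x N depthwise convolution on e * Min channels
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid, mid, kernel_size=N, padding=N // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU6(inplace=True),
        )
        # Channel compression module: compress e * Min -> Mout channels (no nonlinearity)
        self.compress = nn.Sequential(
            nn.Conv2d(mid, mout_ch, kernel_size=M, padding=M // 2, bias=False),
            nn.BatchNorm2d(mout_ch),
        )
        self.use_residual = (min_ch == mout_ch)

    def forward(self, x):
        y = self.compress(self.depthwise(self.expand(x)))   # first -> second -> third feature mapping
        return x + y if self.use_residual else y            # residual module when channel counts match
```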
Optionally, the channel expansion module includes: a first convolution batch processing layer submodule, configured to determine (e.Min) as the number of output channels of the first convolution batch processing layer submodule and to perform (e.Min) M × M convolutions and batch normalization on the feature mapping input to the channel expansion module by adopting the following formula, where M is a positive integer, e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping,
z = s·(w·x + b - m)/δ + t
a first limited linear unit layer submodule, configured to perform limited linear processing on output data of the first convolution batch processing layer submodule to obtain the first feature map; wherein z is output data of the first convolution batch processing layer sub-module, w is a weight parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, b is a bias parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, δ is a preset standard deviation parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, s is a preset scale parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, and t is a preset offset parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping.
Optionally, the depth separation convolution module includes: a depth separation convolution batch processing layer submodule, configured to perform N × N depth separation convolution and batch normalization on the data input to the depth separation convolution module by adopting the following formula, where N > M and N is a positive integer;
z1 = s1·(w1·x1 + b1 - m1)/δ1 + t1
a second limited linear unit layer submodule, configured to perform limited linear processing on the output data of the depth separation convolution batch processing layer submodule to obtain the second feature mapping; wherein z1 is the second feature mapping, w1 is a weight parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, x1 is the first feature mapping, b1 is a bias parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, m1 is a preset mean parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, δ1 is a preset standard deviation parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, s1 is a preset scale parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, and t1 is a preset offset parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping.
Optionally, the channel compression module includes: a second convolution batch processing layer sub-module for determining (e.Min) as the number of input channels of the second convolution batch processing layer sub-module, and performing Mout times of M × M convolution and batch processing normalization on the data input to the channel compression module by adopting the following formula, wherein Mout is a positive integer and represents the number of output channels of the channel compression module,
z2 = s2·(w2·x2 + b2 - m2)/δ2 + t2
wherein z2 is the third feature mapping, w2 is a weight parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, x2 is the second feature mapping, b2 is a bias parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, m2 is a preset mean parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, δ2 is a preset standard deviation parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, s2 is a preset scale parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, and t2 is a preset offset parameter of the second convolution batch processing layer submodule determined based on the second feature mapping.
Optionally, M = 1 and N = 3.
Optionally, the convolution device further includes: a residual module, configured to calculate a sum of each data element of the feature map and each data element of the output data when the number of channels of the feature map input to the channel expansion module is equal to the number of channels of the output data of the channel compression module.
Optionally, the convolution device further includes: and the point-by-point convolution module is suitable for performing point-by-point convolution on the data input to the point-by-point convolution module.
To solve the foregoing technical problem, an embodiment of the present invention further provides a CNN network device, including an input layer module, and a first convolution layer module connected to the input layer module, where the CNN network device further includes: and a convolution device for performing convolution operation on the feature map of the image data output by the first convolution layer module, wherein the convolution device is the convolution device.
Optionally, the CNN network device further includes: and the second convolution layer module is used for receiving the third feature mapping output by the convolution device and performing point-by-point convolution on the third feature mapping.
Optionally, the CNN network device further includes: a third convolutional layer module connected to the second convolutional layer module, the third convolutional layer module including a plurality of cascaded third convolutional layer sub-modules, each third convolutional layer sub-module being configured to perform N × N convolution or M × M convolution with a sliding step size of P, P being a positive integer greater than 1, and M, N being positive integers.
Optionally, the CNN network device further includes: and the feature layer extracting module comprises a plurality of cascaded feature layer extracting submodules, and each feature layer extracting submodule is respectively used for receiving the convolution results output by the second convolution layer module and each third convolution layer submodule and carrying out N multiplied by N convolution on each convolution result so as to extract feature information of the image data.
In order to solve the above technical problem, an embodiment of the present invention further provides an image target detection apparatus, including: a feature extraction module adapted to extract feature information of image data based on the CNN network device; the prediction module is suitable for predicting a preset anchor point window based on the characteristic information to obtain a prediction result; and the suppression module is suitable for carrying out non-extreme suppression processing on the prediction result to obtain each detection target.
In order to solve the above technical problem, an embodiment of the present invention further provides a method for extracting features of an image, including: performing convolution operation on the feature mapping of the image data, and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping; performing depth separation convolution on the first feature mapping to obtain a second feature mapping; and performing convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, so that the channel number of the third feature mapping is smaller than the channel number of the first feature mapping.
Optionally, the performing convolution operation on the feature map of the image data and expanding the number of channels of the feature map obtained by convolution to obtain the first feature map includes: determining (e.Min) as the channel number of the first feature mapping, wherein e represents a preset expansion coefficient, e >1, and e and Min are positive integers, and Min represents the channel number of the feature mapping; performing (e.Min) times of M × M convolution on the feature mapping to obtain a first convolution result, wherein M is a positive integer; carrying out batch processing normalization on the first convolution result to obtain a first normalization result; and performing limited linear processing on the first normalization result to obtain the first feature mapping.
Optionally, the performing depth separation convolution on the first feature map to obtain a second feature map includes: performing N × N deep separation convolution on the first feature mapping to obtain a second convolution result, wherein N > M and N is a positive integer; carrying out batch processing normalization on the second convolution result to obtain a second normalization result; and performing limited linear processing on the second normalization result to obtain the second feature mapping.
Optionally, the performing convolution operation on the second feature map and compressing the number of output channels of the data after convolution operation includes: determining Mout as the number of channels of the third feature mapping, wherein Mout is a positive integer; conducting Mout times of M multiplied by M convolution on the second feature mapping to obtain a third convolution result; and carrying out batch processing normalization on the third convolution result to obtain the third feature mapping.
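As a hedged usage illustration of these three steps, the snippet below runs a feature map through the ConvDevice sketch given earlier; the tensor sizes and the choices M = 1, N = 3, e = 6 are made-up example values rather than values from the patent.

```python
import torch

# A feature mapping with Min = 16 channels and spatial size 38 x 38 (illustrative values).
fmap = torch.randn(1, 16, 38, 38)

block = ConvDevice(min_ch=16, mout_ch=16, e=6, M=1, N=3)   # ConvDevice from the sketch above
out = block(fmap)      # expansion -> depth separation convolution -> compression (+ residual here)
print(out.shape)       # torch.Size([1, 16, 38, 38]); channel count Mout = 16 < e * Min = 96
```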
Optionally, the performing convolution operation on the feature map of the image data and expanding the number of channels of the feature map obtained by convolution to obtain the first feature map includes: determining (e.Min) as the number of channels of the first feature map, e representing a preset expansion coefficient, e>1, and e and Min are positive integers, wherein Min represents the number of channels of the feature mapping; performing (e.Min) times of M multiplied by M convolution on the feature mapping by adopting the following formula, and performing batch processing normalization, wherein M is a positive integer;
z = s·(w·x + b - m)/δ + t
performing limited linear processing on the output data after batch processing normalization to obtain the first feature mapping; wherein z is the first feature mapping, w is a weight parameter determined by the feature mapping, b is a bias parameter corresponding to the feature data, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter.
Optionally, the performing depth separation convolution on the first feature map to obtain a second feature map includes: performing N × N depth separation convolution and batch normalization on the first feature map by adopting the following formula, where N > M and N is a positive integer;
z1 = s1·(w1·x1 + b1 - m1)/δ1 + t1
performing limited linear processing on the output data after batch normalization to obtain the second feature mapping; wherein z1 is the second feature mapping, w1 is a weight parameter corresponding to the first feature mapping, x1 is the first feature mapping, b1 is a bias parameter corresponding to the first feature mapping, m1 is a preset mean parameter, δ1 is a preset standard deviation parameter, s1 is a preset scale parameter, and t1 is a preset offset parameter.
Optionally, the performing convolution operation on the second feature map and compressing the number of output channels of the data after the convolution operation include: determining Mout as the number of channels of the third feature mapping, where Mout is a positive integer and represents the number of output channels of the channel compression module; and performing Mout M × M convolutions on the second feature map and batch normalization by using the following formula,
z2 = s2·(w2·x2 + b2 - m2)/δ2 + t2
wherein z2 is the third feature mapping, w2 is a weight parameter determined by the second feature mapping, x2 is the second feature mapping, b2 is a bias parameter determined by the second feature mapping, m2 is a preset mean parameter, δ2 is a preset standard deviation parameter, s2 is a preset scale parameter, and t2 is a preset offset parameter.
Optionally, M = 1 and N = 3.
Optionally, the feature extraction method further includes: when the number of channels of the feature map is equal to the number of channels of the third feature map, calculating a sum of each data element of the feature map and each data element of the third feature map to obtain a fourth feature map.
Optionally, the feature extraction method further includes: and performing point-by-point convolution on the fourth feature map to obtain a fifth feature map.
Optionally, the feature extraction method further includes: and performing point-by-point convolution on the third feature map to obtain a sixth feature map.
In order to solve the above technical problem, an embodiment of the present invention further provides a method for detecting an image target, including: extracting feature information of the image data based on the feature extraction method of the image; predicting a preset anchor point window based on the characteristic information to obtain a prediction result; and carrying out non-extreme value suppression processing on the prediction result to obtain each detection target.
In order to solve the foregoing technical problem, an embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the foregoing method.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
an embodiment of the present invention provides an image convolution apparatus, including: the channel expansion module is used for performing convolution operation on the feature mapping of the image data and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping; the depth separation convolution module is used for performing depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and the channel compression module is used for receiving the second feature mapping output by the depth separation convolution module, carrying out convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, wherein the channel number of the third feature mapping is smaller than that of the first feature mapping. The technical scheme provided by the embodiment of the invention can carry out convolution processing on the feature mapping of the image data, and after the channel expansion module expands the number of the channels of the data, the depth separation convolution operation is carried out based on the depth separation convolution module, so that more feature information can be extracted, the number of the channels of the third feature mapping obtained after the operation is compressed, the convolution operation scale can be reduced under the condition of keeping higher detection precision, the convolution operation complexity is reduced, and the possibility of carrying out light-weight feature extraction on the mobile terminal is provided.
Further, an embodiment of the present invention provides a CNN network device, which includes an input layer module, a first convolution layer module connected to the input layer module, and further includes: and a convolution device for performing convolution operation on the feature map of the image data output by the first convolution layer module, wherein the convolution device is the convolution device. Compared with the prior art, the CNN network device provided by the embodiment of the invention has smaller convolution operation scale, is easy to realize light-weight feature extraction on the mobile terminal, and can reduce the calculation complexity of the CNN network forward reasoning due to the smaller operation scale.
Further, an embodiment of the present invention provides an image target detection apparatus, including: a feature extraction module adapted to extract feature information of image data based on the CNN network device; the prediction module is suitable for predicting a preset anchor point window based on the characteristic information to obtain a prediction result; and the suppression module is suitable for carrying out non-extreme suppression processing on the prediction result to obtain each detection target. The target detection device provided by the embodiment of the invention adopts the convolution device with lower calculation complexity as the CNN basic network, so that the target detection complexity can be reduced on the premise of keeping higher detection precision, and the target detection device is favorable for being applied to mobile terminal equipment.
Further, determining (e · Min) as the number of channels of the first feature map, e representing a preset expansion coefficient, e >1, and e, min being positive integers, min representing the number of channels of the feature map; performing (e.Min) times of M multiplied by M convolution on the feature mapping and performing batch normalization by adopting the following formula, wherein M is a positive integer, e represents a preset expansion coefficient, e is greater than 1, e is a positive integer, and Min represents the number of channels of the feature mapping;
z = s·(w·x + b - m)/δ + t
performing limited linear processing on the output data after batch processing normalization to obtain the first feature mapping; wherein z is the first feature mapping, w is a weight parameter of the feature mapping, b is a bias parameter of the image data, x is a feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter. By the technical scheme provided by the embodiment of the invention, when the CNN network is adopted for image processing, the batch processing normalization layer and the convolution layer associated with the batch processing normalization layer can be merged, so that multiplication and division operations can be reduced, the calculation complexity of feature extraction is reduced, and the calculation scale is reduced.
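A small sketch of this offline merging of a convolution layer with its associated batch normalization layer, assuming PyTorch tensors for the convolution weight and the batch normalization parameters; in a framework such as PyTorch, δ corresponds to sqrt(running_var + eps), and the function and variable names below are illustrative.

```python
import torch
import torch.nn as nn

def fold_batchnorm(w, b, m, delta, s, t):
    """Merge a convolution y = w*x + b with batch normalization z = s*(y - m)/delta + t
    into a single convolution z = w_new*x + b_new (the merged formula above).
    w has shape [out_ch, in_ch, k, k]; b, m, delta, s, t are per-output-channel tensors."""
    w_new = w * (s / delta).reshape(-1, 1, 1, 1)
    b_new = s * (b - m) / delta + t
    return w_new, b_new

# Example: fold a trained 1 x 1 convolution and its batch normalization layer offline.
conv, bn = nn.Conv2d(32, 192, kernel_size=1, bias=True), nn.BatchNorm2d(192)
w_new, b_new = fold_batchnorm(conv.weight.data, conv.bias.data,
                              bn.running_mean, torch.sqrt(bn.running_var + bn.eps),
                              bn.weight.data, bn.bias.data)
```

Because w_new and b_new depend only on trained, fixed parameters, they can be computed once offline, replacing the convolution followed by batch normalization with a single convolution at inference time.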
Drawings
FIG. 1 is a schematic diagram of a depth separable convolution module of the MobileNet network according to the prior art;
FIG. 2 is a schematic structural diagram of a convolution device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific structure of the convolution device shown in FIG. 2;
FIG. 4 is a schematic diagram of a functional decomposition of the convolution device shown in FIG. 3;
FIG. 5 is a schematic diagram of another embodiment of the convolution device shown in FIG. 2;
fig. 6 is a schematic structural diagram of a CNN network according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an image object detection apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a classification network according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating a method for extracting features of an image according to an embodiment of the present invention;
fig. 10 is a flowchart illustrating an image target detection method according to an embodiment of the present invention.
Detailed Description
As mentioned in the background, the prior art still has various disadvantages, and needs to optimize the convolution processing of the image and the target detection method.
Specifically, deep learning methods based on the Convolutional Neural Network (CNN) can be applied to the field of target detection. One category comprises detection methods based on target candidate windows, typically represented by the Faster Regions with CNN (Faster R-CNN) detection method. Its main principle is to use a Region Proposal Network (RPN) on shared image features to compute a number of target candidate windows, and then to classify and regress the feature information in the target candidate windows to obtain target category information and position information, thereby completing the target detection task.
The Faster R-CNN-based detection method can achieve high detection accuracy. However, because acquiring the target candidate windows relies on a Region Proposal Network (RPN), the detection time is long, and the method is not suitable for applications with high real-time requirements.
The other category comprises candidate-window-independent (Proposal Free) detection methods, mainly the Single Shot MultiBox Detector (SSD) detection method and the You Only Look Once (YOLO) detection method. The SSD and YOLO detection methods do not need to compute target candidate windows separately and have no corresponding feature resampling process. During target detection, SSD and YOLO directly preset a number of anchor point windows (Anchor Box) with different scales and aspect ratios over the whole image area; at detection time the whole CNN network only needs a single forward pass, the confidence of the target category is then computed for each anchor point window, and offsets are adjusted relative to the anchor point windows to obtain accurate target positions. Compared with YOLO, the main difference of SSD is that it extracts more complete multi-scale image information for prediction, so SSD has higher detection accuracy.
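As a rough, hedged illustration of what presetting anchor point windows over the whole image area can look like, the following is generic SSD-style logic rather than a procedure taken from the patent; the scales and aspect ratios are arbitrary.

```python
import itertools

def make_anchor_windows(fmap_size, scales, aspect_ratios):
    """Place one set of anchor point windows (Anchor Box) with different scales and
    aspect ratios at every cell of a square feature map. Coordinates and sizes are
    expressed as fractions of the input image."""
    anchors = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size   # cell centre
        for s, ar in itertools.product(scales, aspect_ratios):
            w, h = s * ar ** 0.5, s / ar ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors

# e.g. a 5 x 5 feature map, two scales, three aspect ratios -> 5*5*2*3 = 150 anchor windows
boxes = make_anchor_windows(5, scales=[0.2, 0.4], aspect_ratios=[1.0, 2.0, 0.5])
```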
In the prior art, the YOLO-based detection method relies on only a small amount of image information for classification and regression, loses much of the available information, performs poorly on small targets, and has low target localization accuracy.
The SSD-based detection method uses multiple layers of image information for classification and regression; compared with YOLO, it works better on small targets and improves target localization accuracy. Specifically, when the SSD detector is used to detect targets, information from multiple layers may be selected to predict the preset anchor point windows based on the forward-propagating convolutional neural network, and post-processing such as Non-Maximum Suppression (NMS) is performed to obtain the final detection result. The predicted variables include the confidence of the target class and the offset of the target location. The classic SSD detector uses the Visual Geometry Group (VGG16) classification network of Oxford University as its basic CNN network; it has high computational complexity and is not suitable for mobile terminals or embedded devices.
Further, an improved SSD detector has been proposed in the industry. The improved SSD detector uses a mobile network (MobileNet) as its Base Network. The MobileNet network uses the depth separable convolution module shown in FIG. 1. The depth separable convolution module 100 includes a depth separation convolution module 101 and a 1 × 1 convolution module 102. The depth separation convolution module 101 consists of a 3 × 3 depth separation convolution layer, a batch normalization layer and a constrained linear unit layer; the 1 × 1 convolution module 102 consists of a 1 × 1 convolution layer, a batch normalization layer and a constrained linear unit layer. Compared with a standard convolution layer, the computational complexity of the depth separable convolution module 100 can typically be reduced by an order of magnitude, and a convolution network built from the depth separable convolution module 100 can still maintain high accuracy, as described in detail in reference [1]: Andrew G. Howard, Menglong Zhu, Bo Chen, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017. However, the complexity of the depth separable convolution module 100 still leaves room for simplification.
An embodiment of the present invention provides an image convolution apparatus, including: the channel expansion module is used for performing convolution operation on the feature mapping of the image data and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping; the depth separation convolution module is used for performing depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and the channel compression module is used for receiving the second feature mapping output by the depth separation convolution module, carrying out convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, wherein the channel number of the third feature mapping is smaller than that of the first feature mapping.
The technical scheme provided by the embodiment of the invention can carry out convolution processing on the feature mapping of the image data, and after the channel expansion module expands the number of the channels of the data, the depth separation convolution operation is carried out based on the depth separation convolution module, so that more feature information can be extracted, the number of the channels of the third feature mapping obtained after the operation is compressed, the convolution operation scale can be reduced under the condition of keeping higher detection precision, the convolution operation complexity is reduced, and the possibility is provided for realizing light-weight feature extraction on the mobile terminal.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 2 is a schematic structural diagram of a convolution device according to an embodiment of the present invention. The convolution device 200 may be used in a CNN network as a convolution layer of the CNN network and performs convolution operation on input data. The convolution device 200 may include a channel expansion module 201, a depth separation convolution module 202, and a channel compression module 203.
In an implementation, the channel expansion module 201 may be configured to perform a convolution operation on a feature mapping (also called a feature map) of the image data input to the convolution device 200. The feature mapping is obtained by convolving the original image data, and its dimensions are [height, width, number of channels]. Typically, the number of channels of the feature mapping is much higher than the number of channels of the image data.
Those skilled in the art understand that, for the CNN network, the convolutional layer parameters include the number of convolutional kernels, step size and padding (padding), which together determine the size of the feature map output by the convolutional layer, and are important hyper-parameters of the CNN network. Where the number of convolution kernels can be specified as an arbitrary value smaller than the size of the input image, the larger the number of convolution kernels, the more complex the extractable input features. The number of channels of the first feature mapping is increased, and more feature information of the image can be extracted beneficially. To extract more feature information of the image data, the channel expansion module 201 may expand the number of channels of the data (i.e., the first feature map) obtained by convolution to obtain the first feature map with more feature information.
The depth separation convolution module 202 may receive the first feature map from the channel expansion module 201, and perform depth separation convolution on the first feature map output by the channel expansion module 201 to obtain a second feature map. The specific convolution operation step of the deep separation convolution can be referred to in reference [1], and is not described in detail here.
The channel compression module 203 may receive the second feature map output by the depth separation convolution module 202, perform convolution operation on the second feature map, and compress the output channel number of the data after the convolution operation to obtain a third feature map, so that the channel number of the third feature map is smaller than the channel number of the feature map, so as to extract significant feature information, and reduce the dimensionality of features (i.e., the third feature map) to reduce the amount of computation.
As a non-limiting example, referring to fig. 3, the channel expansion module 201 may include: a first convolution layer submodule 2011, a first batch normalization layer submodule 2012, and a first restricted linear unit layer submodule 2013.
Specifically, the first convolution layer sub-module 2011 may perform M × M convolution (e · Min) times on the feature map input to the channel expansion module 201, where M is a positive integer, e represents a preset expansion coefficient, e >1, e and Min are positive integers, and Min represents the number of channels of the feature map of the image data. In general, M =1, the feature map is convolved point by point, which is beneficial to reduce the computational complexity. Also, the first convolution layer sub-module 2011 may determine (e · Min) as the number of output channels of the first convolution layer sub-module 2011. Then, the output result of the first convolution layer submodule 2011 is input to the first Batch Normalization layer submodule 2012, so as to perform Batch Normalization (BN) on the output result of the first convolution layer submodule 2011. The first restricted linear unit layer submodule 2013 may be configured to perform restricted linear processing on the data output by the first batch normalization layer submodule 2012 to obtain the first feature map, where the number of channels of the first feature map is (e · Min).
In a specific implementation, the depth separation convolution module 202 may include: a depth-separated convolutional layer submodule 2021, a second batch normalization layer submodule 2022, and a second limited linear cell layer submodule 2023. In particular, the depth-separation convolution layer module 2021 may be configured to perform N × N depth-separation convolution on the first feature map, where N > M and N is a positive integer, e.g., N =3. The second batch normalization layer submodule 2022 may be configured to perform batch normalization on the convolution result obtained by the depth separation convolution layer submodule 2021; the second restricted linear unit layer submodule 2023 may be configured to perform restricted linear processing on the data obtained by the second batch processing normalization layer submodule 2022 to obtain the second feature map. The depth separation convolution module 202 can reduce the computational complexity of the convolution device 200 while maintaining a higher degree of accuracy.
In a specific implementation, the channel compression module 203 may include: a second convolutional layer submodule 2031 and a third batch normalization layer submodule 2032. Specifically, the second convolutional layer submodule 2031 may be configured to determine (e · Min) as the number of input channels of the second convolutional layer submodule 2031, and perform Mout M × M convolutions on the second feature map. Preferably, M =1. The third batch normalization layer sub-module 2032 may be configured to perform batch normalization on the convolution result output by the second convolution layer sub-module 2031 to obtain a third feature map, where the number Mout of channels of the third feature map is smaller than the number (e · Min) of channels of the first feature map.
Further, the convolution apparatus 200 may further include: a residual module 204. In a specific implementation, when the number of channels of the feature map of the image data input to the channel expansion module 201 is equal to the number of channels of the third feature map output by the channel compression module 203, the residual module 204 may be configured to calculate the sum of each data element of the feature map and each data element of the third feature map. As a variation, when the number of channels of the feature map input to the channel expansion module 201 is not equal to the number of channels of the third feature map output by the channel compression module 203, the convolution device 200 does not include the residual module 204. Those skilled in the art will understand that the residual module 204 can reduce the training difficulty of the CNN network, improve the generalization ability of the model, improve the efficiency of back-propagation in deep neural networks, and effectively avoid the vanishing gradient problem.
As a preferred embodiment, M =1,n =3. At this time, the function of each module and/or sub-module in the convolution device 200 may be as shown in fig. 4. Referring to fig. 4, the channel expansion module 201 may be used to perform 1 × 1 convolution, batch normalization, and constrained linear processing; the depth separation convolution module 202 may be configured to perform 3 × 3 depth separation convolution, batch normalization, and constrained linear processing; the channel compression module 203 may be used to perform 1 x 1 convolution and batch normalization. The convolution device 200 may include the residual module 204 when the number of channels of the feature map of the image data is equal to the number of channels of the third feature map. The residual module 204 may add each data element of the feature map and the data element of the third feature map output by the channel compression module 203 to obtain an output result of the convolution device 200.
In a specific implementation, the data dimensions of the feature map of the image data input to the convolution device 200 may be three-dimensional, [Fh, Fw, Min]. Fh represents the height of the feature map, Fw represents the width of the feature map, Min represents the number of channels of the feature map, and Fh, Fw and Min are positive integers. If the expansion coefficient is e, e > 1, and the data dimensions of the feature map of the image data input to the convolution device 200 are [Fh, Fw, Min], then the dimensions of the first convolution layer sub-module 2011 in the channel expansion module 201 can be represented as [1, 1, Min, e × Min]. With reference to fig. 4, the data dimensions of the first feature map obtained by performing 1 × 1 convolution on the feature map of the image data are [Fh, Fw, e × Min], so the number of channels input to the first convolution layer sub-module 2011 is expanded by the expansion coefficient. The dimensions of the second convolution layer sub-module 2031 in the channel compression module 203 are [1, 1, e × Min, Mout], where Mout is a positive integer; the data dimensions of the third feature mapping obtained by performing 1 × 1 convolution on the second feature mapping output by the depth separation convolution module 202 are then [Fh, Fw, Mout], and the number of output channels is compressed. In addition, the channel compression module 203 omits the constrained linear unit layer sub-module, because more feature information can be retained by using a linear mapping without constrained linear processing.
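A short, assumption-based check of these dimensions using the ConvDevice sketch given earlier (PyTorch stores tensors as [batch, channels, height, width], so [Fh, Fw, Min] appears here channels-first; the concrete numbers are illustrative):

```python
import torch

x = torch.randn(1, 32, 19, 19)                    # feature map with [Fh, Fw, Min] = [19, 19, 32]
block = ConvDevice(min_ch=32, mout_ch=64, e=6, M=1, N=3)

expanded = block.expand(x)
assert expanded.shape[1] == 6 * 32                # first feature mapping has e * Min = 192 channels
out = block(x)
assert out.shape == (1, 64, 19, 19)               # third feature mapping: [Fh, Fw, Mout] = [19, 19, 64]
```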
Further, a multiplicative coefficient β, β > 0, may also be set. The data dimensions of the convolution device 200 can be adjusted using the multiplicative coefficient β. Specifically, the numbers of input channels and output channels of all modules (including sub-modules) of the convolution device 200 can be multiplied by the multiplicative coefficient, that is, the dimensions of the output data of a given module become [Fh, Fw, Mout × β]. When β = 1, this can be regarded as the standard case. When β changes, the number of parameters of the convolution device 200 also changes, and the corresponding amount of computation changes as well. In a specific implementation, the value of β may be determined according to a trade-off between the model size, the computational complexity and the recognition accuracy of the convolution device 200 and the CNN network device in which it is located.
Further, the convolution device 200 includes many convolution operations and batch normalization operations on the convolution operation results, and multiplication and division operations are required to be performed when the convolution operations are implemented, which is time-consuming. Considering that the parameters of each batch normalization layer sub-module are fixed after the convolution device 200 completes training, the following formula can be used:
y = w·x + b, z = s·(y - m)/δ + t
w is a weight parameter of the sub-module used for convolution operation, b is a bias parameter of the sub-module used for convolution operation, m is a mean value parameter of each batch processing normalization layer sub-module after training, delta is a standard deviation parameter of each batch processing normalization layer sub-module after training, s is a scale parameter of each batch processing normalization layer sub-module after training, and t is an offset parameter of each batch processing normalization layer sub-module after training. Therefore, in order to simplify the operation complexity, the convolution operation and the batch normalization operation can be combined. Specifically, the formula (1) is a convolution operation formula, where x is the image data, and y is an output result of the sub-module after performing convolution operation. And (3) a combination result is shown as a formula (3), wherein z is output data of the first convolution batch processing layer submodule and is output of the submodule obtained by combining convolution operation and batch processing normalization. With equation (3), the parameter calculation can be done off-line, i.e.:
y=w·x+b (1)
z = s·(y - m)/δ + t (2)
z = (s·w/δ)·x + s·(b - m)/δ + t (3)
in specific implementation, when convolution and batch normalization are performed by using formula (3), w represents a weight parameter corresponding to each module or sub-module, z represents the merging result, b represents a bias parameter determined by the feature mapping, and m, δ, s, and t are fixed values and can represent preset parameters of each module or sub-module. m represents a preset mean parameter, δ represents a preset standard deviation parameter, s represents a preset scale parameter, and t represents a preset offset parameter.
Based on the above optimization method, the convolution device 200 can be simplified. Specifically, referring to fig. 5, the channel expansion module 201 may include: a first convolution batch processing layer sub-module 2011' and a first restricted linear unit layer sub-module 2012'.
In a specific implementation, the first convolution batch layer sub-module 2011' performs M × M convolution (e · Min) times on the image data input to the channel expansion module and performs batch normalization, where M is a positive integer, e represents a preset expansion coefficient, e >1, e and Min are positive integers, and Min represents the number of channels of the feature map, and then (e · Min) may be determined as the number of output channels of the first convolution batch layer sub-module:
z = s·(w·x + b - m)/δ + t
in a specific implementation, z is output data of the first convolution batch layer sub-module 2011', w is a weight parameter of the first convolution batch layer sub-module corresponding to the feature mapping, b is a bias parameter of the first convolution batch layer sub-module 2011' corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter of the channel expansion module 201, δ is a preset standard deviation parameter of the channel expansion module 201, s is a preset scale parameter of the channel expansion module 201, and t is a preset bias parameter of the channel expansion module 201.
The first constrained linear unit layer sub-module 2012 'may be configured to perform constrained linear processing on the output data of the first convolutional batch layer sub-module 2011' to obtain the first feature map.
In a specific implementation, the depth separation convolution module 202 may include: the deep split convolution batch layer sub-module 2021 'and the second constrained linear cell layer sub-module 2022'.
Specifically, the deep separation convolution batch layer sub-module 2021' may be configured to perform N × N deep separation convolution and batch normalization on the data input to the deep separation convolution module by using the following formula, where N > M, and N is a positive integer:
z1 = s1·(w1·x1 + b1 - m1)/δ1 + t1
wherein z1 is the second feature mapping, w1 is a weight parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, x1 is the first feature mapping, b1 is a bias parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, m1 is a preset mean parameter of the depth separation convolution module, δ1 is a preset standard deviation parameter of the depth separation convolution module, s1 is a preset scale parameter of the depth separation convolution module, and t1 is a preset offset parameter of the depth separation convolution module. The second constrained linear unit layer submodule 2022' may be configured to perform constrained linear processing on the output data of the depth separation convolution batch processing layer submodule to obtain the second feature map.
In a specific implementation, the channel compression module 203 may include: the second convolution batch layer sub-module 2031'. Specifically, the second convolution batch layer sub-module 2031' may be configured to determine (e · Min) as the number of input channels of the second convolution batch layer sub-module, and perform Mout times of M × M convolution on the data input to the channel compression module and perform batch normalization using the following formula, where Mout is a positive integer and represents the number of output channels of the channel compression module,
z2 = s2·(w2·x2 + b2 - m2)/δ2 + t2
wherein z2 is the third feature mapping, w2 is a weight parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, x2 is the second feature mapping, b2 is a bias parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, m2 is a preset mean parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, δ2 is a preset standard deviation parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, s2 is a preset scale parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, and t2 is a preset offset parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping. When M is 1, the M × M convolution is a point-by-point convolution, which reduces the amount of computation.
In a specific implementation, the convolution apparatus 200 may further include: a residual block 204. Specifically, when the number of channels of the feature map input to the channel expansion module 201 is equal to the number of channels of the output data of the channel compression module 203, the residual module 204 may calculate the sum of each data element of the feature map and each data element of the output data.
As a preferred embodiment, M is 1,N is 3, which may specifically refer to the embodiment shown in fig. 4 and is not described herein again.
Further, when the convolution device 200 does not include the residual error module 204, the convolution device 200 may further include a point-by-point convolution module (not shown). The point-by-point convolution module may be located after the channel compression module 203, and performs point-by-point convolution on the data output by the channel compression module 203. As a variation, when the convolution device 200 includes the residual block 204, the convolution device 200 may further include a point-by-point convolution block (not shown) located after the residual block 204. The point-by-point convolution module may perform point-by-point convolution on the data output by the residual error module 204, so as to further reduce the dimension of the output data and reduce the computational complexity.
Fig. 6 is a schematic structural diagram of a CNN network device according to an embodiment of the present invention. Referring to fig. 6, the CNN network device 300 may include an input layer module 301, a first convolution layer module 302 connected to the input layer module 301, and the convolution device 200 shown in fig. 2 to 5. The convolution device 200 can perform convolution operation on the image data output by the first convolution layer module 302 to extract feature information and reduce data dimensionality.
In a specific implementation, the CNN network device 300 may further include a second convolutional layer module 303. The second convolutional layer module 303 may receive the image data output by the convolutional device and perform point-by-point convolution on the image data.
In a specific embodiment, the second convolutional layer module 303 may be further connected to a third convolutional layer module 304. The third convolutional layer module 304 may include a plurality of cascaded third convolutional layer submodules 3041, and each third convolutional layer submodule 3041 may be configured to perform an N × N convolution or an M × M convolution with a sliding step size P, where P is a positive integer greater than 1 and M and N are positive integers. For example, the third convolutional layer submodule 3041 may perform a 3 × 3 convolution with a sliding step size of 2. The sliding step size is the distance the convolution kernel moves between two adjacent scanning positions on the feature map: when the sliding step size is 1, the kernel scans the elements of the feature map one by one, and when the sliding step size is n, the kernel skips (n − 1) pixels between scans.
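As a rough numeric illustration of the sliding step size, the short sketch below (which assumes a zero padding of one pixel around the feature map, a detail not specified above) shows how a 3 × 3 convolution with a sliding step size of 2 approximately halves the spatial size of the feature map at each stage.

```python
def conv_output_size(size, kernel=3, stride=2, pad=1):
    # Spatial size after an NxN convolution with the given sliding step size.
    # With kernel=3, stride=2, pad=1 the spatial scale is roughly halved.
    return (size + 2 * pad - kernel) // stride + 1

for h in (300, 150, 75, 38):
    print(h, "->", conv_output_size(h))   # 300->150, 150->75, 75->38, 38->19
```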
Further, the CNN network device 300 may further include a feature extraction layer module 305. The feature extraction layer module 305 may include a plurality of cascaded feature extraction submodules 3051, and each feature extraction submodule 3051 may be configured to receive the convolution results output by the second convolutional layer module 303 and by each third convolutional layer submodule 3041, and to perform an N × N convolution on each convolution result to extract feature information of the image data, for example with N = 3.
Those skilled in the art will understand that, in the CNN network device 300, the convolution operation and batch normalization may also be combined by using the above optimization method to reduce the computational complexity, and will not be described in detail here.
Fig. 7 is a schematic structural diagram of an image object detection apparatus according to an embodiment of the present invention. The object detection apparatus 400 enables multi-target detection to be performed on a mobile terminal.
Specifically, the object detection apparatus 400 may include: a feature extraction module 401, adapted to extract feature information of image data based on the CNN network apparatus 300 shown in fig. 6; a prediction module 402, adapted to predict preset anchor point windows based on the feature information to obtain a prediction result; and a suppression module 403, adapted to perform non-extremum suppression (Non-Maximum Suppression, NMS) processing on the prediction result to obtain each detection target.
Those skilled in the art will understand that a target detection device 400 based on a CNN network device generally takes a basic CNN network, pruned from a classification network used for feature extraction, as the feature extraction module 401. Specifically, during target detection, the outputs of several feature extraction submodules of the forward-propagated basic CNN network are selected to predict the preset anchor point windows; the predicted variables include the confidence of each target class and the offset of the target position, and non-extremum suppression is then performed to obtain the final detection result.
Fig. 8 is a schematic structural diagram of a classification network according to an embodiment of the present invention. As shown in fig. 8, the classification network 500 is used to train an underlying CNN network 501. As a non-limiting example, the underlying CNN network 501 may include a 3 × 3 convolutional layer module 5011, a plurality of cascaded convolution devices 5012, and a 1 × 1 convolutional layer module 5013. In a specific implementation, when the depth separation convolution module in a cascaded convolution device 5012 performs the N × N depth separation convolution, the sliding step size may be 1 or 2; a sliding step size larger than 1 reduces the spatial scale of the depth separation convolution result.
The underlying CNN network 501 may be pre-trained on the ImageNet data set; the pre-training procedure follows the prior art and is not described in detail here.
After the classification network is pre-trained, the underlying CNN network 501 may be pruned out for use in detection devices. The number of convolution devices 5012 in the underlying CNN network 501 may be adjusted according to the specific task. It should be noted that, in order to obtain high-resolution convolution features, the output data of some of the cascaded convolution devices 5012 may be used as high-resolution convolution feature layers by a subsequent processing module (not shown).
Those skilled in the art will appreciate that after pre-training is completed, and after the underlying CNN network 501 is obtained, other modules may be added to obtain the target detection apparatus. The target detection device may then be trained.
The training objective function of the object detection device can cover a plurality of object classes, so that objects of multiple classes are detected simultaneously. Specifically, let f_ij^p be an indicator of the matching result between the i-th anchor window and the j-th annotation window of target class p: f_ij^p is 1 if the overlap ratio of the two windows is higher than the threshold T0, and 0 otherwise. The matching policy allows Σ_i f_ij^p ≥ 1, so that multiple anchor windows can match the same annotation window. The trained global target loss function is a weighted sum of the confidence loss function and the localization loss function, as shown in equation (4):
L(f, c, t, g) = (1 / N) · [ L_conf(f, c) + α · L_loc(f, t, g) ]      (4)
where N is the number of matched anchor windows; if N is 0, the target loss is 0. α is the weight coefficient of the localization loss. f denotes the indicator vector, c the confidence vector, t the prediction window position vector, and g the target annotation window vector; L_conf(f, c) denotes the confidence loss function and L_loc(f, t, g) denotes the localization loss function.
In a specific implementation, the confidence loss function computes a softmax loss over the confidences of the plurality of classes, as shown in equations (5) and (6):
L_conf(f, c) = − Σ_{i ∈ Pos} f_ij^p · log(ĉ_i^p) − Σ_{i ∈ Neg} log(ĉ_i^0)      (5)

ĉ_i^p = exp(c_i^p) / Σ_q exp(c_i^q)      (6)
where log denotes the logarithmic function, exp denotes the exponential function, and ĉ_i^p is the confidence that the i-th prediction window belongs to target class p. Pos denotes the positive sample set and Neg denotes the negative sample set; an anchor window whose overlap ratio with all target annotation windows is less than T0 is a negative sample. p = 0 denotes the background class, i.e. the negative sample class.
In a specific implementation, the localization loss function quantitatively estimates the difference between the prediction window and the target annotation window. Before the localization loss function is calculated, the target annotation window is encoded with respect to the anchor point window, as shown in equation (7):
ĝ_j^cx = (g_j^cx − a_i^cx) / a_i^w,  ĝ_j^cy = (g_j^cy − a_i^cy) / a_i^h,  ĝ_j^w = log(g_j^w / a_i^w),  ĝ_j^h = log(g_j^h / a_i^h)      (7)
where (a_i^cx, a_i^cy, a_i^w, a_i^h) are the abscissa and ordinate of the center position, the width and the height of the i-th anchor point window; (g_j^cx, g_j^cy, g_j^w, g_j^h) are the abscissa and ordinate of the center position, the width and the height of the j-th target annotation window; and (ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h) are the abscissa and ordinate of the center position, the width and the height of the j-th target annotation window after encoding.
The localization loss function may then be calculated using the smoothed first-order norm (smooth L1), as shown in equation (8):
L_loc(f, t, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} f_ij^p · H_L1(t_i^m − ĝ_j^m)      (8)
where m ∈ {cx, cy, w, h} indexes the window position parameters, namely the abscissa and ordinate of the center position, the width and the height; t_i^m is the m-th position parameter of the i-th prediction window, and ĝ_j^m is the m-th position parameter of the j-th target annotation window after encoding. The smoothed first-order norm H_L1 is given by equation (9):

H_L1(x) = 0.5 · x², if |x| < 1;  |x| − 0.5, otherwise      (9)
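The sketch below ties equations (4) and (7) to (9) together: encoding a target annotation window with respect to an anchor window, the smoothed first-order norm, the localization loss over matched anchors, and the weighted total loss. The (cx, cy, w, h) window layout, the matching list and all names are assumptions made for illustration only.

```python
import numpy as np

def encode(anchor, gt):
    # Equation (7): encode an annotation window (cx, cy, w, h) w.r.t. an anchor window.
    acx, acy, aw, ah = anchor
    gcx, gcy, gw, gh = gt
    return np.array([(gcx - acx) / aw,
                     (gcy - acy) / ah,
                     np.log(gw / aw),
                     np.log(gh / ah)])

def smooth_l1(x):
    # Equation (9): smoothed first-order norm H_L1.
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def localization_loss(pred, anchors, gts, matches):
    # Equation (8): sum of smooth L1 terms over matched (anchor, annotation) pairs.
    loss = 0.0
    for i, j in matches:                         # i: anchor index, j: annotation index
        target = encode(anchors[i], gts[j])
        loss += smooth_l1(pred[i] - target).sum()
    return loss

def total_loss(conf_loss, loc_loss, num_matched, alpha=1.0):
    # Equation (4): weighted sum, zero when no anchor window is matched.
    if num_matched == 0:
        return 0.0
    return (conf_loss + alpha * loc_loss) / num_matched

anchors = np.array([[0.5, 0.5, 0.2, 0.2], [0.3, 0.3, 0.1, 0.2]])
gts = np.array([[0.52, 0.48, 0.25, 0.18]])
pred = np.zeros((2, 4))                          # predicted offsets t_i
matches = [(0, 0)]                               # anchor 0 matches annotation 0
loc = localization_loss(pred, anchors, gts, matches)
print(total_loss(conf_loss=1.2, loc_loss=loc, num_matched=len(matches)))
```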
Those skilled in the art will appreciate that training the target detection device may use the training data as input, propagate it forward through the entire network, calculate the loss value according to equation (4), and then update the model parameters of the whole network by back propagation. In a specific implementation, iterative optimization may be performed using the Stochastic Gradient Descent (SGD) method to obtain the model parameters. After training is completed, target detection may be performed on new images using the trained model parameters.
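A minimal sketch of one stochastic gradient descent update follows; plain SGD with momentum, and the learning rate and momentum values, are illustrative assumptions rather than the training configuration actually used.

```python
import numpy as np

def sgd_update(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum step over a list of parameter arrays."""
    for p, g, v in zip(params, grads, velocity):
        v *= momentum          # accumulate a running direction of descent
        v -= lr * g            # move against the gradient of the loss (equation (4))
        p += v                 # update the model parameter in place
    return params, velocity

# Toy example: two parameter tensors and their gradients.
params = [np.ones((3, 3)), np.zeros(3)]
velocity = [np.zeros_like(p) for p in params]
grads = [np.full((3, 3), 0.5), np.array([0.1, -0.2, 0.3])]
params, velocity = sgd_update(params, grads, velocity)
print(params[1])   # parameters moved opposite to their gradients
```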
In an implementation, with reference to fig. 6 and 7, each of the third convolutional layer sub-modules 3041 in the third convolutional layer module 304 can be used to perform 3 × 3 convolution with a sliding step size of 2 and point-by-point convolution, so that the data dimension of the third convolutional layer sub-module 3041 is gradually reduced, and the output result corresponds to different data dimensions. Accordingly, the extracted feature layer sub-module 3051 connected thereto is used to perform a 3 × 3 convolution, so that the prediction data in the prediction module 402 can be generated. The prediction data includes a confidence of the object class and an offset of the object location.
For example, taking fig. 6 as an example, the output data Xi of a third convolutional layer submodule 3041 has data dimensions [Hi, Wi, Ci], the dimension values respectively representing the height, width and number of channels of the output data Xi; the kernel of the corresponding feature extraction submodule 3051 is Fi, with data dimensions [Kh, Kw, Ci, p + 4], where Kh, Kw, Ci and (p + 4) respectively represent the height, width, number of input channels and number of output channels of the feature extraction submodule 3051, p represents the number of object categories, and 4 represents the four position parameters of an object. Convolving Xi with Fi yields the prediction data Yi, with data dimensions [Hi, Wi, p + 4].
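The dimension bookkeeping in this example can be checked with a small sketch: a naive "same"-padded convolution of Xi with the kernel Fi yields prediction data Yi of dimensions [Hi, Wi, p + 4]. The padding behaviour and the concrete sizes below are assumptions; the sketch only verifies the shapes, not a production implementation.

```python
import numpy as np

def conv_same(x, f):
    """Naive 'same'-padded convolution of x (H, W, Cin) with f (Kh, Kw, Cin, Cout)."""
    H, W, Cin = x.shape
    Kh, Kw, _, Cout = f.shape
    ph, pw = Kh // 2, Kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    y = np.zeros((H, W, Cout))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + Kh, j:j + Kw, :]                  # (Kh, Kw, Cin)
            y[i, j, :] = np.tensordot(patch, f, axes=([0, 1, 2], [0, 1, 2]))
    return y

Hi, Wi, Ci = 19, 19, 128        # output data Xi of one third convolutional layer submodule
p = 20                          # number of object categories (e.g. VOC)
Xi = np.random.randn(Hi, Wi, Ci)
Fi = np.random.randn(3, 3, Ci, p + 4)   # feature extraction submodule kernel
Yi = conv_same(Xi, Fi)
print(Yi.shape)                 # (19, 19, 24) = [Hi, Wi, p + 4]
```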
For the selected third convolution feature layer submodules 3041, since the objects in an actual scene have different scales and aspect ratios, several anchor windows may be generated at each location. The scale parameter s_k of the target can therefore be calculated from the index k of the selected third convolution feature layer submodule 3041, as shown in equation (11):

s_k = s_min + (s_max − s_min) · (k − 1) / (m − 1),  k ∈ [1, m]      (11)

where s_min is the minimum scale, s_max is the maximum scale, m is the number of selected third convolution feature layer submodules 3041, and s_k is the target scale of the k-th layer among the selected third convolution feature layer submodules.
Further, a sequence of aspect ratios a_r ∈ {1, 2, 3, 1/2, 1/3} may be set, in which case the width of any anchor point window of the k-th third convolution feature layer submodule 3041 is s_k · √a_r and the height is s_k / √a_r.
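A short sketch of equation (11) and of the width/height rule above, generating per-layer anchor window shapes; the scale range, the number of selected layers and the function names are illustrative assumptions.

```python
import numpy as np

def layer_scale(k, m, s_min=0.2, s_max=0.9):
    # Equation (11): target scale for the k-th selected feature layer (k = 1..m).
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)

def anchor_shapes(s_k, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    # Width s_k * sqrt(a_r), height s_k / sqrt(a_r) for each aspect ratio a_r.
    return [(s_k * np.sqrt(a), s_k / np.sqrt(a)) for a in aspect_ratios]

m = 6                                      # assumed number of selected feature layers
for k in range(1, m + 1):
    s_k = layer_scale(k, m)
    print(f"layer {k}: scale {s_k:.2f}, first shapes {anchor_shapes(s_k)[:2]}")
```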
Fig. 9 is a schematic flowchart of a method for extracting features of an image according to an embodiment of the present invention. The feature extraction method may be performed by using the CNN network device shown in fig. 6. Specifically, the feature extraction method may include the steps of:
step S101, carrying out convolution operation on the feature mapping of the image data, and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping;
step S102, carrying out depth separation convolution on the first feature mapping to obtain a second feature mapping;
step S103, performing convolution operation on the second feature map, and compressing the number of channels of the data after the convolution operation to obtain a third feature map, so that the number of channels of the third feature map is smaller than the number of channels of the first feature map.
Specifically, in step S101, the image data may be convolved to obtain a feature map of the image, and then the feature map input to the convolution device may be convolved, and the number of channels of the convolved feature map may be expanded to obtain the first feature map.
In a specific implementation, (e · Min) may be determined as the number of channels of the first feature mapping, where e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the image data; (e · Min) M × M convolutions are performed on the image data to obtain a first convolution result, where M is a positive integer; the first convolution result is then batch-normalized to obtain a first normalization result; and the first normalization result is subjected to limited linear processing to obtain the first feature map.
As a variation, (e · Min) may be determined as the number of channels of the first feature mapping, where e denotes a preset expansion coefficient, e > 1, e and Min are positive integers, and Min denotes the number of channels of the feature mapping of the image data; (e · Min) M × M convolutions are then performed on the image data and batch normalization is applied using the following formula, where M is a positive integer;
z = s · ((w ∗ x + b) − m) / δ + t
performing limited linear processing on the output data after batch processing normalization to obtain the first feature mapping; wherein z is the first feature mapping, w is a weight parameter corresponding to the first feature mapping, b is a bias parameter corresponding to the first feature mapping, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter.
In step S102, a depth separation convolution may be performed on the first feature map to obtain a second feature map. Specifically, the first feature map may be subjected to N × N depth separation convolution to obtain a second convolution result, where N > M and N is a positive integer; carrying out batch processing normalization on the second convolution result to obtain a second normalization result; and performing limited linear processing on the second normalization result to obtain the second feature mapping.
As a variation, the first feature mapping may be subjected to N × N depth separation convolution and batch normalization using the following formula, where N > M and N is a positive integer;
z_1 = s_1 · ((w_1 ∗ x_1 + b_1) − m_1) / δ_1 + t_1
performing limited linear processing on the output data after batch normalization to obtain the second feature mapping; where z_1 is the second feature mapping, w_1 is the weight parameter determined based on the first feature mapping, x_1 is the first feature mapping, b_1 is the bias parameter determined based on the first feature mapping, m_1 is a preset mean parameter, δ_1 is a preset standard deviation parameter, s_1 is a preset scale parameter, and t_1 is a preset offset parameter.
In step S103, a convolution operation may be performed on the second feature map, and the number of channels of the data after the convolution operation is compressed to obtain a third feature map, so that the number of channels of the third feature map is smaller than the number of channels of the first feature map.
Specifically, mout may be determined as the number of channels of the third feature map; conducting Mout times of M multiplied by M convolution on the second feature mapping to obtain a third convolution result; and carrying out batch processing normalization on the third convolution result to obtain the third feature mapping.
As a variation, Mout may be determined as the number of channels of the third feature mapping, and Mout M × M convolutions and batch normalization may be performed on the second feature mapping using the following formula,
z_2 = s_2 · ((w_2 ∗ x_2 + b_2) − m_2) / δ_2 + t_2
where z_2 is the third feature mapping, w_2 is the weight parameter determined based on the second feature mapping, x_2 is the second feature mapping, b_2 is the bias parameter determined based on the second feature mapping, m_2 is a preset mean parameter, δ_2 is a preset standard deviation parameter, s_2 is a preset scale parameter, and t_2 is a preset offset parameter.
As a preferred embodiment, M = 1 and N = 3.
Further, when the number of channels of the feature map of the image data is equal to the number of channels of the third feature map, a sum of each data element of the feature map and each data element of the third feature map is calculated to obtain a fourth feature map.
Further, a point-by-point convolution may be performed on the fourth feature map to obtain a fifth feature map.
Further, a point-by-point convolution may be performed on the third feature map to obtain a sixth feature map.
Fig. 10 is a flowchart illustrating an image target detection method according to an embodiment of the present invention. The object detection method can be used for multi-object detection of image data and can be applied to mobile terminals. Specifically, the target detection method may include:
step S201, extracting feature information of the image data based on the feature extraction method of the image shown in fig. 9;
step S202, predicting a preset anchor point window based on the characteristic information to obtain a prediction result;
step S203, performing non-extremum suppression processing on the prediction result to obtain each detection target.
In a specific implementation, step S201 may be performed, namely, feature information of the image data is extracted according to the feature extraction method of the image shown in fig. 9.
In step S202, a preset anchor point window may be predicted based on the characteristic information to obtain a prediction result.
In step S203, non-extremum suppression processing may be performed on the prediction result to obtain each detection target.
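As an illustration of step S203, the sketch below implements a standard greedy non-extremum (non-maximum) suppression over prediction windows given as (x1, y1, x2, y2) boxes with scores; the box format and the overlap threshold are assumptions made for illustration.

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes, (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(boxes, scores, iou_threshold=0.45):
    # Keep the highest-scoring prediction window, discard overlapping ones, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(non_maximum_suppression(boxes, scores))   # e.g. [0, 2]
```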
For performance comparison, the target detection device provided by the embodiments of the present invention was trained and tested on the PASCAL VOC computer vision benchmark data set. In a specific implementation, the VOC2012trainval and VOC2007trainval data sets were used as the training set, and the VOC2007test data set was used as the test set. The input image data is 300 pixels in size, and 17 convolution devices with expansion coefficient e = 6 and multiplicative coefficients β = [1, 0.75] were used in the basic CNN network for the experimental simulation. It should be noted that the basic CNN network was trained on a single Titan X graphics processing unit (GPU).
In a specific implementation, the VOC data set contains 20 target classes, and the index used to evaluate detection performance is the mean Average Precision (mAP), as shown in equation (12):
AP = (1/11) · Σ_{r ∈ {0, 0.1, …, 1.0}} p_interp(r),  where p_interp(r) = max_{r̃ ≥ r} p(r̃);  mAP = (1/Q) · Σ_{q=1}^{Q} AP_q      (12)
where r denotes the recall, p(r) denotes the precision corresponding to a given recall, p_interp(r) denotes the maximum precision over all recalls greater than or equal to r, AP denotes the mean of the interpolated precisions computed at the recall levels {0, 0.1, …, 1.0}, and mAP denotes the average of the AP values over the multiple object classes, with Q = 20 detected classes.
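The evaluation index of equation (12) can be sketched as follows: the interpolated precision p_interp(r) is the maximum precision at any recall not smaller than r, AP averages it over the eleven recall levels {0, 0.1, …, 1.0}, and mAP averages the per-class AP values. The input format (per-class precision/recall curves) is an assumption made for illustration.

```python
import numpy as np

def average_precision(recalls, precisions):
    # Equation (12): 11-point interpolated AP.
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

def mean_average_precision(per_class_curves):
    # mAP: mean of per-class APs over the Q detected object classes.
    aps = [average_precision(r, p) for r, p in per_class_curves]
    return float(np.mean(aps))

# Toy precision/recall curves for two classes (values are made up).
curves = [
    (np.array([0.1, 0.4, 0.8, 1.0]), np.array([1.0, 0.9, 0.7, 0.5])),
    (np.array([0.2, 0.5, 0.9]),      np.array([0.8, 0.6, 0.4])),
]
print(mean_average_precision(curves))
```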
TABLE 1
Table 1 compares the performance of the target detection apparatus provided by the embodiments of the present invention with that of the conventional MobileNet-SSD detector. In the first embodiment of the present invention the multiplicative coefficient is β = 1, and in the second embodiment β = 0.75. The table shows that the mean average precision of the target detection device of the first embodiment is slightly lower than that of the MobileNet-SSD detector, but its model size (in megabytes, MB) is about half that of the MobileNet-SSD detector. The mean average precision of the second embodiment drops somewhat more, but its model size is about one third that of the MobileNet-SSD detector. The running speeds (in frames per second, fps) were also tested on a Central Processing Unit (CPU) of model i7 5930K and on a Titan X GPU, and in both cases the embodiments run faster than the MobileNet-SSD detector. The experiments show that the model size and the recognition accuracy can be balanced by adjusting the multiplicative coefficient.
Therefore, the embodiments of the present invention provide a convolution device with low computational complexity, and a convolutional neural network and a target detection device with low computational complexity can be obtained based on this convolution device.
Further, the embodiment of the present invention also discloses a storage medium, on which computer instructions are stored, and when the computer instructions are executed, the technical solutions of the methods described in the embodiments shown in fig. 9 and fig. 10 are executed. Preferably, the storage medium may include a computer-readable storage medium such as a Non-Volatile (Non-Volatile) memory or a Non-Transitory (Non-transient) memory. The storage medium may include ROM, RAM, magnetic or optical disks, etc.
Further, an embodiment of the present invention further discloses a terminal, which includes a memory and a processor, where the memory stores a computer instruction capable of running on the processor, and the processor executes the method technical solution described in the embodiments shown in fig. 1 to 5 when running the computer instruction.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (26)

1. An apparatus for convolving an image, comprising:
the channel expansion module is used for performing convolution operation on the feature mapping of the image data and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping; the channel expansion module comprises: the first convolution layer sub-module is used for determining (e & Min) as the number of output channels of the first convolution layer sub-module, and performing M multiplied by M convolution for (e & Min) times on the feature mapping input to the channel expansion module, wherein M, min is a positive integer, e represents a preset expansion coefficient, e is greater than 1, e is a positive integer, and Min represents the number of channels of the feature mapping; the first batch normalization layer submodule is used for carrying out batch normalization on the output result of the first convolution layer submodule; the first restricted linear unit layer submodule is used for performing restricted linear processing on the data output by the first batch normalization layer submodule to obtain the first feature mapping;
the depth separation convolution module is used for performing depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping;
and the channel compression module is used for receiving the second feature mapping output by the depth separation convolution module, carrying out convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, wherein the channel number of the third feature mapping is smaller than that of the first feature mapping.
2. The convolution device of claim 1, wherein the depth separation convolution module comprises:
a depth separation convolution layer sub-module, configured to perform nxn depth separation convolution on the first feature mapping, where N > M and N is a positive integer;
the second batch processing normalization layer submodule is used for carrying out batch processing normalization on the convolution result obtained by the depth separation convolution layer submodule;
and the second limited linear unit layer submodule is used for performing limited linear processing on the data obtained by the second batch processing normalization layer submodule to obtain the second feature mapping.
3. The convolution device of claim 2, wherein the channel compression module comprises: a second convolutional layer submodule for determining (e · Min) as the number of input channels of the second convolutional layer submodule, and performing Mout M × M convolution on the second feature mapping, where Mout is a positive integer and represents the number of output channels of the channel compression module;
and the third batch normalization layer submodule is used for carrying out batch normalization on the convolution result output by the second convolution layer submodule.
4. The convolution device of claim 1, wherein the channel expansion module comprises: a first convolution batch processing layer sub-module, configured to determine (e · Min) as the number of output channels of the first convolution batch processing layer sub-module, and perform (e · Min) M × M convolution on the feature mapping input to the channel expansion module and perform batch processing normalization by using the following formula, where M is a positive integer, e represents a preset expansion coefficient, e >1, and e, min are positive integers, min represents the number of channels of the feature mapping,
z = s · ((w ∗ x + b) − m) / δ + t
the first limited linear unit layer submodule is used for performing limited linear processing on the output data of the first volume batch processing layer submodule to obtain the first feature mapping;
wherein z is output data of the first convolution batch processing layer sub-module, w is a weight parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, b is a bias parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, δ is a preset standard deviation parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, s is a preset scale parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, and t is a preset bias parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping.
5. The convolution device of claim 4, wherein the depth separation convolution module comprises:
a deep separation convolution batch processing layer submodule for carrying out NxN deep separation convolution on the data input to the deep separation convolution module by adopting the following formula and carrying out batch processing normalization, wherein N is greater than M and is a positive integer,
z_1 = s_1 · ((w_1 ∗ x_1 + b_1) − m_1) / δ_1 + t_1
the second limited linear unit layer submodule is used for performing limited linear processing on the output data of the deep separation convolution batch processing layer submodule to obtain second feature mapping;
wherein z_1 is the second feature mapping, w_1 is the weight parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, x_1 is the first feature mapping, b_1 is the bias parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, m_1 is a preset mean parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, δ_1 is a preset standard deviation parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, s_1 is a preset scale parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping, and t_1 is a preset offset parameter of the depth separation convolution batch processing layer submodule determined based on the first feature mapping.
6. The convolution device of claim 5, wherein the channel compression module comprises: a second convolution batch processing layer sub-module for determining (e.Min) as the number of input channels of the second convolution batch processing layer sub-module, and performing Mout times of M × M convolution and batch processing normalization on the data input to the channel compression module by adopting the following formula, wherein Mout is a positive integer and represents the number of output channels of the channel compression module,
z_2 = s_2 · ((w_2 ∗ x_2 + b_2) − m_2) / δ_2 + t_2
wherein z_2 is the third feature mapping, w_2 is the weight parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, x_2 is the second feature mapping, b_2 is the bias parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, m_2 is a preset mean parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, δ_2 is a preset standard deviation parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, s_2 is a preset scale parameter of the second convolution batch processing layer submodule determined based on the second feature mapping, and t_2 is a preset offset parameter of the second convolution batch processing layer submodule determined based on the second feature mapping.
7. The convolution device of claim 2 or 3 or 5 or 6, wherein M = 1 and N = 3.
8. The convolution device of claim 1, further comprising:
a residual module, configured to calculate a sum of each data element of the feature map and each data element of the output data when the number of channels of the feature map input to the channel expansion module is equal to the number of channels of the output data of the channel compression module.
9. The convolution device according to any one of claims 1 to 6 or 8, further comprising:
and the point-by-point convolution module is suitable for performing point-by-point convolution on the data input to the point-by-point convolution module.
10. A CNN network device, comprising an input layer module, a first convolutional layer module connected to the input layer module, and further comprising:
convolution means for performing a convolution operation on a feature map of image data output by the first convolution layer module, the convolution means being the convolution means according to any one of claims 1 to 9.
11. The CNN network device of claim 10, further comprising:
and the second convolution layer module is used for receiving the third feature mapping output by the convolution device and performing point-by-point convolution on the third feature mapping.
12. The CNN network device of claim 11, further comprising:
a third convolutional layer module connected to the second convolutional layer module, the third convolutional layer module including a plurality of cascaded third convolutional layer sub-modules, each third convolutional layer sub-module being configured to perform nxn convolution or mxm convolution with a sliding step size of P, P being a positive integer greater than 1, and M, N being a positive integer.
13. The CNN network device of claim 12, further comprising:
and the feature layer extracting module comprises a plurality of cascaded feature layer extracting submodules, and each feature layer extracting submodule is used for receiving the convolution results output by the second convolution layer module and each third convolution layer submodule and carrying out NxN convolution on each convolution result so as to extract the feature information of the image data.
14. An object detection apparatus for an image, comprising:
a feature extraction module adapted to extract feature information of image data based on the CNN network device of any one of claims 10 to 13;
the prediction module is suitable for predicting a preset anchor point window based on the characteristic information to obtain a prediction result;
and the suppression module is suitable for carrying out non-extreme suppression processing on the prediction result to obtain each detection target.
15. An image feature extraction method applied to the convolution device according to any one of claims 1 to 9, comprising:
performing convolution operation on the feature map of the image data, and expanding the channel number of the feature map obtained by convolution to obtain a first feature map, including: determining (e.Min) as the channel number of the first feature mapping, wherein e represents a preset expansion coefficient, e >1, and e and Min are positive integers, and Min represents the channel number of the feature mapping; performing (e.Min) M × M convolution on the feature map to obtain a first convolution result, wherein M is a positive integer; carrying out batch processing normalization on the first convolution result to obtain a first normalization result; performing restricted linear processing on the first normalization result to obtain the first feature mapping;
performing depth separation convolution on the first feature mapping to obtain a second feature mapping;
and performing convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, so that the channel number of the third feature mapping is smaller than the channel number of the first feature mapping.
16. The feature extraction method of claim 15, wherein the depth separation convolution of the first feature map to obtain a second feature map comprises:
performing NxN deep separation convolution on the first feature mapping to obtain a second convolution result, wherein N is greater than M and is a positive integer;
carrying out batch processing normalization on the second convolution result to obtain a second normalization result;
and performing limited linear processing on the second normalization result to obtain the second feature mapping.
17. The feature extraction method of claim 16, wherein performing convolution operation on the second feature map and compressing the number of output channels of the convolution-operated data comprises:
determining Mout as the number of channels of the third feature mapping, wherein Mout is a positive integer;
conducting Mout times of M multiplied by M convolution on the second feature mapping to obtain a third convolution result;
and carrying out batch processing normalization on the third convolution result to obtain the third feature mapping.
18. The feature extraction method of claim 15, wherein performing a convolution operation on the feature map of the image data and expanding the number of channels of the feature map obtained by the convolution to obtain the first feature map comprises:
determining (e.Min) as the channel number of the first feature mapping, wherein e represents a preset expansion coefficient, e >1, and e and Min are positive integers, and Min represents the channel number of the feature mapping;
performing (e.Min) times of M multiplied by M convolution on the feature mapping by adopting the following formula, and performing batch processing normalization, wherein M is a positive integer;
z = s · ((w ∗ x + b) − m) / δ + t
performing limited linear processing on the output data after batch processing normalization to obtain the first feature mapping;
wherein z is the first feature mapping, w is a weight parameter determined by the feature mapping, b is a bias parameter corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter.
19. The feature extraction method of claim 18, wherein the depth separation convolution of the first feature map to obtain a second feature map comprises:
carrying out NxN depth separation convolution on the first feature mapping by adopting the following formula and carrying out batch normalization, wherein N is greater than M and is a positive integer;
z_1 = s_1 · ((w_1 ∗ x_1 + b_1) − m_1) / δ_1 + t_1
performing limited linear processing on the output data after batch processing normalization to obtain the second feature mapping;
wherein z_1 is the second feature mapping, w_1 is the weight parameter corresponding to the first feature mapping, x_1 is the first feature mapping, b_1 is the bias parameter corresponding to the first feature mapping, m_1 is the preset mean parameter, δ_1 is the preset standard deviation parameter, s_1 is the preset scale parameter, and t_1 is the preset offset parameter.
20. The feature extraction method of claim 19, wherein performing a convolution operation on the second feature map and compressing the number of output channels of the convolved data comprises:
determining Mout as the number of channels of the third feature mapping, wherein Mout is a positive integer and represents the number of output channels of a channel compression module;
the second feature map is convolved M x M times and normalized in batch using the following formula,
z_2 = s_2 · ((w_2 ∗ x_2 + b_2) − m_2) / δ_2 + t_2
wherein z is 2 For the third feature mapping, w 2 Weight parameter, x, determined for the second feature map 2 For the second feature mapping, b 2 Bias parameters, m, determined for the second feature map 2 For a predetermined mean parameter, δ 2 Is a preset standard deviation parameter, s 2 Is a predetermined scale parameter, t 2 Is a preset offset parameter.
21. The feature extraction method according to claim 16 or 17 or 19 or 20, wherein M = 1 and N = 3.
22. The feature extraction method according to claim 15, characterized by further comprising:
when the number of channels of the feature map is equal to the number of channels of the third feature map, calculating a sum of each data element of the feature map and each data element of the third feature map to obtain a fourth feature map.
23. The feature extraction method according to claim 22, further comprising:
and performing point-by-point convolution on the fourth feature map to obtain a fifth feature map.
24. The feature extraction method according to any one of claims 15 to 20, further comprising:
and performing point-by-point convolution on the third feature map to obtain a sixth feature map.
25. An object detection method for an image, comprising:
extracting feature information of the image data based on the feature extraction method of the image according to any one of claims 15 to 24;
predicting a preset anchor point window based on the characteristic information to obtain a prediction result;
and carrying out non-extreme value suppression processing on the prediction result to obtain each detection target.
26. A terminal comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any one of claims 15 to 24 or claim 25.
CN201811589348.3A 2018-12-25 2018-12-25 Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal Active CN111368850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811589348.3A CN111368850B (en) 2018-12-25 2018-12-25 Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal

Publications (2)

Publication Number Publication Date
CN111368850A CN111368850A (en) 2020-07-03
CN111368850B true CN111368850B (en) 2022-11-25

Family

ID=71205952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811589348.3A Active CN111368850B (en) 2018-12-25 2018-12-25 Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal

Country Status (1)

Country Link
CN (1) CN111368850B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507861A (en) * 2020-12-04 2021-03-16 江苏科技大学 Pedestrian detection method based on multilayer convolution feature fusion
CN112867010B (en) * 2021-01-14 2023-04-18 中国科学院国家空间科学中心 Radio frequency fingerprint embedded real-time identification method and system based on convolutional neural network
CN113205131A (en) * 2021-04-28 2021-08-03 阿波罗智联(北京)科技有限公司 Image data processing method and device, road side equipment and cloud control platform
CN117501277A (en) * 2021-11-30 2024-02-02 英特尔公司 Apparatus and method for dynamic quad convolution in 3D CNN
CN117437429A (en) * 2022-07-15 2024-01-23 华为技术有限公司 Image data processing method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446694A (en) * 2017-02-16 2018-08-24 杭州海康威视数字技术股份有限公司 A kind of object detection method and device
CN108510473A (en) * 2018-03-09 2018-09-07 天津工业大学 The FCN retinal images blood vessel segmentations of convolution and channel weighting are separated in conjunction with depth
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157814B2 (en) * 2016-11-15 2021-10-26 Google Llc Efficient convolutional neural networks and techniques to reduce associated computational costs

Also Published As

Publication number Publication date
CN111368850A (en) 2020-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant