WO2020010975A1 - Image target detection method, device, storage medium and electronic device - Google Patents

Image target detection method, device, storage medium and electronic device

Info

Publication number
WO2020010975A1
Authority
WO
WIPO (PCT)
Prior art keywords
depth
feature
level
convolution layer
image
Prior art date
Application number
PCT/CN2019/090406
Other languages
English (en)
French (fr)
Inventor
赵世杰
李峰
左小祥
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP19833842.8A (publication EP3742394A4)
Publication of WO2020010975A1
Priority to US17/008,189 (publication US11176404B2)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • Image target detection method, device, storage medium and electronic device. This application claims priority to the Chinese patent application No. 201810754633.X, filed with the Chinese Patent Office on July 11, 2018 and entitled "Image target detection method, device and storage medium", the entire contents of which are incorporated herein by reference.
  • TECHNICAL FIELD The present application relates to the field of image processing, and in particular, to an image target detection method, device, storage medium, and electronic device.
  • BACKGROUND With the development of science and technology, object recognition in images has become an important topic in computer vision; that is, marking the objects to be identified in a given picture, such as a person, a car, a house, or the like.
  • Existing image target detection commonly relies on deep learning algorithms such as Faster-RCNN (Faster Regions with Convolutional Neural Network features) and YOLO (You Only Look Once).
  • Embodiments of the present application provide an image target detection method, device, storage medium, and electronic device that run faster and require fewer configuration resources, so as to solve the problem that existing image target detection methods and devices run slowly and cannot be implemented on mobile terminals with limited configuration resources.
  • An embodiment of the present application provides an image target detection method, including:
  • acquiring a detection image, an n-level depth feature map framework, and an m-level non-depth feature map framework, where n is an integer greater than or equal to 2 and m is an integer greater than or equal to 1, and the feature map frameworks include the output feature sizes and dimensions;
  • based on a depth feature extraction model, performing depth feature extraction on the (i-1)-level features of the detection image using the i-level depth feature map framework to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n;
  • based on a non-depth feature extraction model, performing non-depth feature extraction on the (j-1+n)-level features of the detection image using the j-level non-depth feature map framework to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m; and
  • based on a feature prediction model, performing information regression operations on the a-level features to (m+n)-level features of the detection image to obtain the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • An embodiment of the present application further provides an image target detection device, including:
  • an image and frame acquisition module, configured to acquire a detection image, an n-level depth feature map framework, and an m-level non-depth feature map framework, where n is an integer greater than or equal to 2 and m is an integer greater than or equal to 1, and the feature map frameworks include the output feature sizes and dimensions;
  • a depth feature extraction module, configured to perform depth feature extraction on the (i-1)-level features of the detection image using the i-level depth feature map framework based on the depth feature extraction model, to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n;
  • a non-depth feature extraction module, configured to perform non-depth feature extraction on the (j-1+n)-level features of the detection image using the j-level non-depth feature map framework based on the non-depth feature extraction model, to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m; and
  • a target detection module, configured to perform an information regression operation on the a-level features to (m+n)-level features of the detection image based on a feature prediction model, so as to obtain the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • An embodiment of the present application further provides a storage medium that stores processor-executable instructions, and when the instructions are executed by one or more processors, the foregoing image target detection method is implemented.
  • An embodiment of the present application further provides an electronic device including one or more processors and a storage device; the storage device is configured to store one or more executable program instructions, and the one or more processors are configured to execute the program instructions to perform the foregoing image target detection method.
  • FIG. 1 is a flowchart of an image target detection method according to an embodiment of the present application.
  • FIG. 2 is a flowchart of an image target detection method according to another embodiment of the present application.
  • FIG. 3a is a flowchart of step S202 of the image target detection method shown in FIG. 2 according to an embodiment of the present application;
  • FIG. 3b is a schematic diagram of feature extraction in step S202 of the image target detection method shown in FIG. 2 according to an embodiment of the present application;
  • FIG. 4a is a flowchart of step S203 of the image target detection method shown in FIG. 2 according to an embodiment of the present application;
  • FIG. 4b is a schematic diagram of feature extraction in step S203 of the image target detection method shown in FIG. 2 according to an embodiment of the present application;
  • FIG. 5a is a flowchart of step S204 of the image target detection method shown in FIG. 2 according to an embodiment of the present application;
  • FIG. 5b is a schematic diagram of feature extraction in step S204 of the image target detection method shown in FIG. 2 according to an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of an image target detection device according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an image target detection device according to another embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a depth feature extraction module of the image target detection device shown in FIG. 7 according to an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of a non-depth feature extraction module of the image target detection device shown in FIG. 7 according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a target detection module of the image target detection device shown in FIG. 7 according to an embodiment of the present application;
  • FIG. 11 is a schematic diagram of a specific implementation of an image target detection method and an image target detection device according to an embodiment of the present application;
  • FIG. 12 is a schematic diagram of a working environment structure of an electronic device in which an image target detection device is located according to an embodiment of the present application.
  • DETAILED DESCRIPTION Please refer to the drawings, wherein the same component symbols represent the same components, and the principle of the present application is exemplified by being implemented in an appropriate computing environment. The following description is based on the exemplified specific embodiments of the present application, which should not be construed as limiting other specific embodiments not detailed herein.
  • the image target detection method and the image target detection device in the embodiments of the present application can be set in any electronic device, and are used to detect and identify objects such as people, cars, and houses in pictures or photos.
  • The electronic device includes, but is not limited to, a wearable device, a head-mounted device, a healthcare platform, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), or a media player), multiprocessor systems, consumer electronic devices, small computers, mainframe computers, distributed computing environments including any of the systems or devices mentioned above, and so on.
  • the electronic device may be a mobile terminal on which an image target recognition application is installed. The mobile terminal can quickly extract target features in an image, and requires less configuration resources of the mobile terminal itself.
  • FIG. 1 is a flowchart of an image target detection method according to an embodiment of the present application.
  • the image target detection method of this embodiment may be implemented by using the foregoing electronic device.
  • the image target detection method of this embodiment includes:
  • Step S101 Obtain a detection image, an n-level depth feature map frame, and an m-level non-depth feature map frame, where n is an integer greater than or equal to 2, and m is an integer greater than or equal to 1.
  • The feature map framework includes the output feature sizes and dimensions;
  • Step S102 Based on the depth feature extraction model, use the i-level depth feature map framework to perform depth feature extraction on the (i-1)-level features of the detection image to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n;
  • Step S103 Based on the non-depth feature extraction model, use the j-level non-depth feature map framework to perform non-depth feature extraction on the (j-1+n)-level features of the detection image to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m;
  • Step S104 Based on the feature prediction model, perform information regression operations on the a-level features to (m+n)-level features of the detection image to obtain the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
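  • As a rough illustration of steps S101 to S104, the following sketch (in Python; the function and variable names are hypothetical, since the application does not prescribe an implementation) shows how the n depth feature levels, the m non-depth feature levels, and the regression step chain together:

```python
def detect_targets(image, depth_blocks, non_depth_blocks, heads, a=2):
    """Hypothetical end-to-end flow of steps S101-S104.

    depth_blocks: n modules implementing the depth feature extraction model (S102)
    non_depth_blocks: m modules implementing the non-depth feature extraction model (S103)
    heads: one feature prediction module per used feature level (S104)
    """
    features = []
    x = image                       # level-0 "features" are the raw pixels
    for block in depth_blocks:      # levels 1 .. n
        x = block(x)
        features.append(x)
    for block in non_depth_blocks:  # levels n+1 .. n+m
        x = block(x)
        features.append(x)
    # Levels 1 .. a-1 are discarded; the target type and position are
    # regressed from the a-level to (m+n)-level features.
    used = features[a - 1:]
    return [head(f) for head, f in zip(heads, used)]
```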
  • the image target detection process of the image target detection method of this embodiment is described in detail below.
  • the image target detection device in the following embodiments is an electronic device that can execute an image target detection method.
  • In step S101, the image target detection device acquires a detection image on which target detection needs to be performed, and the n-level depth feature map framework and m-level non-depth feature map framework used for performing target detection on the detection image.
  • n is an integer of 2 or more
  • m is an integer of 1 or more. That is, at least three feature extraction operations are performed on the detection image.
  • In this embodiment, multi-level feature extraction operations, for example (m+n) levels, need to be performed on the detection image. Since the feature size of a lower level must be smaller than the feature size of its upper level, the feature extraction operation of a lower level can be performed on the features output by the feature extraction operation of the upper level. Large-size feature extraction operations have few preceding feature extraction levels, so a depth feature extraction model and a depth feature map framework are needed for feature extraction. Small-size feature extraction operations follow multiple preceding higher-level feature extraction operations, so only a non-depth feature extraction model and a non-depth feature map framework are needed for feature extraction.
  • the depth feature map framework is a recognition parameter for performing feature recognition on a detection image or a lower-level feature corresponding to the detection image, and the depth feature map framework may include feature sizes and dimensions output by each depth feature level.
  • The non-depth feature map framework is a recognition parameter for performing feature recognition on the lower-level features corresponding to the detection image, and the non-depth feature map framework may include the feature sizes and dimensions output at each non-depth feature level.
  • In step S102, the image target detection device uses the i-level depth feature map framework obtained in step S101 to perform depth feature extraction on the (i-1)-level features of the detection image based on the preset depth feature extraction model, so as to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n.
  • Specifically, the image target detection device performs depth feature extraction on the pixels of the detection image based on a preset depth feature extraction model to obtain the level-1 features of the detection image corresponding to the level-1 depth feature map framework; it then performs depth feature extraction on the level-1 features to obtain the level-2 features of the detection image corresponding to the level-2 depth feature map framework; finally, it performs depth feature extraction on the (n-1)-level features of the detection image to obtain the n-level features of the detection image corresponding to the n-level depth feature map framework. In this way, the level-1 to n-level features of the detection image are obtained.
  • In step S103, the image target detection device uses the j-level non-depth feature map framework obtained in step S101 to perform non-depth feature extraction on the (j-1+n)-level features of the detection image based on the preset non-depth feature extraction model, so as to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m.
  • Specifically, the image target detection device performs non-depth feature extraction on the n-level features of the detection image based on a preset non-depth feature extraction model to obtain the (n+1)-level features of the detection image corresponding to the 1-level non-depth feature map framework; it then performs non-depth feature extraction on the (n+1)-level features of the detection image to obtain the (n+2)-level features of the detection image corresponding to the 2-level non-depth feature map framework; finally, it performs non-depth feature extraction on the (n+m-1)-level features of the detection image to obtain the (m+n)-level features of the detection image corresponding to the m-level non-depth feature map framework. In this way, the (n+1)-level features to (m+n)-level features of the detection image are obtained.
  • In step S104, the image target detection device performs an information regression operation on the a-level features to (m+n)-level features of the detection image obtained in steps S102 and S103 based on a preset feature prediction model, thereby obtaining the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • In this embodiment, since the feature sizes of the level-1 to (a-1)-level features of the detection image are large, they are of little significance for feature classification and recognition, so the image target detection device directly discards the level-1 to (a-1)-level features of the detection image.
  • The image target detection device performs feature classification and recognition on the a-level features to (m+n)-level features of the detection image, so as to obtain the target type (such as a person, a car, a house, etc.) and the target position (such as the center coordinates of the target and the length and width of the target box) of the detection image corresponding to the features.
  • The image target detection method of this embodiment performs feature extraction and feature recognition on multiple features of different sizes of the same detection image based on a depth feature extraction model and a non-depth feature extraction model. Because the small-size features of the detection image can be extracted directly from the large-size features of the detection image, the overall feature extraction speed is faster and the demand for configuration resources is lower.
  • FIG. 2 is a flowchart of an image target detection method according to another embodiment of the present application.
  • the image target detection method of this embodiment may be implemented by using the foregoing electronic device.
  • the image target detection method of this embodiment includes:
  • Step S201 Obtain a detection image, an n-level depth feature map frame, and an m-level non-depth feature map frame, where n is an integer greater than or equal to 2 and m is an integer greater than or equal to 1; wherein the feature map framework includes the output feature size and dimensions;
  • Step S202 Based on the depth feature extraction model, use the i-level depth feature map framework to perform depth feature extraction on the (i-1)-level features of the detection image to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n;
  • Step S203 Based on the non-depth feature extraction model, use the j-level non-depth feature map framework to perform non-depth feature extraction on the (j-1+n)-level features of the detection image to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m;
  • Step S204 Based on the feature prediction model, perform information regression operations on the a-level features to (m+n)-level features of the detection image to obtain the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • In step S201, the image target detection device acquires a detection image on which target detection needs to be performed, and the n-level depth feature map framework and m-level non-depth feature map framework used for performing target detection on the detection image.
  • n is an integer of 2 or more
  • m is an integer of 1 or more. That is, at least three feature extraction operations are performed on the detection image.
  • the depth feature map framework is a recognition parameter for performing feature recognition on a detection image or a lower-level feature corresponding to the detection image, and the depth feature map framework may include feature sizes and dimensions output by each depth feature level.
  • The non-depth feature map framework is a recognition parameter for performing feature recognition on the lower-level features corresponding to the detection image, and the non-depth feature map framework may include the feature sizes and dimensions output at each non-depth feature level.
  • In step S202, the image target detection device uses the i-level depth feature map framework obtained in step S201 to perform depth feature extraction on the (i-1)-level features of the detection image based on the preset depth feature extraction model, so as to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n.
  • the depth feature extraction model includes a depth input convolution layer, a depth first non-linear conversion convolution layer, a depth second non-linear conversion convolution layer, and a depth output convolution layer.
  • As shown in FIG. 3a and FIG. 3b, FIG. 3a is a flowchart of step S202 of the image target detection method shown in FIG. 2 according to an embodiment of the present application, and FIG. 3b is a schematic diagram of feature extraction in step S202 of the image target detection method shown in FIG. 2 according to an embodiment of the present application. This step S202 includes:
  • In step S301, the image target detection device uses the depth input convolution layer of the depth feature extraction model to perform a dimension-raising operation on the (i-1)-level features of the detection image to obtain the i-level raised-dimension features of the detection image.
  • the depth input convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and a non-linear activation function.
  • The depth input convolution layer can be set with a larger number of channels, such as 4 to 6. This can raise the dimension of the input features while keeping the feature size of the detection image, thereby solving the problem of missing features of the detection image.
  • the number of channels of the depth input convolution layer is used to indicate the number of feature extraction modes for feature extraction from the low-level features of the detected image
  • the size of the convolution kernel of the depth input convolution layer is used to adjust the complexity of the deep neural network model.
  • For example, suppose the (i-1)-level feature of the input detection image is a 32 * 32 * 3 feature point matrix, where 3 is the number of input channels of the detection image (for example, the red, blue, and green pixel brightness values). If the convolution kernel size of the depth input convolution layer is set to 1 * 1, the output feature size of the depth input convolution layer is 32 * 32; that is, traversing the 32 * 32 feature point matrix with the 1 * 1 convolution kernel in sequence yields a feature map of size 32 * 32. If the number of channels of the depth input convolution layer is 6, the depth input convolution layer outputs 32 * 32 * 18 i-level raised-dimension features. In this way, a feature of the detection image with a higher dimension is obtained without changing the size of the output features.
  • In addition, the depth input convolution layer uses a non-linear activation function, such as a rectified linear unit (ReLU), to perform non-linear processing on the output i-level raised-dimension features, ensuring that the output of the depth input convolution layer is differentiable and thereby improving the accuracy of subsequent output features.
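  • The worked example above can be reproduced with the following sketch (PyTorch is an assumed framework; the application does not name one). A 1 * 1 standard convolution raises a 32 * 32 * 3 input to 32 * 32 * 18 without changing the feature size:

```python
import torch
import torch.nn as nn

# Depth input convolution layer: 1*1 standard conv plus non-linear activation.
# Channel counts follow the example in the text (3 input channels -> 18 outputs).
depth_input = nn.Sequential(
    nn.Conv2d(3, 18, kernel_size=1),  # 1*1 kernel keeps the 32*32 feature size
    nn.ReLU(inplace=True),            # non-linear activation (ReLU)
)

x = torch.randn(1, 3, 32, 32)         # (i-1)-level features: 32*32*3
print(depth_input(x).shape)           # torch.Size([1, 18, 32, 32])
```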
  • In step S302, the image target detection device uses the depth first non-linear transformation convolution layer of the depth feature extraction model to perform a first feature extraction operation on the i-level raised-dimension features of the detection image obtained in step S301, so as to obtain the i-level first convolution features of the detection image.
  • The depth first non-linear transformation convolution layer is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and a non-linear activation function.
  • Using a depthwise separable convolution layer greatly reduces the computation of the depth first non-linear transformation convolution layer, which further reduces the size of the depth feature extraction model.
  • In this embodiment, the depth first non-linear transformation convolution layer first performs a first feature extraction operation on the i-level raised-dimension features of the detection image, and then uses a non-linear activation function, such as a rectified linear unit (ReLU), to perform non-linear processing on the output i-level first convolution features, ensuring that the output of the depth first non-linear transformation convolution layer is differentiable and thereby improving the accuracy of subsequent output features.
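  • A minimal sketch of this layer, continuing the hypothetical 18-channel example above (since the surrounding 1 * 1 layers already supply the pointwise step, the separable layer is modeled here as its depthwise part, which is an assumption):

```python
import torch.nn as nn

# Depth first non-linear transformation layer: 3*3 depthwise conv + ReLU.
# groups == channels makes the convolution depthwise, cutting the weight
# count from 3*3*18*18 (standard conv) to 3*3*18.
depth_first = nn.Sequential(
    nn.Conv2d(18, 18, kernel_size=3, padding=1, groups=18),
    nn.ReLU(inplace=True),
)
```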
  • In step S303, the image target detection device uses the depth second non-linear transformation convolution layer of the depth feature extraction model to perform a second feature extraction operation on the i-level first convolution features of the detection image obtained in step S302, so as to obtain the i-level second convolution features of the detection image.
  • The depth second non-linear transformation convolution layer is a depthwise separable dilated (atrous) convolution layer with a 3 * 3 convolution kernel size and a non-linear activation function.
  • Using a depthwise separable dilated convolution layer greatly reduces the computation of the depth second non-linear transformation convolution layer while increasing the receptive field of each basic feature unit of the detection image, thereby further improving the accuracy of the i-level second convolution features output by the depth second non-linear transformation convolution layer.
  • A dilated (hole) convolution adds a "dilation rate" parameter to the convolution operation, which defines the spacing between the data points sampled when the convolution layer processes the data. For example, a standard convolution layer with a 5 * 5 convolution kernel needs 25 parameters, but a dilated convolution layer with a 3 * 3 convolution kernel and a dilation rate of 2 needs only 9 parameters: on the basis of the 5 * 5 convolution kernel, every other row and every other column of weights is removed. Therefore, for the same amount of computation, the dilated convolution layer provides a larger receptive field.
  • In this embodiment, the dilated convolution is placed at the depth second non-linear transformation convolution layer. Building on the initial depth feature extraction already performed by the depth first non-linear transformation convolution layer, depth feature extraction can be performed again with fewer resources, which well compensates for the smaller receptive field of the first feature extraction operation.
  • The depth second non-linear transformation convolution layer first performs a second feature extraction operation on the i-level first convolution features of the detection image, and then uses a non-linear activation function, such as a rectified linear unit (ReLU), to perform non-linear processing on the output i-level second convolution features, ensuring that the output of the depth second non-linear transformation convolution layer is differentiable and thereby improving the accuracy of subsequent output features.
  • In step S304, the image target detection device uses the depth output convolution layer of the depth feature extraction model to perform a dimension-reduction operation on the i-level second convolution features of the detection image obtained in step S303 to obtain the i-level features of the detection image.
  • the depth output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function.
  • the depth output convolution layer can restore the dimension added in step S301 to the dimension input to the depth input convolution layer; and no activation function is set in the depth output convolution layer to avoid loss of output features caused by the activation function.
  • the level i features of the detection image output by the depth output convolution layer should be consistent with the level i depth feature map framework.
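  • Putting steps S301 to S304 together, one level of the depth feature extraction model can be sketched as the following module. The channel counts, the expansion factor, and the stride used to shrink the feature size between levels are assumptions; in the application they are dictated by the i-level depth feature map framework:

```python
import torch.nn as nn

class DepthFeatureBlock(nn.Module):
    """Sketch of one level of the depth feature extraction model (S301-S304)."""

    def __init__(self, in_ch, out_ch, expand=6, stride=2):
        super().__init__()
        mid = in_ch * expand  # dimension raised by the input convolution layer
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),    # S301: raise dimension
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, stride=stride,
                      padding=1, groups=mid),        # S302: depthwise conv
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=2,
                      dilation=2, groups=mid),       # S303: dilated depthwise conv
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),   # S304: reduce, no activation
        )

    def forward(self, x):
        return self.block(x)
```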
  • In step S203, the image target detection device uses the j-level non-depth feature map framework obtained in step S201 to perform non-depth feature extraction on the (j-1+n)-level features of the detection image based on the preset non-depth feature extraction model, so as to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m.
  • the non-depth feature extraction model includes a non-depth input convolution layer, a non-depth non-linear transformation convolution layer, and a non-depth output convolution layer.
  • As shown in FIG. 4a and FIG. 4b, FIG. 4a is a flowchart of step S203 of the image target detection method shown in FIG. 2 according to an embodiment of the present application, and FIG. 4b is a schematic diagram of feature extraction in step S203 of the image target detection method shown in FIG. 2 according to an embodiment of the present application. This step S203 includes:
  • In step S401, the image target detection device uses the non-depth input convolution layer of the non-depth feature extraction model to perform a dimension-raising operation on the (j-1+n)-level features of the detection image to obtain the (j+n)-level raised-dimension features of the detection image.
  • the non-depth input convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and a non-linear activation function.
  • The non-depth input convolution layer can be set with a larger number of channels, such as 4 to 6. This can raise the dimension of the input features while keeping the feature size of the detection image, thereby solving the problem of missing features of the detection image.
  • The non-depth input convolution layer uses a non-linear activation function, such as a rectified linear unit (ReLU), to perform non-linear processing on the output (j+n)-level raised-dimension features, ensuring that the output of the non-depth input convolution layer is differentiable and thereby improving the accuracy of subsequent output features.
  • In step S402, the image target detection device uses the non-depth non-linear transformation convolution layer of the non-depth feature extraction model to perform a feature extraction operation on the (j+n)-level raised-dimension features of the detection image obtained in step S401 to obtain the (j+n)-level convolution features of the detection image.
  • The non-depth non-linear transformation convolution layer is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and a non-linear activation function.
  • Using a depthwise separable convolution layer greatly reduces the computation of the non-depth non-linear transformation convolution layer, which further reduces the size of the non-depth feature extraction model.
  • The non-depth non-linear transformation convolution layer here may also be a depthwise separable dilated convolution layer.
  • Since multiple levels of feature extraction have already been performed before, the non-depth feature extraction model here only needs one non-depth non-linear transformation convolution layer for feature extraction, and does not need multiple non-linear transformation convolution layers.
  • The non-depth non-linear transformation convolution layer first performs a feature extraction operation on the (j+n)-level raised-dimension features of the detection image, and then uses a non-linear activation function, such as a rectified linear unit (ReLU), to perform non-linear processing on the output (j+n)-level convolution features, ensuring that the output of the non-depth non-linear transformation convolution layer is differentiable and thereby improving the accuracy of subsequent output features.
  • In step S403, the image target detection device uses the non-depth output convolution layer of the non-depth feature extraction model to perform a dimension-reduction operation on the (j+n)-level convolution features of the detection image obtained in step S402 to obtain the (j+n)-level features of the detection image.
  • The non-depth output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function.
  • the non-depth output convolution layer can restore the dimension added in step S401 to the dimension input to the non-depth input convolution layer; and no activation function is set in the non-depth output convolution layer to avoid the output caused by the activation function. Loss of features.
  • the (j + n) level features of the detection image output by the non-depth output convolution layer should conform to the j-level non-depth feature map framework.
  • Steps S401 to S403 are repeated to obtain the (n+1)-level features to (m+n)-level features of the detection image.
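  • By the same construction, one level of the non-depth feature extraction model (steps S401 to S403) can be sketched as a three-layer module; it differs from the depth block only in having a single non-linear transformation convolution layer (channel counts and stride remain assumptions):

```python
import torch.nn as nn

class NonDepthFeatureBlock(nn.Module):
    """Sketch of one level of the non-depth feature extraction model (S401-S403)."""

    def __init__(self, in_ch, out_ch, expand=6, stride=2):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),    # S401: raise dimension
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, stride=stride,
                      padding=1, groups=mid),        # S402: depthwise conv
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),   # S403: reduce, no activation
        )

    def forward(self, x):
        return self.block(x)
```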
  • In this embodiment, feature extraction on the detection image uses either the depth feature extraction model or the non-depth feature extraction model depending on the depth of the extraction level, which can greatly reduce the computation of the feature extraction operations.
  • the value of n can be set according to user requirements. If the amount of calculation for feature extraction is large, the size of n can be appropriately reduced. If the accuracy of feature extraction needs to be improved, the size of n can be appropriately increased.
  • In step S204, the image target detection device performs an information regression operation on the a-level features to (m+n)-level features of the detection image obtained in steps S202 and S203 based on a preset feature prediction model, thereby obtaining the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • the function of the feature prediction model is equivalent to a regressor, which is used to obtain the target type and target position of the target in the image.
  • The target type here is identified by a classification probability; for example, a target may have an 80% probability of being a cat and a 20% probability of being a dog.
  • the feature prediction model includes a feature classification convolution layer and a feature output convolution layer.
  • As shown in FIG. 5a and FIG. 5b, FIG. 5a is a flowchart of step S204 of the image target detection method shown in FIG. 2 according to an embodiment of the present application, and FIG. 5b is a schematic diagram of feature extraction in step S204 of the image target detection method shown in FIG. 2 according to an embodiment of the present application. This step S204 includes:
  • In step S501, the image target detection device uses the feature classification convolution layer of the feature prediction model to perform a feature extraction operation on the a-level features to (m+n)-level features of the detection image to obtain the classification and recognition features of the detection image.
  • The feature classification convolution layer is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and no activation function. Because the feature sizes of the level-1 to (a-1)-level features of the detection image are large, they generally do not correspond to targets of the detection image, so all features of levels below the a-level features of the detection image are discarded here.
  • The image target detection device performs feature extraction operations on the a-level features to (m+n)-level features of the detection image, so as to obtain the classification and recognition features of the detection image for the subsequent prediction of the target type and target position of the detection image.
  • a part of the features from level a to level (m + n) can be selected for feature extraction operation, thereby further reducing the calculation amount of the feature extraction operation.
  • In step S502, the image target detection device uses the feature output convolution layer of the feature prediction model to perform a dimension-reduction operation on the classification and recognition features of the detection image obtained in step S501 to obtain the target type and target position of the detection image.
  • the feature output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function. There is no activation function set in the feature output convolution layer to avoid the loss of output features caused by the activation function.
  • the target type output here can be people, cars, houses, etc.
  • the output target position can be parameters such as the center coordinates of the target and the length and width of the target box.
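  • A sketch of the feature prediction model for a single feature level follows. The class count and the number of boxes predicted per location are hypothetical; the application only specifies a 3 * 3 depthwise separable classification convolution and a 1 * 1 output convolution, both without activation functions, regressing a target type (classification probabilities) and a target position (center coordinates and box length and width):

```python
import torch.nn as nn

class FeaturePredictionHead(nn.Module):
    """Sketch of the feature prediction model for one feature level (S501-S502)."""

    def __init__(self, in_ch, num_classes=21, boxes_per_loc=4):
        super().__init__()
        # S501: 3*3 depthwise conv extracts the classification/recognition
        # features; no activation function.
        self.classify = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                  padding=1, groups=in_ch)
        # S502: 1*1 conv reduces dimension to per-location predictions:
        # class probabilities plus 4 box parameters (cx, cy, w, h) per box.
        self.output = nn.Conv2d(in_ch, boxes_per_loc * (num_classes + 4),
                                kernel_size=1)

    def forward(self, feat):
        return self.output(self.classify(feat))
```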
  • The depth feature extraction model and the non-depth feature extraction model in the image target detection method of this embodiment adopt different structures, and the depth first non-linear transformation convolution layer and the depth second non-linear transformation convolution layer in the depth feature extraction model also adopt different structures. This maximizes the extraction speed of the target features of the detection image and further reduces the need for configuration resources, thereby enabling the target detection function to be deployed on mobile terminals.
  • An embodiment of the present application further provides an image target detection device.
  • Refer to FIG. 6, which is a schematic structural diagram of an image target detection device according to an embodiment of the present application.
  • the image target detection device of this embodiment may implement the above-mentioned image target detection method shown in FIG. 1.
  • The image target detection device 60 of this embodiment includes an image and frame acquisition module 61, a depth feature extraction module 62, a non-depth feature extraction module 63, and a target detection module 64.
  • The image and frame acquisition module 61 is configured to acquire a detection image, an n-level depth feature map framework, and an m-level non-depth feature map framework, where n is an integer greater than or equal to 2 and m is an integer greater than or equal to 1, and the feature map frameworks include the output feature sizes and dimensions;
  • the depth feature extraction module 62 is configured to perform depth feature extraction on the (i-1)-level features of the detection image using the i-level depth feature map framework based on the depth feature extraction model, to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n;
  • the non-depth feature extraction module 63 is configured to perform non-depth feature extraction on the (j-1+n)-level features of the detection image using the j-level non-depth feature map framework based on the non-depth feature extraction model, to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m;
  • the target detection module 64 is configured to perform information regression operations on the a-level features to (m+n)-level features of the detection image based on the feature prediction model, to obtain the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • In this embodiment, the image and frame acquisition module 61 acquires a detection image on which target detection needs to be performed, and the n-level depth feature map framework and m-level non-depth feature map framework used for performing target detection on the detection image.
  • n is an integer of 2 or more
  • m is an integer of 1 or more. That is, at least three feature extraction operations are performed on the detection image.
  • In this embodiment, multi-level feature extraction operations, for example (m+n) levels, need to be performed on the detection image. Since the feature size of a lower level must be smaller than the feature size of its upper level, the feature extraction operation of a lower level can be performed on the features output by the feature extraction operation of the upper level. Large-size feature extraction operations have few preceding feature extraction levels, so a depth feature extraction model and a depth feature map framework are needed for feature extraction. Small-size feature extraction operations follow multiple preceding higher-level feature extraction operations, so only a non-depth feature extraction model and a non-depth feature map framework are needed for feature extraction.
  • the depth feature map framework is a recognition parameter for performing feature recognition on a detection image or a lower-level feature corresponding to the detection image, and the depth feature map framework may include feature sizes and dimensions output by each depth feature level.
  • The non-depth feature map framework is a recognition parameter for performing feature recognition on the lower-level features corresponding to the detection image, and the non-depth feature map framework may include the feature sizes and dimensions output at each non-depth feature level.
  • The depth feature extraction module 62 uses the i-level depth feature map framework to perform depth feature extraction on the (i-1)-level features of the detection image based on a preset depth feature extraction model, to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n.
  • Specifically, the depth feature extraction module 62 performs depth feature extraction on the pixels of the detection image based on a preset depth feature extraction model to obtain the level-1 features of the detection image corresponding to the level-1 depth feature map framework; it then performs depth feature extraction on the level-1 features of the detection image to obtain the level-2 features of the detection image corresponding to the level-2 depth feature map framework; finally, it performs depth feature extraction on the (n-1)-level features of the detection image to obtain the n-level features of the detection image corresponding to the n-level depth feature map framework. In this way, the level-1 to n-level features of the detection image are obtained.
  • The non-depth feature extraction module 63 uses the j-level non-depth feature map framework to perform non-depth feature extraction on the (j-1+n)-level features of the detection image based on a preset non-depth feature extraction model, to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m.
  • Specifically, the non-depth feature extraction module 63 performs non-depth feature extraction on the n-level features of the detection image based on a preset non-depth feature extraction model to obtain the (n+1)-level features of the detection image corresponding to the 1-level non-depth feature map framework; it then performs non-depth feature extraction on the (n+1)-level features of the detection image to obtain the (n+2)-level features of the detection image corresponding to the 2-level non-depth feature map framework; finally, it performs non-depth feature extraction on the (n+m-1)-level features of the detection image to obtain the (m+n)-level features of the detection image corresponding to the m-level non-depth feature map framework. In this way, the (n+1)-level to (m+n)-level features of the detection image are obtained.
  • The target detection module 64 performs information regression operations on the a-level features to (m+n)-level features of the detection image based on a preset feature prediction model, thereby obtaining the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • Since the feature sizes of the level-1 to (a-1)-level features of the detection image are large and of little significance for feature classification and recognition, the target detection module 64 directly discards the level-1 to (a-1)-level features of the detection image.
  • The target detection module 64 performs feature classification and recognition on the a-level features to (m+n)-level features of the detection image, so as to obtain the target type (such as a person, a car, a house, etc.) and the target position (such as the center coordinates of the target and the length and width of the target box) of the detection image corresponding to the features.
  • The image target detection device of this embodiment performs feature extraction and feature recognition on multiple features of different sizes of the same detection image based on a depth feature extraction model and a non-depth feature extraction model. Because the small-size features of the detection image can be extracted directly from the large-size features of the detection image, the overall feature extraction speed is faster and the demand for configuration resources is lower.
  • FIG. 7 is a schematic structural diagram of an image target detection device according to another embodiment of the present application.
  • the image target detection device of this embodiment may implement the image target detection method shown in FIG. 2 described above.
  • The image target detection device 70 of this embodiment includes an image and frame acquisition module 71, a depth feature extraction module 72, a non-depth feature extraction module 73, and a target detection module 74.
  • The image and frame acquisition module 71 is configured to acquire a detection image, an n-level depth feature map framework, and an m-level non-depth feature map framework, where the feature map frameworks include the output feature sizes and dimensions. The depth feature extraction module 72 is configured to perform depth feature extraction on the (i-1)-level features of the detection image using the i-level depth feature map framework based on the depth feature extraction model, to obtain the i-level features of the detection image. The non-depth feature extraction module 73 is configured to perform non-depth feature extraction on the (j-1+n)-level features of the detection image using the j-level non-depth feature map framework based on the non-depth feature extraction model, to obtain the (j+n)-level features of the detection image.
  • the target detection module 74 is configured to perform information regression operations on the a-level features to (m + n) -level features of the detection image based on the feature prediction model to obtain Detects the target type and target position of the image.
  • FIG. 8 is a schematic structural diagram of a depth feature extraction module of the image target detection device shown in FIG. 7 according to an embodiment of the present application.
  • The depth feature extraction module 72 includes a depth dimension-raising operation unit 81, a first depth feature extraction unit 82, a second depth feature extraction unit 83, and a depth dimension-reduction operation unit 84.
  • the depth feature extraction model includes a depth input convolution layer, a depth first non-linear conversion convolution layer, a depth second non-linear conversion convolution layer, and a depth output convolution layer.
  • The depth dimension-raising operation unit 81 is configured to use the depth input convolution layer to perform a dimension-raising operation on the (i-1)-level features of the detection image to obtain the i-level raised-dimension features of the detection image.
  • The first depth feature extraction unit 82 is configured to use the depth first non-linear transformation convolution layer to perform a first feature extraction operation on the i-level raised-dimension features of the detection image to obtain the i-level first convolution features of the detection image;
  • the second depth feature extraction unit 83 is configured to use the depth second non-linear transformation convolution layer to perform a second feature extraction operation on the i-level first convolution features of the detection image to obtain the i-level second convolution features of the detection image;
  • the depth dimension-reduction operation unit 84 is configured to use the depth output convolution layer to perform a dimension-reduction operation on the i-level second convolution features of the detection image to obtain the i-level features of the detection image.
  • FIG. 9 is a schematic structural diagram of a non-depth feature extraction module of the image target detection device shown in FIG. 7 according to an embodiment of the present application.
  • The non-depth feature extraction module 73 includes a non-depth dimension-raising operation unit 91, a non-depth feature extraction unit 92, and a non-depth dimension-reduction operation unit 93.
  • the non-depth feature extraction model includes a non-depth input convolution layer, a non-depth non-linear transformation convolution layer, and a non-depth output convolution layer.
  • The non-depth dimension-raising operation unit 91 is configured to use the non-depth input convolution layer to perform a dimension-raising operation on the (j-1+n)-level features of the detection image to obtain the (j+n)-level raised-dimension features of the detection image;
  • the non-depth feature extraction unit 92 is configured to use the non-depth non-linear transformation convolution layer to perform a feature extraction operation on the (j+n)-level raised-dimension features of the detection image to obtain the (j+n)-level convolution features of the detection image;
  • the non-depth dimension-reduction operation unit 93 is configured to use the non-depth output convolution layer to perform a dimension-reduction operation on the (j+n)-level convolution features of the detection image to obtain the (j+n)-level features of the detection image.
  • FIG. 10 is a schematic structural diagram of a target detection module of the image target detection device shown in FIG. 7 according to an embodiment of the present application.
  • the target detection module 74 includes a feature classification unit 101 and a feature output unit 102.
  • the feature prediction model includes a feature classification convolution layer and a feature output convolution layer.
  • The feature classification unit 101 is configured to use the feature classification convolution layer to perform a feature extraction operation on the a-level features to (m+n)-level features of the detection image to obtain the classification and recognition features of the detection image;
  • the feature output unit 102 is configured to use the feature output convolution layer to perform a dimension-reduction operation on the classification and recognition features of the detection image to obtain the target type and target position of the detection image.
  • In this embodiment, the image and frame acquisition module 71 acquires a detection image on which target detection needs to be performed, and the n-level depth feature map framework and m-level non-depth feature map framework used for performing target detection on the detection image.
  • n is an integer of 2 or more
  • m is an integer of 1 or more. That is, at least three feature extraction operations are performed on the detection image.
  • the depth feature map framework is a recognition parameter for performing feature recognition on a detection image or a lower-level feature corresponding to the detection image, and the depth feature map framework may include feature sizes and dimensions output by each depth feature level.
  • The non-depth feature map framework is a recognition parameter for performing feature recognition on the lower-level features corresponding to the detection image, and the non-depth feature map framework may include the feature sizes and dimensions output at each non-depth feature level.
  • The depth feature extraction module 72 uses the i-level depth feature map framework to perform depth feature extraction on the (i-1)-level features of the detection image based on a preset depth feature extraction model, to obtain the i-level features of the detection image, where i is a positive integer less than or equal to n.
  • the specific deep feature extraction process includes:
  • The depth dimension-raising operation unit 81 of the depth feature extraction module 72 uses the depth input convolution layer of the depth feature extraction model to perform a dimension-raising operation on the (i-1)-level features of the detection image to obtain the i-level raised-dimension features of the detection image.
  • the depth input convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and a non-linear activation function.
  • the depth input convolution layer can set a larger number of channels, such as 4-6, which increases the dimension of the input features while preserving the feature size of the detection image, thereby avoiding the loss of detection image features.
  • the number of channels of the depth input convolution layer indicates the number of feature extraction modes applied to the low-level features of the detection image, and the convolution kernel size of the depth input convolution layer is used to adjust the complexity of the deep neural network model.
  • for example, the (i-1)-level features of the input detection image form a 32 * 32 * 3 feature point matrix, where 3 is the number of input channels of the detection image, such as the red, blue, and green pixel brightness values.
  • if the convolution kernel size of the depth input convolution layer is set to 1 * 1, the output feature size of the depth input convolution layer is 32 * 32; that is, traversing the 32 * 32 feature point matrix with a 1 * 1 convolution kernel yields a feature map of size 32 * 32. If the number of channels of the depth input convolution layer is 6, the output of the depth input convolution layer is a 32 * 32 * 18 i-level raised-dimension feature. In this way, a raised-dimension feature of the detection image with a higher dimension is obtained without changing the output feature size, as sketched below.
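  • As a non-limiting illustration, this dimension-raising layer can be sketched as follows in PyTorch; the module name and the reading of the "6 channels" as 6 feature extraction modes times 3 input channels = 18 output dimensions are assumptions for illustration, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

# Sketch of the depth input convolution layer described above: a 1*1 standard
# convolution raises the 32*32*3 (i-1)-level features to 32*32*18 i-level
# raised-dimension features; a ReLU keeps the output non-linear and
# differentiable.
depth_input = nn.Sequential(
    nn.Conv2d(3, 18, kernel_size=1),  # dimension-raising, size unchanged
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
print(depth_input(x).shape)     # torch.Size([1, 18, 32, 32])
```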
  • the depth input convolution layer then applies a non-linear activation function, such as a rectified linear unit (ReLU), to the output i-level raised-dimension features, so that the output of the depth input convolution layer is differentiable, thereby improving the accuracy of subsequent output features.
  • the first depth feature extraction unit 82 of the depth feature extraction module 72 uses the depth first non-linear transformation convolution layer of the depth feature extraction model to perform a first feature extraction operation on the i-level raised-dimension features of the detection image to obtain the i-level first convolution features of the detection image.
  • the depth first non-linear transformation convolution layer is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and a non-linear activation function. Using a depthwise separable convolution layer greatly reduces the computation of the first non-linear transformation convolution layer, which in turn greatly reduces the size of the depth feature extraction model.
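  • The saving can be made concrete with a parameter count. The following sketch compares a standard 3 * 3 convolution with its depthwise separable counterpart; the 16-to-32 channel sizes are assumed purely for illustration:

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Standard 3*3 convolution mapping 16 channels to 32 (bias omitted for a
# clean count): every output channel filters every input channel.
standard = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)

# Depthwise separable version: one 3*3 kernel per input channel (groups=16),
# then a 1*1 pointwise convolution to mix channels.
separable = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16, bias=False),
    nn.Conv2d(16, 32, kernel_size=1, bias=False),
)

print(n_params(standard))   # 16*32*3*3 = 4608
print(n_params(separable))  # 16*3*3 + 16*32 = 656
```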
  • the depth first non-linear transformation convolution layer first performs the first feature extraction operation on the i-level raised-dimension features of the detection image, and then applies a non-linear activation function, such as a rectified linear unit (ReLU), to the output i-level first convolution features, so that the output of the depth first non-linear transformation convolution layer is differentiable, thereby improving the accuracy of subsequent output features.
  • the second depth feature extraction unit 83 of the depth feature extraction module 72 uses the depth second non-linear transformation convolution layer of the depth feature extraction model to perform a second feature extraction operation on the i-level first convolution features of the detection image to obtain the i-level second convolution features of the detection image.
  • the depth second non-linear transformation convolution layer is a depthwise separable dilated convolution layer (atrous convolution) with a 3 * 3 convolution kernel size and a non-linear activation function. Using a depthwise separable dilated convolution layer greatly reduces the computation of the second non-linear transformation convolution layer while enlarging the receptive field of each basic feature unit of the detection image, thereby further improving the accuracy of the i-level second convolution features output by the second non-linear transformation convolution layer.
  • a dilated convolution introduces a "dilation rate" parameter into the convolution operation, which defines the spacing between the data points processed by the convolution layer. For example, a standard convolution layer with a 5 * 5 convolution kernel requires 25 parameters; a dilated convolution layer with a 3 * 3 convolution kernel and a dilation rate of 2 covers the same extent with only 9 parameters, which amounts to deleting every other row and every other column from the 5 * 5 kernel. Under the same computing conditions, the dilated convolution layer therefore provides a larger receptive field without increasing the amount of computation, as the sketch below illustrates.
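  • A minimal sketch of this trade-off, assuming single-channel layers for clarity: the dilated 3 * 3 kernel keeps 9 weights yet spans the same 5 * 5 extent as the dense kernel's 25 weights.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

def receptive_extent(kernel_size, dilation):
    # Span covered by one convolution: k + (k - 1) * (d - 1)
    return kernel_size + (kernel_size - 1) * (dilation - 1)

dense = nn.Conv2d(1, 1, kernel_size=5, bias=False)                # 25 weights
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, bias=False)  # 9 weights

print(n_params(dense), receptive_extent(5, 1))    # 25 5
print(n_params(dilated), receptive_extent(3, 2))  # 9 5
```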
  • placing the dilated convolution layer at the depth second non-linear transformation convolution layer allows depth features to be extracted again with fewer resources, building on the preliminary depth feature extraction already performed by the depth first non-linear transformation convolution layer, and compensates well for the smaller receptive field of the first feature extraction operation.
  • the depth second non-linear transformation convolution layer first performs the second feature extraction operation on the i-level first convolution features of the detection image, and then applies a non-linear activation function, such as a rectified linear unit (ReLU), to the output i-level second convolution features, so that the output of the depth second non-linear transformation convolution layer is differentiable, thereby improving the accuracy of subsequent output features.
  • the depth dimension-reduction operation unit 84 of the depth feature extraction module 72 uses the depth output convolution layer of the depth feature extraction model to perform a dimension-reduction operation on the i-level second convolution features of the detection image to obtain the i-level features of the detection image.
  • the depth output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function.
  • the depth output convolution layer can restore the increased dimension to the dimension input to the depth input convolution layer; and no activation function is set in the depth output convolution layer to avoid the loss of output features caused by the activation function.
  • the i-level features of the detection image output by the depth output convolution layer should conform to the i-level depth feature map framework.
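  • Putting the four layers together, one depth feature extraction level could be sketched as below. This is an interpretation under stated assumptions (an expansion factor of 6, and stride and padding chosen to reproduce the 64 * 64 * 24 to 32 * 32 * 48 example given later in this document), not the patent's exact network definition:

```python
import torch
import torch.nn as nn

class DepthBlock(nn.Module):
    """Sketch of one depth feature extraction level: 1*1 expansion (ReLU),
    3*3 depthwise (ReLU), 3*3 depthwise dilated (ReLU), then a 1*1
    projection with no activation."""

    def __init__(self, in_dim, out_dim, stride=1, expand=6):
        super().__init__()
        mid = in_dim * expand
        self.layers = nn.Sequential(
            # depth input convolution layer: raise dimension, keep size
            nn.Conv2d(in_dim, mid, 1), nn.ReLU(),
            # depth first non-linear transformation layer: depthwise 3*3
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid),
            nn.ReLU(),
            # depth second non-linear transformation layer: depthwise dilated 3*3
            nn.Conv2d(mid, mid, 3, padding=2, dilation=2, groups=mid),
            nn.ReLU(),
            # depth output convolution layer: reduce dimension, no activation
            nn.Conv2d(mid, out_dim, 1),
        )

    def forward(self, x):
        return self.layers(x)

x = torch.randn(1, 24, 64, 64)
print(DepthBlock(24, 48, stride=2)(x).shape)  # torch.Size([1, 48, 32, 32])
```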
  • the non-depth feature extraction module 73, based on a preset non-depth feature extraction model, uses the j-level non-depth feature map framework to perform non-depth feature extraction on the (j-1+n)-level features of the detection image to obtain the (j+n)-level features of the detection image, where j is a positive integer less than or equal to m.
  • the specific non-depth feature extraction process includes:
  • the non-depth dimension-raising operation unit 91 of the non-depth feature extraction module 73 uses the non-depth input convolution layer of the non-depth feature extraction model to perform a dimension-raising operation on the (j-1+n)-level features of the detection image to obtain the (j+n)-level raised-dimension features of the detection image.
  • the non-depth input convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and a non-linear activation function.
  • the non-depth input convolution layer can set a larger number of channels, such as 4-6, which increases the dimension of the input features while preserving the feature size of the detection image, thereby avoiding the loss of detection image features.
  • the non-depth input convolution layer then applies a non-linear activation function, such as a rectified linear unit (ReLU), to the output (j+n)-level raised-dimension features, so that the output of the non-depth input convolution layer is differentiable, thereby improving the accuracy of subsequent output features.
  • the non-depth feature extraction unit 92 of the non-depth feature extraction module 73 uses the non-depth non-linear transformation convolution layer of the non-depth feature extraction model to perform a feature extraction operation on the (j+n)-level raised-dimension features of the detection image to obtain the (j+n)-level convolution features of the detection image.
  • the non-depth non-linear transformation convolution layer is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and a non-linear activation function. Using a depthwise separable convolution layer greatly reduces the computation of the non-depth non-linear transformation convolution layer, which in turn greatly reduces the size of the feature extraction model.
  • the non-depth non-linear transformation convolution layer here may also be a depthwise separable dilated convolution layer.
  • because the non-depth non-linear transformation convolution layer directly receives the features output by the depth non-linear transformation convolution layers, the non-depth feature extraction model here only needs a single non-depth non-linear transformation convolution layer for feature extraction, without setting multiple non-linear transformation convolution layers.
  • the non-depth non-linear transformation convolution layer first performs the feature extraction operation on the (j+n)-level raised-dimension features of the detection image, and then applies a non-linear activation function, such as a rectified linear unit (ReLU), to the output (j+n)-level convolution features, so that the output of the non-depth non-linear transformation convolution layer is differentiable, thereby improving the accuracy of subsequent output features.
  • the non-depth dimension-reduction operation unit 93 of the non-depth feature extraction module 73 uses the non-depth output convolution layer of the non-depth feature extraction model to perform a dimension-reduction operation on the (j+n)-level convolution features of the detection image to obtain the (j+n)-level features of the detection image.
  • the non-depth output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function.
  • the non-depth output convolution layer can restore the previously added dimensions to the dimensions input to the non-depth input convolution layer; and no activation function is set in the non-depth output convolution layer to avoid the loss of output features caused by the activation function.
  • the (j + n) level features of the detection image output by the non-depth output convolution layer should be consistent with the j-level non-depth feature map framework.
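  • By analogy, a non-depth feature extraction level can be sketched with a single depthwise transformation layer between the 1 * 1 expansion and projection layers. The expansion factor of 6 and the padding are assumptions; the demo shapes follow the 4-level-to-5-level example given later (16 * 16 * 64 to 8 * 8 * 144):

```python
import torch
import torch.nn as nn

def non_depth_block(in_dim, out_dim, stride=1, dilation=1, expand=6):
    """Sketch of one non-depth feature extraction level: a single depthwise
    transformation layer between the 1*1 expansion and the 1*1 projection
    (no activation on the output layer)."""
    mid = in_dim * expand
    return nn.Sequential(
        nn.Conv2d(in_dim, mid, 1), nn.ReLU(),
        nn.Conv2d(mid, mid, 3, stride=stride, padding=dilation,
                  dilation=dilation, groups=mid), nn.ReLU(),
        nn.Conv2d(mid, out_dim, 1),
    )

x = torch.randn(1, 64, 16, 16)                      # the 4-level features
print(non_depth_block(64, 144, stride=2)(x).shape)  # torch.Size([1, 144, 8, 8])
```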
  • using a depth feature extraction model and a non-depth feature extraction model, selected according to the depth of feature extraction on the detection image, greatly reduces the computation of the feature extraction operation.
  • the value of n can be set according to user requirements: if the computation of feature extraction is too large, n can be reduced appropriately; if the accuracy of feature extraction needs to be improved, n can be increased appropriately.
  • the target detection module 74 performs an information regression operation on the a-level to (m+n)-level features of the detection image based on a preset feature prediction model, thereby obtaining the target type and target position of the detection image, where a is an integer less than n and greater than or equal to 2.
  • the feature prediction model functions as a regressor that obtains the target type and target position of each target in the detection image.
  • the target type here is identified by a classification probability; for example, a target may have an 80% probability of being a cat and a 20% probability of being a dog.
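  • For instance, raw regressor scores become such probabilities via a softmax, as in this tiny illustration (the score values are made up to reproduce the 80%/20% example):

```python
import torch

# Raw class scores from the regressor, normalised into probabilities.
scores = torch.tensor([1.386, 0.0])   # e.g. [cat, dog]
print(torch.softmax(scores, dim=0))   # approximately tensor([0.8000, 0.2000])
```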
  • the specific target detection process includes:
  • the feature classification unit 101 of the target detection module 74 uses the feature classification convolution layer of the feature prediction model to perform a feature extraction operation on the a-level to (m+n)-level features of the detection image to obtain the classification and recognition features of the detection image.
  • the feature classification convolution layer is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and no activation function. Because the feature sizes of the 1-level to (a-1)-level features of the detection image are large, they generally do not correspond to detected image targets; therefore, all features preceding the a-level features of the detection image are discarded here.
  • the feature classification unit 101 performs the feature extraction operation on the a-level to (m+n)-level features of the detection image to obtain the classification and recognition features of the detection image for the subsequent prediction of the target type and target position of the detection image.
  • the feature classification unit may select, according to user needs, only some of the a-level to (m+n)-level features for the feature extraction operation, thereby further reducing the computation of the feature extraction operation.
  • the feature output unit 102 of the target detection module 74 uses the feature output convolution layer of the feature prediction model to perform a dimension reduction operation on the classification and recognition features of the detection image to obtain the target type and target position of the detection image.
  • the feature output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function. There is no activation function set in the feature output convolution layer to avoid the loss of output features caused by the activation function.
  • the target type output here can be a person, a car, a house, and so on; the output target position can be parameters such as the center coordinates of the target and the length and width of the target box.
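  • A hedged sketch of such a prediction head for one feature level follows; num_classes, the 4-value box encoding (center coordinates plus width and height), and the layer names are illustrative assumptions rather than values fixed by the patent:

```python
import torch
import torch.nn as nn

num_classes, feat_dim = 20, 256   # assumed values for illustration

head = nn.Sequential(
    # feature classification convolution layer: 3*3 depthwise, no activation
    nn.Conv2d(feat_dim, feat_dim, 3, padding=1, groups=feat_dim),
    # feature output convolution layer: 1*1 dimension reduction that regresses
    # class scores plus a box (center x, center y, width, height), no activation
    nn.Conv2d(feat_dim, num_classes + 4, 1),
)

level_feature = torch.randn(1, feat_dim, 2, 2)   # e.g. the 7-level features
print(head(level_feature).shape)                 # torch.Size([1, 24, 2, 2])
```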
  • the depth feature extraction model and the non-depth feature extraction model in the image target detection device of this embodiment adopt different structures, and the first non-linear transformation convolution layer and the second non-linear transformation convolution layer in the depth feature extraction model also adopt different structures. This maximizes the extraction speed of the target features of the detection image and further reduces the demand for configuration resources, thereby allowing the target detection function to be deployed on mobile terminals.
  • FIG. 11 is a schematic diagram of the use of a specific embodiment of the image target detection method and image target detection device according to an embodiment of the present application.
  • the image target detection device of this specific embodiment may be provided in an electronic device, such as a mobile terminal installed with an image target recognition application.
  • such a mobile terminal can quickly extract the target features in an image while placing a low demand on the mobile terminal's own configuration resources.
  • the step of performing image target detection by the image target detection apparatus of this embodiment includes:
  • Step S1101 Obtain a detection image, an n-level depth feature map frame, and an m-level non-depth feature map frame.
  • the depth feature map frameworks include frameworks of different levels with feature sizes and dimensions such as 128 * 128 * 12, 64 * 64 * 24, 32 * 32 * 48, and 16 * 16 * 64, where 128 * 128 refers to the feature size of the feature map framework and 12 refers to its dimension.
  • more depth feature map frameworks can be included here; for example, 64 * 64 * 32 and 64 * 64 * 40 depth feature map frameworks can be added between 64 * 64 * 24 and 32 * 32 * 48.
  • in this specific embodiment, a 4-level depth feature map framework and a 4-level non-depth feature map framework are obtained.
  • Step S1102 Based on the depth feature extraction model, use the i-level depth feature map framework to perform depth feature extraction on the (i-1) -level features of the detection image to obtain the i-level features of the detection image.
  • in this specific embodiment, the detection image is set with 4 levels of depth features; in actual use, the number of depth feature levels of the detection image should be greater than 4.
  • the image target detection device, based on the depth feature extraction model, performs depth feature extraction on the pixels of the detection image (whose size and dimension are 256 * 256 * 3) to obtain the 1-level features of the detection image corresponding to the 1-level depth feature map framework (feature size and dimension 128 * 128 * 12); the image target detection device then performs depth feature extraction on the 1-level features of the detection image to obtain the 2-level features corresponding to the 2-level depth feature map framework (feature size and dimension 64 * 64 * 24); next, it performs depth feature extraction on the 2-level features to obtain the 3-level features corresponding to the 3-level depth feature map framework (feature size and dimension 32 * 32 * 48); finally, it performs depth feature extraction on the 3-level features to obtain the 4-level features corresponding to the 4-level depth feature map framework (feature size and dimension 16 * 16 * 64).
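  • The stated pyramid can be sanity-checked with a short calculation: each depth level halves the feature size while the dimension follows the depth feature map framework (dimensions taken from this embodiment):

```python
# Feature-size progression of the 4 depth levels, starting from the
# 256*256*3 detection image.
dims = [12, 24, 48, 64]
size = 256
for level, dim in enumerate(dims, start=1):
    size //= 2                    # each level halves the spatial size
    print(f"level {level}: {size}*{size}*{dim}")
# level 1: 128*128*12 ... level 4: 16*16*64
```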
  • taking the acquisition of the 3-level features of the detection image as an example, the depth feature extraction process includes:
  • the image target detection device uses the depth input convolution layer to perform a dimension-raising operation on the 2-level features of the detection image (feature size and dimension 64 * 64 * 24) to obtain the 3-level raised-dimension features of the detection image (feature size and dimension 64 * 64 * 144).
  • the depth input convolution layer here is a standard convolution layer Conv1 with a 1 * 1 convolution kernel size and a non-linear activation function Relu.
  • the image target detection device uses the depth first non-linear transformation convolution layer to perform the first feature extraction operation on the 3-level raised-dimension features of the detection image to obtain the 3-level first convolution features of the detection image (feature size and dimension 32 * 32 * 144).
  • the depth first non-linear transformation convolution layer here is a depthwise separable convolution layer Dwise2 with a 3 * 3 convolution kernel size and a non-linear activation function Relu; since the feature size of the 3-level first convolution features is reduced, the convolution stride of the depthwise separable convolution layer Dwise2 is 2.
  • the image target detection device uses the depth second non-linear transformation convolution layer to perform the second feature extraction operation on the 3-level first convolution features of the detection image to obtain the 3-level second convolution features of the detection image (feature size and dimension 32 * 32 * 144).
  • the depth second non-linear transformation convolution layer here is a depthwise separable dilated convolution layer Dwise3 with a 3 * 3 convolution kernel size and a non-linear activation function Relu, where the dilation rate of the depthwise separable dilated convolution layer Dwise3 is 2.
  • the image target detection device uses the depth output convolution layer to perform a dimension-reduction operation on the 3-level second convolution features of the detection image to obtain the 3-level features of the detection image (feature size and dimension 32 * 32 * 48).
  • the depth output convolution layer here is a standard convolution layer Conv4 with a 1 * 1 convolution kernel size and no activation function.
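  • Collecting the four layers above, the level-3 pass can be written out with the stated layer names and shapes; the padding values are assumptions chosen to reproduce the stated feature sizes:

```python
import torch
import torch.nn as nn

# Level-3 depth feature extraction pass (64*64*24 -> 32*32*48).
level3 = nn.Sequential(
    nn.Conv2d(24, 144, 1), nn.ReLU(),                                     # Conv1
    nn.Conv2d(144, 144, 3, stride=2, padding=1, groups=144), nn.ReLU(),   # Dwise2
    nn.Conv2d(144, 144, 3, padding=2, dilation=2, groups=144), nn.ReLU(),  # Dwise3
    nn.Conv2d(144, 48, 1),                                                # Conv4
)

print(level3(torch.randn(1, 24, 64, 64)).shape)  # torch.Size([1, 48, 32, 32])
```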
  • Step S1103, based on the non-depth feature extraction model, use the j-level non-depth feature map framework to perform non-depth feature extraction on the (j-1+n)-level features of the detection image to obtain the (j+n)-level features of the detection image.
  • in this specific embodiment, 4 levels of non-depth features are set for the detection image; that is, a total of 8 feature levels are set for the detection image. In actual use, the number of non-depth feature levels of the detection image should be greater than 4.
  • the image target detection device performs non-depth feature extraction on the 4-level features of the detection image based on the non-depth feature extraction model to obtain the 5-level features of the detection image corresponding to the 1-level non-depth feature map framework (feature size and dimension 8 * 8 * 144); it then performs non-depth feature extraction on the 5-level features to obtain the 6-level features corresponding to the 2-level non-depth feature map framework (feature size and dimension 4 * 4 * 256); next, it performs non-depth feature extraction on the 6-level features to obtain the 7-level features corresponding to the 3-level non-depth feature map framework (feature size and dimension 2 * 2 * 256); finally, it performs non-depth feature extraction on the 7-level features to obtain the 8-level features corresponding to the 4-level non-depth feature map framework (feature size and dimension 1 * 1 * 256).
  • taking the acquisition of the 7-level features of the detection image as an example, the non-depth feature extraction process includes:
  • the image target detection device uses the non-depth input convolution layer to perform a dimension-raising operation on the 6-level features of the detection image (feature size and dimension 4 * 4 * 256) to obtain the 7-level raised-dimension features of the detection image (feature size and dimension 4 * 4 * 1536).
  • the non-depth input convolution layer here is a standard convolution layer Conv5 with a 1 * 1 convolution kernel size and a non-linear activation function Relu.
  • the image target detection device uses the non-depth non-linear transformation convolution layer to perform a feature extraction operation on the 7-level raised-dimension features of the detection image to obtain the 7-level convolution features of the detection image (feature size and dimension 2 * 2 * 1536).
  • the non-depth non-linear transformation convolution layer here is a depthwise separable dilated convolution layer Dwise6 with a 3 * 3 convolution kernel size and a non-linear activation function Relu; since the feature size of the 7-level convolution features is reduced, the convolution stride of the depthwise separable dilated convolution layer Dwise6 is 2, and its dilation rate is 2.
  • the image target detection device uses the non-depth output convolution layer to perform a dimension-reduction operation on the 7-level convolution features of the detection image to obtain the 7-level features of the detection image (feature size and dimension 2 * 2 * 256).
  • the non-depth output convolution layer here is a standard convolution layer Conv7 with a 1 * 1 convolution kernel size and no activation function.
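  • The corresponding sketch of this non-depth pass (Conv5, Dwise6, Conv7; 4 * 4 * 256 to 2 * 2 * 256) follows; padding=2 is an assumption chosen so the stride-2, dilation-2 depthwise layer reproduces the stated 2 * 2 output size:

```python
import torch
import torch.nn as nn

# Non-depth pass for the 7-level features (4*4*256 -> 2*2*256).
level7 = nn.Sequential(
    nn.Conv2d(256, 1536, 1), nn.ReLU(),                  # Conv5
    nn.Conv2d(1536, 1536, 3, stride=2, padding=2,
              dilation=2, groups=1536), nn.ReLU(),       # Dwise6
    nn.Conv2d(1536, 256, 1),                             # Conv7
)

print(level7(torch.randn(1, 256, 4, 4)).shape)  # torch.Size([1, 256, 2, 2])
```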
  • Step S1104, based on the feature prediction model, perform an information regression operation on the 3-level to 8-level features of the detection image to obtain the target type and target position of the detection image.
  • the process of obtaining the target type and target position of the detection image includes:
  • the image target detection device uses the feature classification convolution layer to perform a feature extraction operation on the 3-level to 8-level features of the detection image to obtain the classification and recognition features of the detection image.
  • the feature classification convolution layer here is a depthwise separable convolution layer with a 3 * 3 convolution kernel size and no activation function.
  • the image target detection device uses the feature output convolution layer to perform a dimension-reduction operation on the classification and recognition features of the detection image to obtain the target type and target position of the detection image.
  • the feature output convolution layer is a standard convolution layer with a 1 * 1 convolution kernel size and no activation function.
  • the target detection process of the image target detection method and image target detection device of this specific embodiment simultaneously optimizes three parts, namely the depth feature extraction model, the non-depth feature extraction model, and the feature prediction model, so that the feature extraction model is reduced from the original 100 Mb to less than 1 Mb, and the running speed is improved by more than 10 times.
  • the image target detection method, device, and storage medium of the embodiments of the present application extract and identify multiple features of different sizes based on the depth feature extraction model and the non-depth feature extraction model. Since the small-size features of the detection image can be extracted directly on the basis of its large-size features, the overall feature extraction is faster and the demand for configuration resources is lower. This effectively solves the technical problem that existing image target detection methods and devices run slowly and cannot be implemented on mobile terminals with smaller resource configurations.
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable application, a thread of execution, a program, and / or a computer.
  • both an application running on a controller and the controller itself can be a component.
  • One or more components can reside within a process and / or thread of execution and a component may be localized on one computer and / or distributed between two or more computers.
  • An embodiment of the present application further provides an electronic device, including one or more processors and a storage device. The storage device is configured to store one or more executable program instructions, and the one or more processors are configured to execute the one or more executable program instructions in the storage device to implement the image target detection method described in the foregoing embodiments.
  • FIG. 12 and the following discussion provide a brief, general description of the working environment of an electronic device in which the image target detection device described in the embodiments of the present application may be implemented.
  • the working environment of FIG. 12 is only one example of a suitable working environment and is not intended to suggest any limitation as to the scope of the use or function of the working environment.
  • Example electronic devices 1212 include, but are not limited to, wearable devices, head-mounted devices, medical and health platforms, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), media players, and the like), multiprocessor systems, consumer electronic devices, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
  • Computer-readable instructions may be distributed via computer-readable media (discussed below).
  • Computer-readable instructions can be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the functions of the computer-readable instructions can be freely combined or distributed in various environments.
  • FIG. 12 illustrates an example of an electronic device 1212 including one or more embodiments of the image target detection device of the present application.
  • the electronic device 1212 includes at least one processing unit 1216 and a memory 1218.
  • the memory 1218 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. This configuration is illustrated in FIG. 12 by a dotted line 1214.
  • the processing unit 1216 may be a processor, such as a CPU.
  • the electronic device 1212 may include additional features and / or functions.
  • the device 1212 may also include additional storage devices (such as removable and / or non-removable), including but not limited to magnetic storage devices, optical storage devices, and the like.
  • Such an additional storage device is illustrated in FIG. 12 by storage device 1220.
  • computer-readable instructions for implementing one or more embodiments provided herein may be stored in the storage device 1220.
  • the storage device 1220 may also store other computer-readable instructions for implementing an operating system, application programs, and the like. Computer readable instructions may be loaded into memory 1218 and executed by, for example, processing unit 1216.
  • Computer-readable medium includes computer storage media.
  • the computer-readable medium may be included in the electronic device described in the foregoing embodiments; or it may exist alone without being assembled into the electronic device.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, processor-executable instructions, or other data.
  • when the computer-readable instructions stored on the computer-readable medium are executed, the electronic device is caused to implement the image target detection method described in the above embodiments.
  • the memory 1218 and the storage device 1220 are examples of a computer storage medium.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical storage devices, cassettes, magnetic tapes, disk storage devices or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the electronic device 1212. Any such computer storage medium may be part of the electronic device 1212.
  • the electronic device 1212 may also include a communication connection 1226 that allows the electronic device 1212 to communicate with other devices.
  • the communication connection 1226 may include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter / receiver, an infrared port, a USB connection, or other interface for connecting the electronic device 1212 to other electronic devices.
  • the communication connection 1226 may include a wired connection or a wireless connection.
  • the communication connection 1226 may transmit and / or receive a communication medium.
  • Computer-readable medium may include communication media.
  • Communication media typically contains computer readable instructions or other data in a “modulated data signal” such as a carrier wave or other transmission mechanism, and includes any information delivery media.
  • a "modulated data signal" can include a signal in which one or more of its characteristics are set or changed in such a manner as to encode information into the signal.
  • the electronic device 1212 may include an input device 1224, such as a keyboard, a mouse, a pen, a voice input device, a touch input device, an infrared camera, a video input device, and / or any other input device.
  • the device 1212 may also include an output device 1222, such as one or more displays, speakers, printers, and / or any other output device.
  • the input device 1224 and the output device 1222 may be connected to the electronic device 1212 via a wired connection, a wireless connection, or any combination thereof. In one embodiment, an input device or output device from another electronic device may be used as the input device 1224 or output device 1222 of the electronic device 1212.
  • the components of the electronic device 1212 may be connected through various interconnections, such as a bus.
  • Such interconnects may include Peripheral Component Interconnect (PCI) (such as Express PCI), Universal Serial Bus (USB), FireWire (IEEE 1394), optical bus structures, and so on.
  • the components of the electronic device 1212 may be interconnected through a network.
  • the memory 1218 may be composed of multiple physical memory units located in different physical locations and interconnected through a network.
  • storage devices used to store computer readable instructions may be distributed across a network.
  • the electronic device 1230 accessible via the network 1228 may store computer-readable instructions for implementing one or more embodiments provided by the present application.
  • the electronic device 1212 may access the electronic device 1230 and download some or all of the computer-readable instructions for execution.
  • the electronic device 1212 may download multiple computer-readable instructions as needed, or some instructions may be executed at the electronic device 1212 and some instructions may be executed at the electronic device 1230.
  • the one or more operations may constitute computer-readable instructions stored on one or more computer-readable media, which when executed by an electronic device will cause a computing device to perform the operations.
  • the order in which some or all of the operations are described should not be interpreted as implying that these operations are necessarily order-dependent. Those skilled in the art, having the benefit of this description, will appreciate alternative orderings. Moreover, it should be understood that not all operations need to be present in every embodiment provided herein.
  • Each functional unit in the embodiment of the present application may be integrated into one processing module, or each unit may exist separately physically, or two or more units may be integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or software functional modules. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the aforementioned storage medium may be a read-only memory, a magnetic disk, or an optical disk.


Abstract

An embodiment of the present application provides an image target detection method, including: acquiring a detection image, an n-level depth feature map framework, and an m-level non-depth feature map framework, where the feature map frameworks include the output feature sizes and dimensions; based on a depth feature extraction model, using the i-level depth feature map framework to perform depth feature extraction on the (i-1)-level features of the detection image to obtain the i-level features of the detection image; based on a non-depth feature extraction model, using the j-level non-depth feature map framework to perform non-depth feature extraction on the (j-1+n)-level features of the detection image to obtain the (j+n)-level features of the detection image; and based on a feature prediction model, performing an information regression operation on the a-level to (m+n)-level features of the detection image to obtain the target type and target position of the detection image. Embodiments of the present application also provide an image target detection device, a storage medium, and an electronic device. The overall feature extraction of the embodiments of the present application is fast, and the demand for configuration resources is low.

在本实施例中非深度特征提取模型包括非深度输入卷积层、 非深度非线性转换 卷积层以及非深度输出卷积层。
非深度升维操作单元 91 用于使用非深度输入卷积层, 对检测图像的 (j-1+n) 级特征进行升维操作, 以得到检测图像的 (j+n)级升维特征; 非深度特征提取单元 92用于使用非深度非线性转换卷积层, 对检测图像的(j+n)级升维特征进行特征提 取操作, 以得到检测图像的 (j+n) 级卷积特征; 非深度降维操作单元 93用于使用 非深度输出卷积层, 对检测图像的 ( j+n)级卷积特征进行降维操作, 以得到检测图 像的 (j+n) 级特征。 请参照图 10, 图 10为本申请一实施例的图 7所示的图像目标检测装置的目标 检测模块的结构示意图。该目标检测模块 74包括特征分类单元 101以及特征输出单 元 102。
在本实施例中特征预测模型包括特征分类卷积层以及特征输出卷积层。
特征分类单元 101用于使用特征分类卷积层, 对检测图像的 a级特征至( m+n) 级特征进行特征提耳又操作, 以得到检测图像的分类识别特征; 特征输出单元 102用 于使用特征输出卷积层, 对检测图像的分类识别特征进行降维操作, 以得到检测图 像的目标类型以及位置。
本实施例的图像目标检测装置 70使用时, 首先图像以及框架获取模块 71获取 需要进行目标检测的检测图像, 以及对该检测图像进行目标检测的 n级深度特征图 框架以及 m级非深度特征图框架。 在本实施例中 n为大于等于 2的整数, m为大于 等于 1的整数。 即检测图像至少要进行 3次特征提取操作。
这里深度特征图框架是对检测图像或检测图像对应的下级特征进行特征识别的 识别参数, 该深度特征图框架可包括每个深度特征级别输出的特征尺寸以及维度。 非深度特征框架是对检测图像对应的下级特征进行特征识别的识别参数, 该非深度 特征图框架可包括每个非深度特征级别输出的特征尺寸以及维度。
随后深度特征提取模块 72基于预设的深度特征提取模型, 使用 i级深度特征图 框架对检测图像的 (i-1) 级特征进行深度特征提取, 以获耳又检测图像的 i级特征, 其中 i为小于等于 n的正整数。
具体的深度特征提取流程包括:
深度特征提取模块 72的深度升维操作单元 81使用深度特征提取模型的深度输 入卷积层, 对检测图像的 (i-1) 级特征进行升维操作, 以得到检测图像的 i级升维 特征。
其中深度输入卷积层为具有 1*1卷积核尺寸以及具有非线性激活函数的标准卷 积层, 其中深度输入卷积层可设置较大的通道数, 如 4-6等。 这样可在保证检测图 像的特征尺寸的情况下, 增加输入特征的维度, 从而解决检测图像的特征丟失的问 题。
深度输入卷积层的通道数用于表示从检测图像的低级特征上进行特征提取的特 征提取模式的数量, 深度输入卷积层的卷积核尺寸用于调整深度神经网络模型的复 杂度。 如输入的检测图像的 ( i-1 ) 级特征为 32*32*3的特征点矩阵, 其中 3为检测图 像的输入通道数, 如红色的像素亮度值、 蓝色的像素亮度值以及绿色的像素亮度值 等; 设定深度输入卷积层的卷积核尺寸为 1*1, 则该深度输入卷积层的输出特征尺 寸为 32*32, 即使用 1*1的卷积核依次遍历 32*32的特征点矩阵可得到 32*32尺寸 的特征图, 如深度输入卷积层的通道数为 6, 则得到的深度输入卷积层的输出为 32*32*18的 i级升维特征。 这样在不改变输出特征尺寸的情况下得到了维度更高的 检测图像的升维特征。
随后深度输入卷积层会使用非线性激活函数,如线性整流函数( ReLU, Rectified Linear Unit )等对输出的 i级升维特征进行非线性处理, 以保证深度输入卷积层的输 出是可微的, 从而提高后续输出特征的准确性。
深度特征提取模块 72的第一深度特征提取单元 82使用深度特征提取模型的深 度第一非线性转换卷积层, 对检测图像的 i级升维特征进行第一特征提取操作, 以 得到检测图像的 i级第一卷积特征。
其中第一非线性转换卷积层为具有 3*3卷积核尺寸以及具有非线性激活函数的 深度可分离卷积层, 其中深度可分离卷积层的设置可使得第一非线性转换卷积层的 运算量大幅度减少, 进而使得深度特征提取模型的大小也大幅度减小。
其中深度可分离卷积层 ( depthwise separable convolution ) 可在保持通道分离的 前提下, 实现空间卷积。 如 3*3卷积核尺寸的标准卷积层, 输入通道数为 16, 输出 通道数为 32, 则 32个 3*3 大小的卷积核遍历 16个通道中的每个数据, 需要设置 16*32*3*3=4608个参数进行卷积运算。 如 3*3卷积核尺寸的深度可分离卷积层, 用 1个 3*3尺寸的卷积核遍历 16个通道的数据, 得到 16个特征图谱, 然后使用 32个 1*1尺寸的卷积核遍历这 16个特征图谱, 这样只需要设置 16*3*3+16+32+1+1=656 个参数就能完成卷积运算。
深度第一非线性转换卷积层首先对检测图像的 i级升维特征进行第一特征提取 操作, 随后深度第一非线性转换卷积层会使用非线性激活函数, 如线性整流函数 ( ReLU, Rectified Linear Unit ) 等对输出的 i级第一卷积特征进行非线性处理, 以 保证深度第一非线性转换卷积层的输出是可微的,从而提高后续输出特征的准确性。
深度特征提取模块 72的第二深度特征提取单元 83使用深度特征提取模型的深 度第二非线性转换卷积层, 对检测图像的 i级第一卷积特征进行第二特征提取操作, 以得到检测图像的 i级第二卷积特征。 其中深度第二非线性转换卷积层为具有 3*3卷积核尺寸以及具有非线性激活函 数的深度可分离空洞卷积层 ( atrous convolutions ) , 其中深度可分离空洞卷积层的 设置可使得第二非线性转换卷积层的运算量大幅度减少的同时, 还可增加检测图像 的每个特征基本单元的感受野 从而进一步提高了第二非线性转换卷积层输出的 i 级第二卷积特征的准确性。
其中空洞卷积可在卷积操作中设置一 “扩展率 ( dilation rate ) ” 的参数, 该扩 展率定义卷积层处理数据时各个数据之间的间距。如 5*5卷积核尺寸的标准卷积层, 需要设置 25个参数; 但是如果设置 3*3的卷积核尺寸且扩展率为 2的空洞卷积层, 仅仅只需要设置 9个参数, 即在 5*5尺寸的卷积核的基础上, 每隔一行删除一行数 据以及每隔一列删除一列数据。 因此在相同的计算条件下, 空洞卷积层可在不增加 运算量的情况下提供更大的感受野。
这里将空洞卷积层设置在深度第二非线性转换卷积层, 可以在深度第一非线性 转换卷积层已经进行初步深度特征提取的基础上, 使用较少的资源再次进行深度特 征提取, 可以较好的弥补第一特征提取操作中的感受野较小的问题。
在本步骤中, 深度第二非线性转换卷积层首先对检测图像的 i级第一卷积特征 进行第二特征提取操作, 随后深度第二非线性转换卷积层会使用非线性激活函数, 如线性整流函数 ( ReLU, Rectified Linear Unit ) 等对输出的 i级第二卷积特征进行 非线性处理, 以保证深度第二非线性转换卷积层的输出是可微的, 从而提高后续输 出特征的准确性。
深度特征提取模块 72的深度降维操作单元 84使用深度特征提取模型的深度输 出卷积层,对检测图像的 i级第二卷积特征进行降维操作, 以得到检测图像的 i级特 征。
其中深度输出卷积层为具有 1*1 卷积核尺寸以及不具有激活函数的标准卷积 层。 这里深度输出卷积层可将增加的维度恢复至输入到深度输入卷积层的维度; 且 在深度输出卷积层中没有设置激活函数, 以避免激活函数导致的输出特征的丟失。 深度输出卷积层输出的检测图像的 i级特征应该与 i级深度特征图框架符合。
这样即完成了使用 i级深度特征图框架对检测图像的 ( i-1 ) 级特征进行深度特 征提取, 以获取检测图像的 i级特征的过程。 重复上述升维操作、 第一特征提取操 作、 第二特征提取操作以及降维操作, 可获取检测图像的 1级特征至 n级特征。
The non-deep feature extraction module 73 then performs, based on a preset non-deep feature extraction model, non-deep feature extraction on the (j-1+n)-level feature of the detection image by using the j-level non-deep feature map framework, to acquire the (j+n)-level feature of the detection image, where j is a positive integer less than or equal to m.

The non-deep feature extraction flow is specifically as follows:

The non-deep dimension-raising operation unit 91 of the non-deep feature extraction module 73 uses the non-deep input convolution layer of the non-deep feature extraction model to perform a dimension-raising operation on the (j-1+n)-level feature of the detection image, to obtain the (j+n)-level raised-dimension feature of the detection image.

The non-deep input convolution layer is a standard convolution layer with a 1*1 convolution kernel size and a non-linear activation function, and may likewise be given a relatively large channel expansion factor, for example 4 to 6. This increases the dimensionality of the input feature while preserving the feature size of the detection image, thereby addressing the problem of detection-image features being lost.

The non-deep input convolution layer then applies a non-linear activation function, such as a ReLU, to the output (j+n)-level raised-dimension feature, so that the output of the non-deep input convolution layer remains differentiable, improving the accuracy of subsequent output features.

The non-deep feature extraction unit 92 of the non-deep feature extraction module 73 uses the non-deep non-linear transform convolution layer of the non-deep feature extraction model to perform a feature extraction operation on the (j+n)-level raised-dimension feature of the detection image, to obtain the (j+n)-level convolution feature of the detection image.

The non-deep non-linear transform convolution layer is a depthwise separable convolution layer with a 3*3 convolution kernel size and a non-linear activation function. The depthwise separable structure greatly reduces the computation of the non-deep non-linear transform convolution layer and, in turn, the size of the feature extraction model. The non-deep non-linear transform convolution layer here may also be a depthwise separable atrous convolution layer.

Because the input to the non-deep non-linear transform convolution layer is derived directly from features already output by the deep non-linear transform convolution layers, the non-deep feature extraction model needs only a single non-deep non-linear transform convolution layer for feature extraction, rather than multiple non-linear transform convolution layers.

In this step, the non-deep non-linear transform convolution layer first performs the feature extraction operation on the (j+n)-level raised-dimension feature of the detection image, and then applies a non-linear activation function such as a ReLU to the output (j+n)-level convolution feature, so that the output of the layer remains differentiable, improving the accuracy of subsequent output features.

The non-deep dimension-reducing operation unit 93 of the non-deep feature extraction module 73 uses the non-deep output convolution layer of the non-deep feature extraction model to perform a dimension-reducing operation on the (j+n)-level convolution feature of the detection image, to obtain the (j+n)-level feature of the detection image.

The non-deep output convolution layer is a standard convolution layer with a 1*1 convolution kernel size and no activation function. It restores the previously raised dimensionality to the dimensionality originally input to the non-deep input convolution layer; no activation function is set in the non-deep output convolution layer, to avoid the loss of output features that an activation function would cause. The (j+n)-level feature of the detection image output by the non-deep output convolution layer should conform to the j-level non-deep feature map framework.

This completes the process of performing non-deep feature extraction on the (j-1+n)-level feature of the detection image using the j-level non-deep feature map framework to acquire the (j+n)-level feature of the detection image. Repeating the above dimension-raising, feature extraction, and dimension-reducing operations yields the (n+1)-level to (m+n)-level features of the detection image.
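Analogously, one possible sketch of a non-deep extraction step, with a single, optionally atrous, depthwise transform between the 1*1 expansion and reduction layers. It again assumes PyTorch; names and default values are illustrative.

```python
import torch
import torch.nn as nn

class LightFeatureBlock(nn.Module):
    # expand (1*1, ReLU) -> single depthwise 3*3 transform (ReLU) -> reduce (1*1, no activation).
    def __init__(self, c_in, c_out, expand=6, stride=2, dilation=1):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1), nn.ReLU(inplace=True),          # non-deep input convolution layer
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=dilation,
                      dilation=dilation, groups=c_mid),                # single non-deep non-linear transform layer
            nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, 1),                                # non-deep output convolution layer, no activation
        )

    def forward(self, x):
        return self.block(x)

# 6-level feature (4*4*256) -> 7-level feature (2*2*256), as in the example later
y = LightFeatureBlock(256, 256, dilation=2)(torch.randn(1, 256, 4, 4))
print(y.shape)  # torch.Size([1, 256, 2, 2])
```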
By employing a deep feature extraction model and a non-deep feature extraction model according to the feature extraction depth of the detection image, this embodiment greatly reduces the computation of the feature extraction operations. The value of n may be set according to the user's requirements: if the feature extraction computation is too heavy, n may be lowered appropriately; if the feature extraction accuracy needs to be improved, n may be raised appropriately.

Finally, the object detection module 74 performs, based on a preset feature prediction model, an information regression operation on the a-level to (m+n)-level features of the detection image, to acquire the object type and object position in the detection image, where a is an integer less than n and greater than or equal to 2. The feature prediction model here acts as a regressor for acquiring the object type and object position of the object in the detection image, the object type being identified by classification probabilities; for example, a given object has an 80% probability of being a cat and a 20% probability of being a dog.

The object detection flow is specifically as follows:

The feature classification unit 101 of the object detection module 74 uses the feature classification convolution layer of the feature prediction model to perform a feature extraction operation on the a-level to (m+n)-level features of the detection image, to obtain the classification recognition features of the detection image.

The feature classification convolution layer is a depthwise separable convolution layer with a 3*3 convolution kernel size and no activation function. Because the 1-level to (a-1)-level features of the detection image have large feature sizes and generally do not correspond to the image objects to be detected, all features preceding the a-level feature of the detection image are discarded here.

The feature classification unit 101 performs the feature extraction operation using the a-level to (m+n)-level features of the detection image, thereby acquiring the classification recognition features of the detection image for the subsequent prediction of the object type and object position in the detection image.

Specifically, the feature classification unit may select, according to the user's needs, a subset of the a-level to (m+n)-level features for the feature extraction operation, further reducing the computation of the feature extraction operation.

The feature output unit 102 of the object detection module 74 uses the feature output convolution layer of the feature prediction model to perform a dimension-reducing operation on the classification recognition features of the detection image, to obtain the object type and object position in the detection image.

The feature output convolution layer is a standard convolution layer with a 1*1 convolution kernel size and no activation function; no activation function is set in the feature output convolution layer, to avoid the loss of output features that an activation function would cause.
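One possible sketch of such a prediction head, applied to a single feature level. It assumes PyTorch; the numbers of anchors, classes, and box parameters are illustrative assumptions, since this description does not fix them.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    # 3*3 depthwise feature classification convolution layer (no activation),
    # followed by a 1*1 standard feature output convolution layer (no activation).
    def __init__(self, c_in, num_outputs):
        super().__init__()
        self.classify = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.output = nn.Conv2d(c_in, num_outputs, 1)

    def forward(self, x):
        return self.output(self.classify(x))

# e.g. on the 16*16*64 4-level feature: 6 anchors * (2 class scores + 4 box values) per location
head = PredictionHead(64, num_outputs=6 * (2 + 4))
print(head(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 36, 16, 16])
```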
The output object type may be an item such as a person, a vehicle, or a house, and the output object position may be parameters such as the center coordinates of the object and the width and height of the object bounding box.

This completes the image object detection process of the image object detection apparatus 70 of this embodiment.

On the basis of the embodiment shown in FIG. 6, the deep feature extraction model and the non-deep feature extraction model of the image object detection apparatus of this embodiment use different structures, and the first non-linear transform convolution layer and the second non-linear transform convolution layer within the deep feature extraction model also use different structures. This maximizes the extraction speed of the object features of the detection image and further reduces the demand on configuration resources, making the object detection function practical on mobile terminals.

The working principle of the image object detection method and apparatus of this application is described below through a specific embodiment. Referring to FIG. 11, FIG. 11 is a schematic diagram of the use of a specific embodiment of the image object detection method and apparatus according to an embodiment of this application.

The image object detection apparatus of this specific embodiment may be provided in an electronic device, for example a mobile terminal on which an image object recognition application is installed. The mobile terminal can rapidly extract the object features in an image while placing low demands on its own configuration resources. The image object detection steps of this specific embodiment are as follows:

Step S1101: acquire a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework. In this embodiment, the deep feature map frameworks include frameworks of different levels with feature sizes and dimensionalities of 128*128*12, 64*64*24, 32*32*48, 16*16*64, and so on, where 128*128 is the feature size of the feature map framework and 12 is its dimensionality. More deep feature map frameworks may of course be included; for example, frameworks such as 64*64*32 and 64*64*40 may be added between 64*64*24 and 32*32*48. In this embodiment, 4-level deep feature map frameworks and 4-level non-deep feature map frameworks are acquired.
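For illustration only, the feature map frameworks of this embodiment can be written down as (height, width, dimensionality) tuples; the non-deep shapes below are taken from step S1103 of this embodiment, and the plain-list representation is our assumption, not part of this application.

```python
DEEP_FRAMEWORKS = [(128, 128, 12), (64, 64, 24), (32, 32, 48), (16, 16, 64)]  # n = 4 deep levels
LIGHT_FRAMEWORKS = [(8, 8, 144), (4, 4, 256), (2, 2, 256), (1, 1, 256)]       # m = 4 non-deep levels

for level, (h, w, c) in enumerate(DEEP_FRAMEWORKS + LIGHT_FRAMEWORKS, start=1):
    print(f"level {level}: {h}*{w}*{c}")
```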
Step S1102: based on the deep feature extraction model, perform deep feature extraction on the (i-1)-level feature of the detection image using the i-level deep feature map framework, to acquire the i-level feature of the detection image. In this embodiment the detection image is given 4 deep feature levels; in actual use, the number of deep feature levels of the detection image should be greater than 4.

Based on the deep feature extraction model, the image object detection apparatus performs deep feature extraction on the pixels of the detection image (of size and dimensionality 256*256*3) to acquire the 1-level feature of the detection image corresponding to the 1-level deep feature map framework (size and dimensionality 128*128*12); it then performs deep feature extraction on the 1-level feature to acquire the 2-level feature corresponding to the 2-level deep feature map framework (size and dimensionality 64*64*24); then on the 2-level feature to acquire the 3-level feature corresponding to the 3-level deep feature map framework (size and dimensionality 32*32*48); and finally on the 3-level feature to acquire the 4-level feature corresponding to the 4-level deep feature map framework (size and dimensionality 16*16*64).

Taking the acquisition of the 3-level feature of the detection image as an example, the deep feature extraction flow is as follows:

The image object detection apparatus uses the deep input convolution layer to perform a dimension-raising operation on the 2-level feature of the detection image (size and dimensionality 64*64*24), to obtain the 3-level raised-dimension feature of the detection image (size and dimensionality 64*64*144). The deep input convolution layer here is a standard convolution layer Conv1 with a 1*1 convolution kernel size and the non-linear activation function ReLU.

The image object detection apparatus uses the first non-linear transform convolution layer to perform the first feature extraction operation on the 3-level raised-dimension feature, to obtain the 3-level first convolution feature of the detection image (size and dimensionality 32*32*144). The first non-linear transform convolution layer here is a depthwise separable standard convolution layer Dwise2 with a 3*3 convolution kernel size and the non-linear activation function ReLU; because the size of the 3-level first convolution feature is reduced, the convolution stride of the depthwise separable standard convolution layer Dwise2 is 2.

The image object detection apparatus uses the second non-linear transform convolution layer to perform the second feature extraction operation on the 3-level first convolution feature, to obtain the 3-level second convolution feature of the detection image (size and dimensionality 32*32*144). The second non-linear transform convolution layer here is a depthwise separable atrous convolution layer Dwise3 with a 3*3 convolution kernel size and the non-linear activation function ReLU, the dilation rate of the depthwise separable atrous convolution layer Dwise3 being 2.

The image object detection apparatus uses the deep output convolution layer to perform a dimension-reducing operation on the 3-level second convolution feature, to obtain the 3-level feature of the detection image (size and dimensionality 32*32*48). The deep output convolution layer here is a standard convolution layer Conv4 with a 1*1 convolution kernel size and no activation function.

Step S1103: based on the non-deep feature extraction model, perform non-deep feature extraction on the (j-1+n)-level feature of the detection image using the j-level non-deep feature map framework, to acquire the (j+n)-level feature of the detection image. In this embodiment the detection image is given 4 non-deep feature levels, that is, 8 feature levels in total; in actual use, the number of non-deep feature levels of the detection image should be greater than 4.

Based on the non-deep feature extraction model, the image object detection apparatus performs non-deep feature extraction on the 4-level feature of the detection image to acquire the 5-level feature corresponding to the 1-level non-deep feature map framework (size and dimensionality 8*8*144); it then performs non-deep feature extraction on the 5-level feature to acquire the 6-level feature corresponding to the 2-level non-deep feature map framework (size and dimensionality 4*4*256); then on the 6-level feature to acquire the 7-level feature corresponding to the 3-level non-deep feature map framework (size and dimensionality 2*2*256); and finally on the 7-level feature to acquire the 8-level feature corresponding to the 4-level non-deep feature map framework (size and dimensionality 1*1*256).

Taking the acquisition of the 7-level feature of the detection image as an example, the non-deep feature extraction flow is as follows:

The image object detection apparatus uses the non-deep input convolution layer to perform a dimension-raising operation on the 6-level feature of the detection image (size and dimensionality 4*4*256), to obtain the 7-level raised-dimension feature of the detection image (size and dimensionality 4*4*1536). The non-deep input convolution layer here is a standard convolution layer Conv5 with a 1*1 convolution kernel size and the non-linear activation function ReLU.

The image object detection apparatus uses the non-deep non-linear transform convolution layer to perform the feature extraction operation on the 7-level raised-dimension feature, to obtain the 7-level convolution feature of the detection image (size and dimensionality 2*2*1536). The non-deep non-linear transform convolution layer here is a depthwise separable atrous convolution layer Dwise6 with a 3*3 convolution kernel size and the non-linear activation function ReLU; because the size of the 7-level convolution feature is reduced, the convolution stride of the depthwise separable convolution layer Dwise6 is 2, and the dilation rate of the depthwise separable atrous convolution layer Dwise6 is 2.

The image object detection apparatus uses the non-deep output convolution layer to perform a dimension-reducing operation on the 7-level convolution feature, to obtain the 7-level feature of the detection image (size and dimensionality 2*2*256). The non-deep output convolution layer here is a standard convolution layer Conv7 with a 1*1 convolution kernel size and no activation function.

Step S1104: based on the feature prediction model, perform an information regression operation on the 3-level to 8-level features of the detection image, thereby acquiring the object type and object position in the detection image. The flow for acquiring the object type and object position is as follows:

The image object detection apparatus uses the feature classification convolution layer to perform an information regression operation on the 3-level to 8-level features of the detection image, to obtain the classification recognition features of the detection image. The feature classification convolution layer is a depthwise separable convolution layer with a 3*3 convolution kernel size and no activation function.

The image object detection apparatus uses the feature output convolution layer to perform a dimension-reducing operation on the classification recognition features, to obtain the object type and object position in the detection image. The feature output convolution layer is a standard convolution layer with a 1*1 convolution kernel size and no activation function.

This completes the output of the object type and object position in the image, as shown at 1101 in FIG. 11.

The object detection process of the image object detection method and apparatus of this specific embodiment optimizes all three parts, namely the deep feature extraction model, the non-deep feature extraction model, and the feature prediction model, shrinking the original feature extraction model from 100 MB to less than 1 MB and speeding up its operation by a factor of more than 10.

The image object detection method, apparatus, and storage medium of the embodiments of this application extract and recognize features of multiple different sizes of the same detection image based on a deep feature extraction model and a non-deep feature extraction model. Because the small-size features of the detection image can be extracted directly on the basis of its large-size features, the overall feature extraction is fast and the demand on configuration resources is low, effectively solving the technical problem that existing image object detection methods and apparatuses run slowly and cannot be implemented on mobile terminals with limited resource configurations.
The terms "component", "module", "system", "interface", "process" and the like used in the embodiments of this application are generally intended to refer to computer-related entities: hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable application, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller may be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers.

An embodiment of this application further provides an electronic device, including one or more processors and a storage apparatus, the storage apparatus being configured to store one or more executable program instructions, and the one or more processors being configured to execute the one or more executable program instructions in the storage apparatus, to implement the image object detection method of the foregoing embodiments.

FIG. 12 and the following discussion provide a brief, general description of the working environment of an electronic device in which the image object detection apparatus of the embodiments of this application is implemented. The working environment of FIG. 12 is merely one example of a suitable working environment and is not intended to suggest any limitation on the scope of use or functionality of the working environment. Example electronic devices 1212 include, but are not limited to, wearable devices, head-mounted devices, medical and health platforms, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), and media players), multiprocessor systems, consumer electronic devices, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.

Although not required, the embodiments are described in the general context of "computer-readable instructions" being executed by one or more electronic devices. Computer-readable instructions may be distributed via computer-readable media (discussed below). Computer-readable instructions may be implemented as program modules, such as functions, objects, application programming interfaces (APIs), and data structures that perform specific tasks or implement specific abstract data types. Typically, the functionality of the computer-readable instructions may be combined or distributed as desired in various environments.

FIG. 12 illustrates an example of an electronic device 1212 including one or more embodiments of the image object detection apparatus of this application. In one configuration, the electronic device 1212 includes at least one processing unit 1216 and a memory 1218. Depending on the exact configuration and type of the electronic device, the memory 1218 may be volatile (such as RAM), non-volatile (such as ROM or flash memory), or some combination of the two. This configuration is illustrated by the dashed line 1214 in FIG. 12. The processing unit 1216 may be a processor, for example a CPU.

In other embodiments, the electronic device 1212 may include additional features and/or functionality. For example, the device 1212 may further include additional storage apparatus (for example, removable and/or non-removable), including but not limited to magnetic storage apparatus and optical storage apparatus. Such additional storage apparatus is illustrated in FIG. 12 by the storage apparatus 1220. In one embodiment, computer-readable instructions for implementing one or more embodiments provided herein may be in the storage apparatus 1220, which may also store other computer-readable instructions for implementing an operating system, application programs, and the like. The computer-readable instructions may be loaded into the memory 1218 for execution by, for example, the processing unit 1216.

The term "computer-readable medium" as used herein includes computer storage media. The computer-readable medium may be included in the electronic device described in the foregoing embodiments, or may exist separately without being assembled into the electronic device. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, processor-executable instructions, or other data. When the computer-readable instructions or processor-executable instructions are executed by one or more processors of the electronic device, the electronic device implements the image object detection method of the foregoing embodiments. The memory 1218 and the storage apparatus 1220 are examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVDs) or other optical storage apparatus, cassettes, magnetic tape, magnetic disk storage apparatus or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by the electronic device 1212. Any such computer storage medium may be part of the electronic device 1212.

The electronic device 1212 may further include a communication connection 1226 that allows the electronic device 1212 to communicate with other devices. The communication connection 1226 may include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio-frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting the electronic device 1212 to other electronic devices. The communication connection 1226 may include a wired connection or a wireless connection, and may transmit and/or receive communication media.

The term "computer-readable medium" may include communication media. Communication media typically contain computer-readable instructions or other data in a "modulated data signal" such as a carrier wave or other transmission mechanism, and include any information delivery medium. The term "modulated data signal" may include a signal in which one or more of the signal characteristics are set or changed in such a manner as to encode information into the signal.

The electronic device 1212 may include an input device 1224, such as a keyboard, a mouse, a pen, a voice input device, a touch input device, an infrared camera, a video input device, and/or any other input device. The device 1212 may also include an output device 1222, such as one or more displays, speakers, printers, and/or any other output device. The input device 1224 and the output device 1222 may be connected to the electronic device 1212 via a wired connection, a wireless connection, or any combination thereof. In one embodiment, an input device or an output device from another electronic device may be used as the input device 1224 or the output device 1222 of the electronic device 1212.

The components of the electronic device 1212 may be connected by various interconnections, such as a bus. Such interconnections may include Peripheral Component Interconnect (PCI) (such as PCI Express), Universal Serial Bus (USB), FireWire (IEEE 1394), optical bus structures, and the like. In another embodiment, the components of the electronic device 1212 may be interconnected by a network. For example, the memory 1218 may be composed of multiple physical memory units located in different physical locations and interconnected by a network.

Those skilled in the art will recognize that storage devices for storing computer-readable instructions may be distributed across a network. For example, an electronic device 1230 accessible via a network 1228 may store computer-readable instructions for implementing one or more embodiments provided in this application. The electronic device 1212 may access the electronic device 1230 and download some or all of the computer-readable instructions for execution. Alternatively, the electronic device 1212 may download a plurality of computer-readable instructions as needed, or some instructions may be executed at the electronic device 1212 and some at the electronic device 1230.

Various operations of embodiments are provided herein. In one embodiment, the one or more operations described may constitute computer-readable instructions stored on one or more computer-readable media that, when executed by an electronic device, cause the computing device to perform the operations. The order in which some or all of the operations are described should not be construed as implying that the operations are necessarily order-dependent; those skilled in the art, having the benefit of this description, will appreciate alternative orderings. Moreover, it should be understood that not all operations are necessarily present in every embodiment provided herein.

Moreover, although this application has been shown and described with respect to one or more implementations, those skilled in the art will conceive of equivalent variations and modifications based on reading and understanding this specification and the accompanying drawings. This application includes all such modifications and variations and is limited only by the scope of the appended claims. In particular, regarding the various functions performed by the above components (such as elements and resources), the terms used to describe such components are intended to correspond to any component (unless otherwise indicated) that performs the specified function of the described component (for example, one that is functionally equivalent), even if it is not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of this application shown herein. In addition, although a particular feature of this application may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Moreover, to the extent that the terms "includes", "has", "contains" or variants thereof are used in the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".

The functional units in the embodiments of this application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium. The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Each of the above apparatuses or systems may perform the method in the corresponding method embodiments.

In summary, although this application has been disclosed above by way of embodiments, the sequence numbers preceding the embodiments are used merely for convenience of description and do not limit the order of the embodiments of this application. The above embodiments are not intended to limit this application; a person of ordinary skill in the art may make various changes and refinements without departing from the spirit and scope of this application, and therefore the protection scope of this application is subject to the scope defined by the claims.

Claims

1. An image object detection method, performed by an electronic device, the method comprising:

acquiring a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework, n being an integer greater than or equal to 2, and m being an integer greater than or equal to 1, wherein a feature map framework comprises an output feature size and dimensionality;

performing, based on a deep feature extraction model, deep feature extraction on an (i-1)-level feature of the detection image by using an i-level deep feature map framework, to acquire an i-level feature of the detection image, i being a positive integer less than or equal to n;

performing, based on a non-deep feature extraction model, non-deep feature extraction on a (j-1+n)-level feature of the detection image by using a j-level non-deep feature map framework, to acquire a (j+n)-level feature of the detection image, j being a positive integer less than or equal to m; and

performing, based on a feature prediction model, an information regression operation on a-level to (m+n)-level features of the detection image, to acquire an object type and an object position in the detection image, a being an integer less than n and greater than or equal to 2.

2. The image object detection method according to claim 1, wherein the deep feature extraction model comprises a deep input convolution layer, a deep first non-linear transform convolution layer, a deep second non-linear transform convolution layer, and a deep output convolution layer; and

the step of performing, based on the deep feature extraction model, deep feature extraction on the (i-1)-level feature of the detection image by using the i-level deep feature map framework, to acquire the i-level feature of the detection image comprises:

performing, by using the deep input convolution layer, a dimension-raising operation on the (i-1)-level feature of the detection image, to obtain an i-level raised-dimension feature of the detection image;

performing, by using the deep first non-linear transform convolution layer, a first feature extraction operation on the i-level raised-dimension feature of the detection image, to obtain an i-level first convolution feature of the detection image;

performing, by using the deep second non-linear transform convolution layer, a second feature extraction operation on the i-level first convolution feature of the detection image, to obtain an i-level second convolution feature of the detection image; and

performing, by using the deep output convolution layer, a dimension-reducing operation on the i-level second convolution feature of the detection image, to obtain the i-level feature of the detection image.

3. The image object detection method according to claim 2, wherein the convolution kernel size of the deep input convolution layer is 1*1, the convolution kernel size of the deep first non-linear transform convolution layer is 3*3, the convolution kernel size of the deep second non-linear transform convolution layer is 3*3, and the convolution kernel size of the deep output convolution layer is 1*1; and

the deep input convolution layer is a standard convolution layer with a non-linear activation function, the deep first non-linear transform convolution layer is a depthwise separable convolution layer with a non-linear activation function, the deep second non-linear transform convolution layer is a depthwise separable convolution layer with a non-linear activation function, and the deep output convolution layer is a standard convolution layer without an activation function.

4. The image object detection method according to claim 3, wherein the deep second non-linear transform convolution layer is a depthwise separable atrous convolution layer with a non-linear activation function.

5. The image object detection method according to claim 1, wherein the non-deep feature extraction model comprises a non-deep input convolution layer, a non-deep non-linear transform convolution layer, and a non-deep output convolution layer; and

the step of performing, based on the non-deep feature extraction model, non-deep feature extraction on the (j-1+n)-level feature of the detection image by using the j-level non-deep feature map framework, to acquire the (j+n)-level feature of the detection image comprises:

performing, by using the non-deep input convolution layer, a dimension-raising operation on the (j-1+n)-level feature of the detection image, to obtain a (j+n)-level raised-dimension feature of the detection image;

performing, by using the non-deep non-linear transform convolution layer, a feature extraction operation on the (j+n)-level raised-dimension feature of the detection image, to obtain a (j+n)-level convolution feature of the detection image; and

performing, by using the non-deep output convolution layer, a dimension-reducing operation on the (j+n)-level convolution feature of the detection image, to obtain the (j+n)-level feature of the detection image.

6. The image object detection method according to claim 5, wherein the convolution kernel size of the non-deep input convolution layer is 1*1, the convolution kernel size of the non-deep non-linear transform convolution layer is 3*3, and the convolution kernel size of the non-deep output convolution layer is 1*1; and

the non-deep input convolution layer is a standard convolution layer with a non-linear activation function, the non-deep non-linear transform convolution layer is a depthwise separable convolution layer with a non-linear activation function, and the non-deep output convolution layer is a standard convolution layer without an activation function.

7. The image object detection method according to claim 6, wherein the non-deep non-linear transform convolution layer is a depthwise separable atrous convolution layer with a non-linear activation function.

8. The image object detection method according to claim 1, wherein the feature prediction model comprises a feature classification convolution layer and a feature output convolution layer; and

the step of performing, based on the feature prediction model, the information regression operation on the a-level to (m+n)-level features of the detection image, to acquire the object type and position in the detection image comprises:

performing, by using the feature classification convolution layer, a feature extraction operation on the a-level to (m+n)-level features of the detection image, to obtain classification recognition features of the detection image; and

performing, by using the feature output convolution layer, a dimension-reducing operation on the classification recognition features of the detection image, to obtain the object type and position in the detection image.

9. The image object detection method according to claim 8, wherein the convolution kernel size of the feature classification convolution layer is 3*3, and the convolution kernel size of the feature output convolution layer is 1*1; and

the feature classification convolution layer is a depthwise separable convolution layer without an activation function, and the feature output convolution layer is a standard convolution layer without an activation function.
10. An image object detection apparatus, comprising:

an image and framework acquisition module, configured to acquire a detection image, an n-level deep feature map framework, and an m-level non-deep feature map framework, n being an integer greater than or equal to 2, and m being an integer greater than or equal to 1, wherein a feature map framework comprises an output feature size and dimensionality;

a deep feature extraction module, configured to perform, based on a deep feature extraction model, deep feature extraction on an (i-1)-level feature of the detection image by using an i-level deep feature map framework, to acquire an i-level feature of the detection image, i being a positive integer less than or equal to n;

a non-deep feature extraction module, configured to perform, based on a non-deep feature extraction model, non-deep feature extraction on a (j-1+n)-level feature of the detection image by using a j-level non-deep feature map framework, to acquire a (j+n)-level feature of the detection image, j being a positive integer less than or equal to m; and

an object detection module, configured to perform, based on a feature prediction model, an information regression operation on a-level to (m+n)-level features of the detection image, to acquire an object type and an object position in the detection image, a being an integer less than n and greater than or equal to 2.

11. The image object detection apparatus according to claim 10, wherein the deep feature extraction model comprises a deep input convolution layer, a deep first non-linear transform convolution layer, a deep second non-linear transform convolution layer, and a deep output convolution layer; and

the deep feature extraction module comprises:

a deep dimension-raising operation unit, configured to use the deep input convolution layer to perform a dimension-raising operation on the (i-1)-level feature of the detection image, to obtain an i-level raised-dimension feature of the detection image;

a first deep feature extraction unit, configured to use the deep first non-linear transform convolution layer to perform a first feature extraction operation on the i-level raised-dimension feature of the detection image, to obtain an i-level first convolution feature of the detection image;

a second deep feature extraction unit, configured to use the deep second non-linear transform convolution layer to perform a second feature extraction operation on the i-level first convolution feature of the detection image, to obtain an i-level second convolution feature of the detection image; and

a deep dimension-reducing operation unit, configured to use the deep output convolution layer to perform a dimension-reducing operation on the i-level second convolution feature of the detection image, to obtain the i-level feature of the detection image.

12. The image object detection apparatus according to claim 10, wherein the non-deep feature extraction model comprises a non-deep input convolution layer, a non-deep non-linear transform convolution layer, and a non-deep output convolution layer; and

the non-deep feature extraction module comprises:

a non-deep dimension-raising operation unit, configured to use the non-deep input convolution layer to perform a dimension-raising operation on the (j-1+n)-level feature of the detection image, to obtain a (j+n)-level raised-dimension feature of the detection image;

a non-deep feature extraction unit, configured to use the non-deep non-linear transform convolution layer to perform a feature extraction operation on the (j+n)-level raised-dimension feature of the detection image, to obtain a (j+n)-level convolution feature of the detection image; and

a non-deep dimension-reducing operation unit, configured to use the non-deep output convolution layer to perform a dimension-reducing operation on the (j+n)-level convolution feature of the detection image, to obtain the (j+n)-level feature of the detection image.

13. The image object detection apparatus according to claim 10, wherein the feature prediction model comprises a feature classification convolution layer and a feature output convolution layer; and

the object detection module comprises:

a feature classification unit, configured to use the feature classification convolution layer to perform a feature extraction operation on the a-level to (m+n)-level features of the detection image, to obtain classification recognition features of the detection image; and

a feature output unit, configured to use the feature output convolution layer to perform a dimension-reducing operation on the classification recognition features of the detection image, to obtain the object type and position in the detection image.

14. The image object detection apparatus according to claim 13, wherein the convolution kernel size of the feature classification convolution layer is 3*3, and the convolution kernel size of the feature output convolution layer is 1*1; and

the feature classification convolution layer is a depthwise separable convolution layer without an activation function, and the feature output convolution layer is a standard convolution layer without an activation function.

15. A storage medium, storing processor-executable instructions which, when executed by one or more processors, implement the image object detection method according to any one of claims 1 to 9.

16. An electronic device, comprising one or more processors and a storage apparatus, the storage apparatus being configured to store one or more executable program instructions; and

the one or more processors being configured to execute the one or more executable program instructions in the storage apparatus, to implement the image object detection method according to any one of claims 1 to 9.