WO2018153319A1 - Object detection method, neural network training method, apparatus, and electronic device - Google Patents

Object detection method, neural network training method, apparatus, and electronic device

Info

Publication number
WO2018153319A1
WO2018153319A1 · PCT/CN2018/076653 · CN2018076653W
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
subnet
target area
area frame
layer
Prior art date
Application number
PCT/CN2018/076653
Other languages
English (en)
French (fr)
Inventor
李弘扬
刘宇
欧阳万里
王晓刚
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to SG11201907355XA
Priority to JP2019545345A (JP6902611B2)
Priority to US16/314,406 (US11321593B2)
Publication of WO2018153319A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/7747Organisation of the process, e.g. bagging or boosting

Definitions

  • the present application relates to image processing technologies, and in particular, to an object detection method and apparatus, a neural network training method and apparatus, and an electronic device.
  • the purpose of target area frame detection is to detect a number of rectangular frames in which an object may be present.
  • the size of the feature map is gradually reduced by the pooling layer in the convolutional neural network, thereby finally determining a rectangular frame in which an object may exist.
  • This network structure is called "zoom-out structure".
  • the present application provides a technique for performing target area frame detection based on an image.
  • an object detecting method includes: acquiring a plurality of fusion feature maps by prediction from an image to be processed, using a deep convolutional neural network for target area frame detection; wherein the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, and the fusion feature map is obtained from a first feature map and a second feature map.
  • the first feature map is obtained from the first subnet, and
  • the second feature map is obtained from the second subnet.
  • the target area frame data is obtained according to the multiple merged feature maps.
  • the second subnet is disposed at the end of the first subnet; the first subnet has a plurality of first convolutional layers and the at least one downsampling layer,
  • the downsampling layer being disposed between the plurality of first convolutional layers;
  • the second subnet has a plurality of second convolutional layers and the at least one upsampling layer,
  • the upsampling layer being disposed between the plurality of second convolutional layers; the first convolutional layers and the second convolutional layers are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are respectively symmetrically disposed.
  • at least one of the first convolutional layers is provided with a first output branch for outputting the first feature map, and at least one of the second convolutional layers is provided with a second output branch for outputting the second feature map.
  • the second subnet further has a plurality of third convolutional layers, and the input of each third convolutional layer includes the first output branch and the second output branch;
  • the predicting of the plurality of fusion feature maps comprises: acquiring the fusion feature maps from the outputs of the plurality of third convolutional layers, respectively.
  • at least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to a plurality of object detection frames; the acquiring of the target area frame data according to the multiple fusion feature maps includes: acquiring target area frame data corresponding to each of the fusion feature maps according to the frame fusion detection data and the prediction accuracy information in at least one of the fusion feature maps.
  • the acquiring of the target area frame data according to the multiple fusion feature maps includes: respectively acquiring primary selection target area frame data corresponding to the fusion feature maps; iteratively performing the following object region frame regression operation until the iteration satisfies an iterative termination condition: adjusting the fusion feature maps and obtaining new primary selection target area frame data from the adjusted fusion feature maps; and
  • using the primary selection target area frame data obtained through the iteration as the target area frame data of the image to be processed.
  • the deep convolutional neural network further includes a third subnet, the third subnet having a plurality of groups of fourth convolutional layers and a plurality of pooling layers; the plurality of groups of fourth convolutional layers respectively correspond to the third convolutional layers, the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolutional layers, and the input of each pooling layer includes the adjusted fusion feature map and the primary selection target area frame data.
  • the object region frame regression operation includes: convolving the current fusion feature map through the fourth convolutional layer to obtain an adjusted fusion feature map; performing, according to the current primary selection target area frame data, area pooling on the adjusted fusion feature map through the pooling layer to obtain a new fusion feature map; and acquiring the new primary selection target area frame data from the new fusion feature map.
  • the third subnet further has a fifth convolutional layer disposed at the output of the pooling layer, and the acquiring of the new primary selection target area frame data from the new fusion feature map includes: performing a normalizing convolution on the new fusion feature map through the fifth convolutional layer, and acquiring the new primary selection target area frame data from the normalized fusion feature map.
  • the first subnet and the second subnet are of an Inception-Batch Normalization (Inception-BN) network structure, and
  • the third subnet is of a residual network (ResNet) structure.
  • a training method for a neural network includes: inputting a sample image containing target area frame labeling information into a deep convolutional neural network for target area frame detection, and acquiring a plurality of fusion feature maps by detection;
  • wherein the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer; the fusion feature map is obtained from a first feature map and a second feature map, the first feature map being obtained from the first subnet and the second feature map being obtained from the second subnet; acquiring target area frame data of the sample image according to the plurality of fusion feature maps; determining first difference data of object frame detection according to the acquired target area frame data of the sample image and the target area frame labeling information; and adjusting network parameters of the deep convolutional neural network according to the first difference data.
  • the second subnet is disposed at the end of the first subnet; the first subnet has a plurality of first convolutional layers and the at least one downsampling layer,
  • the downsampling layer being disposed between the plurality of first convolutional layers;
  • the second subnet has a plurality of second convolutional layers and the at least one upsampling layer,
  • the upsampling layer being disposed between the plurality of second convolutional layers; the first convolutional layers and the second convolutional layers are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are respectively symmetrically disposed.
  • at least one of the first convolutional layers is provided with a first output branch for outputting the first feature map, and at least one of the second convolutional layers is provided with a second output branch for outputting the second feature map.
  • the second subnet further has a plurality of third convolutional layers, and the input of each third convolutional layer includes the first output branch and the second output branch;
  • the detecting of the plurality of fusion feature maps comprises: acquiring the fusion feature maps from the outputs of the plurality of third convolutional layers, respectively.
  • At least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to the plurality of object detection frames.
  • the deep convolutional neural network further includes a third subnet, the third subnet having a plurality of groups of fourth convolutional layers and a plurality of pooling layers; the plurality of groups of fourth convolutional layers respectively correspond to the third convolutional layers, the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolutional layers, and the input of each pooling layer includes the adjusted fusion feature map and the primary selection target area frame data.
  • the method further includes: iteratively performing the following target area frame regression training operation until the iteration satisfies an iterative termination condition: convolving the current fusion feature maps through the fourth convolutional layers to obtain adjusted fusion feature maps;
  • performing, according to the current primary selection target area frame data, area pooling on the adjusted fusion feature maps through the pooling layers to obtain new fusion feature maps;
  • acquiring new primary selection target area frame data from the new fusion feature maps; determining second difference data of object frame detection according to the frame regression data between the unadjusted primary selection target area frame data and the new primary selection target area frame data, the new primary selection
  • target area frame data, and the corresponding target area frame labeling information; and adjusting network parameters of the third subnet according to the second difference data.
  • the third subnet further has a fifth convolutional layer disposed at the output of the pooling layer, and the acquiring of the new primary selection target area frame data from the new fusion feature map includes: performing a normalizing convolution on the new fusion feature map through the fifth convolutional layer, and acquiring the new primary selection target area frame data from the normalized fusion feature map.
  • before the sample image containing the target area frame labeling information is input into the deep convolutional neural network for target area frame detection, the method further includes: scaling the sample image such that the true value of at least one object region frame is covered by an object detection frame.
  • the target area frame labeling information of the sample image includes labeling information of positive sample area frames and labeling information of negative sample area frames; the overlap ratio between a positive sample area frame and the true value of an object region frame is not lower than a first overlap ratio value, the overlap ratio between a negative sample area frame and the true value of an object region frame is not higher than a second overlap ratio value, and the first overlap ratio value is greater than the second overlap ratio value.
  • the target area frame labeling information of the sample image further includes labeling information of neutral sample area frames, and the overlap ratio between a neutral sample area frame and the true value of an object region frame is between the first overlap ratio value and the second overlap ratio value.
  • the proportion of the labeled positive sample area frames in the total number of positive sample area frames, negative sample area frames, and neutral sample area frames is not less than a predetermined first ratio, the first ratio being greater than 50%; the proportion of the labeled negative sample area frames in the total number of frames is not greater than a predetermined second ratio; and the proportion of the labeled neutral sample area frames in the total number of frames is not greater than a predetermined third ratio, the third ratio being not greater than half the sum of the first ratio and the second ratio.
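  • As an illustration of this overlap-ratio labeling, the following Python sketch (using PyTorch/torchvision, which the patent does not prescribe) labels candidate area frames as positive, negative, or neutral samples; the thresholds 0.5 and 0.3 are illustrative stand-ins for the first and second overlap ratio values, and the function name is hypothetical. Composing the batch so that positive frames exceed the stated first ratio is left to the caller.

```python
import torch
from torchvision.ops import box_iou

def label_sample_boxes(candidates, gt_boxes, t_pos=0.5, t_neg=0.3):
    """candidates: (N, 4) candidate area frames; gt_boxes: (M, 4) ground-truth
    object region frames; both as (x1, y1, x2, y2)."""
    iou = box_iou(candidates, gt_boxes)            # (N, M) pairwise overlap ratios
    best_iou, _ = iou.max(dim=1)                   # best overlap for each candidate
    labels = torch.full((candidates.size(0),), -1, dtype=torch.long)  # -1 = neutral
    labels[best_iou >= t_pos] = 1                  # positive sample area frames
    labels[best_iou <= t_neg] = 0                  # negative sample area frames
    return labels
```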
  • the first subnet and the second subnet are of an Inception-Batch Normalization (Inception-BN) network structure, and
  • the third subnet is of a residual network (ResNet) structure.
  • an object detecting apparatus includes: a fusion feature map prediction module configured to acquire a plurality of fusion feature maps by prediction from an image to be processed, using a deep convolutional neural network for target area frame detection,
  • wherein the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer,
  • and the fusion feature map is obtained from a first feature map obtained from the first subnet and a second feature map obtained from the second subnet; and a target area frame prediction module configured to acquire target area frame data according to the plurality of fusion feature maps acquired by the fusion feature map prediction module.
  • the second subnet is disposed at the end of the first subnet; the first subnet has a plurality of first convolutional layers and the at least one downsampling layer,
  • the downsampling layer being disposed between the plurality of first convolutional layers;
  • the second subnet has a plurality of second convolutional layers and the at least one upsampling layer,
  • the upsampling layer being disposed between the plurality of second convolutional layers; the first convolutional layers and the second convolutional layers are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are respectively symmetrically disposed.
  • at least one of the first convolutional layers is provided with a first output branch for outputting the first feature map, and at least one of the second convolutional layers is provided with a second output branch for outputting the second feature map.
  • the second subnet further has a plurality of third convolutional layers, and the input of each third convolutional layer includes the first output branch and the second output branch;
  • the fusion feature map prediction module is configured to acquire the fusion feature maps from the outputs of the plurality of third convolutional layers, respectively.
  • at least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to a plurality of object detection frames, and
  • the target area frame prediction module is configured to acquire the target area frame data corresponding to each fusion feature map according to the frame fusion detection data and the prediction accuracy information in at least one fusion feature map.
  • the target area frame prediction module is configured to: respectively acquire primary selection target area frame data corresponding to the fusion feature maps; iteratively perform the following object region frame regression operation until the iteration satisfies an iterative termination condition: adjusting the fusion feature maps and obtaining new primary selection target area frame data from the adjusted fusion feature maps; and use the primary selection target area frame data obtained through the iteration as the target area frame data of the image to be processed.
  • the deep convolutional neural network further includes a third subnet, the third subnet having a plurality of groups of fourth convolutional layers and a plurality of pooling layers; the plurality of groups of fourth convolutional layers respectively correspond to the third convolutional layers, the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolutional layers, and the input of each pooling layer includes the adjusted fusion feature map and the primary selection target area frame data.
  • the target area frame prediction module includes: a frame adjustment unit configured to convolve the current fusion feature map through the fourth convolutional layer to obtain an adjusted fusion feature map;
  • an area pooling unit configured to perform, according to the current primary selection target area frame data, area pooling on the adjusted fusion feature map through the pooling layer to obtain a new fusion feature map; and a primary selection frame acquiring unit configured to acquire the new primary selection target area frame data from the new fusion feature map.
  • the third subnet further has a fifth convolutional layer disposed at the output of the pooling layer, and the primary selection frame acquiring unit is configured to perform a normalizing convolution on the new fusion feature map through the fifth convolutional layer and acquire the new primary selection target area frame data from the normalized fusion feature map.
  • the first subnet and the second subnet are of an Inception-Batch Normalization (Inception-BN) network structure, and
  • the third subnet is of a residual network (ResNet) structure.
  • a training apparatus for a neural network includes: a fusion feature map detection module configured to input a sample image containing target area frame labeling information into a deep convolutional neural network for target area frame detection and acquire a plurality of fusion feature maps by detection, the deep convolutional neural network including a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, the fusion feature map being obtained from a first feature map obtained from the first subnet and a second feature map obtained from the second subnet; a target area frame detection module configured to acquire target area frame data of the sample image according to the plurality of fusion feature maps; a first difference acquisition module configured to determine first difference data of object frame detection according to the acquired target area frame data of the sample image and the target area frame labeling information; and a first network training module configured to adjust network parameters of the deep convolutional neural network according to the first difference data.
  • the second subnet is disposed at the end of the first subnet; the first subnet has a plurality of first convolutional layers and the at least one downsampling layer,
  • the downsampling layer being disposed between the plurality of first convolutional layers;
  • the second subnet has a plurality of second convolutional layers and the at least one upsampling layer,
  • the upsampling layer being disposed between the plurality of second convolutional layers; the first convolutional layers and the second convolutional layers are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are respectively symmetrically disposed.
  • at least one of the first convolutional layers is provided with a first output branch for outputting the first feature map, and at least one of the second convolutional layers is provided with a second output branch for outputting the second feature map.
  • the second subnet further has a plurality of third convolutional layers, and the input of each third convolutional layer includes the first output branch and the second output branch;
  • the fusion feature map detection module is configured to acquire the fusion feature maps from the outputs of the plurality of third convolutional layers, respectively.
  • At least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to the plurality of object detection frames.
  • the deep convolutional neural network further includes a third subnet, the third subnet having a plurality of groups of fourth convolutional layers and a plurality of pooling layers; the plurality of groups of fourth convolutional layers respectively correspond to the third convolutional layers, the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolutional layers, and the input of each pooling layer includes the adjusted fusion feature map and the primary selection target area frame data.
  • the apparatus further includes: a frame regression iterative training module configured to iteratively perform the following target area frame regression training operation until the iteration satisfies an iterative termination condition:
  • convolving the current fusion feature maps through the fourth convolutional layers to obtain adjusted fusion feature maps; performing, according to the current primary selection target area frame data, area pooling on the adjusted fusion feature maps through the pooling layers to obtain new fusion feature maps; acquiring new primary selection target area frame data from the new fusion feature maps; determining second difference data of object frame detection according to the frame regression data between the unadjusted primary selection target area frame data and the new primary selection target area frame data,
  • the new primary selection target area frame data, and the corresponding target area frame labeling information; and adjusting network parameters of the third subnet according to the second difference data.
  • the third subnet further has a fifth convolutional layer disposed at the output of the pooling layer, and the frame regression iterative training module is configured to perform a normalizing convolution on the new fusion feature map through the fifth convolutional layer and acquire the new primary selection target area frame data from the normalized fusion feature map.
  • the apparatus further includes: a pre-processing module configured to, before the sample image containing the target area frame labeling information is input into the deep convolutional neural network for target area frame detection to acquire the plurality of fusion feature maps, scale the sample image such that the true value of at least one object region frame is covered by an object detection frame.
  • the target area frame labeling information of the sample image includes labeling information of positive sample area frames and labeling information of negative sample area frames; the overlap ratio between a positive sample area frame and the true value of an object region frame is not lower than a first overlap ratio value, the overlap ratio between a negative sample area frame and the true value of an object region frame is not higher than a second overlap ratio value, and the first overlap ratio value is greater than the second overlap ratio value.
  • the target area frame labeling information of the sample image further includes labeling information of neutral sample area frames, and the overlap ratio between a neutral sample area frame and the true value of an object region frame is between the first overlap ratio value and the second overlap ratio value.
  • the proportion of the labeled positive sample area frames in the total number of positive sample area frames, negative sample area frames, and neutral sample area frames is not less than a predetermined first ratio, the first ratio being greater than 50%; the proportion of the labeled negative sample area frames in the total number of frames is not greater than a predetermined second ratio; and the proportion of the labeled neutral sample area frames in the total number of frames is not greater than a predetermined third ratio, the third ratio being not greater than half the sum of the first ratio and the second ratio.
  • the first subnet and the second subnet are of an Inception-Batch Normalization (Inception-BN) network structure, and
  • the third subnet is of a residual network (ResNet) structure.
  • an electronic device comprising:
  • a memory configured to store at least one executable instruction, the executable instruction causing the processor to perform operations corresponding to the object detecting method according to any one of the embodiments of the present application; or a memory configured to store at least one executable instruction causing the processor to perform operations corresponding to the training method of the neural network according to any one of the embodiments of the present application.
  • Another electronic device comprising:
  • a processor and the object detecting apparatus according to any one of the embodiments of the present application, where the units in the object detecting apparatus are run when the processor runs the object detecting apparatus; or
  • a processor and the training apparatus of the neural network according to any one of the embodiments of the present application, where the units in the training apparatus of the neural network are run when the processor runs the training apparatus.
  • a computer program comprising computer readable code, where, when the computer readable code is run on a device, a processor in the device executes instructions for implementing the steps of the object detecting method according to any one of the embodiments of the present application; or
  • the processor in the device executes instructions for implementing the steps of the training method of the neural network according to any one of the embodiments of the present application.
  • a computer readable storage medium configured to store computer readable instructions which, when executed, implement the operations of the steps of the object detecting method according to any one of the embodiments of the present application, or the operations of the steps of the training method of the neural network according to any one of the embodiments of the present application.
  • in the object detection and neural network training technical solutions, a plurality of fusion feature maps are obtained by prediction from the image to be processed using the deep convolutional neural network for target area frame detection: a plurality of first feature maps are acquired from the first subnet having at least one downsampling layer, a plurality of second feature maps are acquired from the second subnet having at least one upsampling layer, and the first feature maps and the corresponding second feature maps are respectively fused to obtain the fusion feature maps. Thereafter, the target area frame data is acquired according to the plurality of fusion feature maps.
  • these fusion feature maps can effectively extract the target area frame data of both the large and the small objects contained in the image, thereby improving the accuracy and robustness of object detection.
  • FIG. 1 is a flow chart showing an object detecting method according to an embodiment of the present application.
  • FIG. 2 is a flow chart showing an object detecting method according to another embodiment of the present application.
  • FIG. 3 illustrates an exemplary structure of a deep convolutional neural network in accordance with an embodiment of the present application
  • FIG. 4 is a flow chart showing an object detecting method according to still another embodiment of the present application.
  • FIG. 5 is a flowchart illustrating a training method of a neural network according to an embodiment of the present application
  • FIG. 6 is a flow chart showing a training method of a neural network according to another embodiment of the present application.
  • FIG. 7 is a flow chart showing a training method of a neural network according to still another embodiment of the present application.
  • FIG. 8 is a block diagram showing the structure of an object detecting apparatus according to an embodiment of the present application.
  • FIG. 9 is a block diagram showing the structure of an object detecting apparatus according to another embodiment of the present application.
  • FIG. 10 is a block diagram showing the structure of a training apparatus for a neural network according to an embodiment of the present application.
  • FIG. 11 is a structural block diagram showing a training apparatus of a neural network according to another embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a first electronic device according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram showing a second electronic device according to another embodiment of the present application.
  • the application can be applied to a computer system/server that can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, large computer systems, and distributed cloud computing environments including any of the above.
  • the computer system/server can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include, but are not limited to, routines, programs, object programs, components, logic, and data structures, which perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • when target area frame detection is performed with the network structure provided by the prior art, the information in the feature maps obtained from the upper layers of the convolutional neural network cannot be effectively utilized to assist in processing the information of the lower layers of the network, so that the feature data acquired from the network is not sufficiently representative and robust, which is not conducive to the detection of small objects.
  • any of the technical solutions provided by the present disclosure may be implemented by software, by hardware, or by a combination of software and hardware.
  • the technical solutions provided by the present disclosure may be implemented by an electronic device or by a processor, which is not limited by the disclosure; the electronic device may include, but is not limited to, a terminal or a server, and the processor may include, but is not limited to, a CPU or a GPU. The details are not repeated below.
  • FIG. 1 is a flow chart showing an object detecting method according to an embodiment of the present application.
  • the object detecting method of this embodiment includes the following steps:
  • Step S110: Acquire a plurality of fusion feature maps by prediction from the image to be processed, using a deep convolutional neural network for target area frame detection.
  • step S110 may be performed by a processor invoking a memory stored instruction or by a fusion feature map prediction module 810 being executed by the processor.
  • the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer.
  • the fusion feature map is obtained by using the first feature map and the second feature map.
  • the first feature map is obtained from the first subnet, and the second feature map is obtained from the second subnet.
  • the image to be processed in the above embodiments of the present disclosure is a photo or video frame image in which one or more objects are captured.
  • the image should meet certain resolution requirements: at least, the captured objects should be identifiable by the naked eye.
  • the first subnet of the deep convolutional neural network for target area frame detection convolves and pools the image to be processed, and first feature maps of the image can be obtained at multiple convolutional layers of different depths of the first subnet;
  • these first feature maps characterize regions of different size extents.
  • the first feature maps obtained at shallower convolutional layers (those closer to the input of the deep convolutional neural network) can better express the details of the image, but make it difficult to distinguish the foreground from the background; the first feature maps obtained at deeper convolutional layers (those closer to the output) can better extract the overall semantic features of the image, but lose details such as small object information.
  • the second subnet, which has at least one upsampling layer, performs the inverse processing (deconvolution, upsampling, and pooling operations) on the first feature map acquired from the end of the first subnet, so that this first feature map is gradually enlarged, and second feature maps corresponding to the foregoing first feature maps are acquired at multiple convolutional layers of different depths of the second subnet. Since the second feature maps are obtained by deconvolving and upsampling the convolved, downsampled first feature maps, high-level semantic features are gradually combined with low-level detail features, which assists in identifying small objects (the area frames of small objects).
  • the image processing path formed by the first subnet and the second subnet has an hourglass-shaped structure: the first feature maps generated from the first convolutional layers of the first subnet are gradually reduced by downsampling, and
  • the first feature map generated at the end of the first subnet is progressively enlarged by the second convolutional layers and the upsampling layers of the second subnet.
  • at least one first feature map is merged with the corresponding second feature map to obtain a plurality of fusion feature maps, which better represent both the high-level semantic features and the low-level details of the image and are used to identify objects of different sizes.
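  • A minimal sketch of one such fusion step, assuming PyTorch (which the patent does not prescribe): a second (deep, semantic) feature map is upsampled to the spatial size of the corresponding first (shallow, detail-rich) feature map, and the two are combined by a convolution. The channel sizes and the concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Combines a first (detail) feature map with a second (semantic) feature map."""
    def __init__(self, c_first, c_second, c_out):
        super().__init__()
        self.fuse = nn.Conv2d(c_first + c_second, c_out, kernel_size=3, padding=1)

    def forward(self, first_feat, second_feat):
        # bring the second feature map to the spatial size of the first one
        second_feat = F.interpolate(second_feat, size=first_feat.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([first_feat, second_feat], dim=1))
```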
  • Step S120 Acquire target area frame data according to multiple fusion feature maps.
  • step S120 may be performed by a processor invoking a memory stored instruction or by a target area frame prediction module 820 being executed by the processor.
  • the target area frame data may be extracted from the at least one merged feature map, and the target area frame data extracted from the at least one merged feature map is integrated as the target area frame data detected from the image to be processed.
  • according to this embodiment, a plurality of fusion feature maps are obtained by prediction from the image to be processed using the deep convolutional neural network for target area frame detection: a plurality of first feature maps are acquired from the first subnet having at least one downsampling layer, a plurality of second feature maps are acquired from the second subnet having at least one upsampling layer, and the plurality of first feature maps and the plurality of second feature maps are respectively fused to obtain the fusion feature maps.
  • thereafter, the target area frame data is acquired according to the plurality of fusion feature maps.
  • these fusion feature maps can effectively extract the target area frame data of both the large and the small objects contained in the image, thereby improving the accuracy and robustness of object detection.
  • FIG. 2 is a flow chart showing an object detecting method according to another embodiment of the present application.
  • In step S210, a plurality of fusion feature maps are acquired by prediction from the image to be processed, using the deep convolutional neural network for target area frame detection.
  • step S210 may be performed by a processor invoking a memory stored instruction or by a fusion feature map prediction module 810 being executed by the processor.
  • the first subnet has a plurality of first convolutional layers and at least one downsampling layer, the downsampling layer being disposed between the plurality of first convolutional layers;
  • the second subnet has a plurality of second convolutional layers and at least one upsampling layer, the upsampling layer being disposed between the plurality of second convolutional layers.
  • the second subnet is disposed at the end of the first subnet; the first convolutional layers and the second convolutional layers are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are respectively symmetrically disposed.
  • thereby, a plurality of second feature maps of the image are acquired through the second convolutional layers and upsampling layers of the second subnet.
  • a first output branch for outputting the first feature map is provided on the at least one first convolution layer, and a second output branch for outputting the second feature map is provided on the second convolution layer.
  • the second subnet further has a plurality of third convolutional layers, and the input of the third convolutional layer includes the first output branch and the second output branch.
  • the fusion feature map is acquired from the output ends of the plurality of third convolution layers.
  • both the first subnet and the second subnet are constructed with an Inception-Batch Normalization (Inception-BN) network structure, which performs well in object detection.
  • the Inception-BN network structure excels at extracting different structures/patterns from images and is suitable for performing the functions of the first subnet and the second subnet.
  • FIG. 3 illustrates an exemplary structure of a deep convolutional neural network in accordance with an embodiment of the present disclosure.
  • the deep convolutional neural network includes a first subnet SN1 and a second subnet SN2.
  • the first subnet SN1 has a plurality of first convolutional layers C1 and at least one downsampling layer P1 disposed between the plurality of first convolutional layers C1;
  • the second subnet SN2 has a plurality of second convolutional layers C2 and at least one upsampling layer P2 disposed between the plurality of second convolutional layers C2. The downsampling layer P1 and the upsampling layer P2 are symmetrically disposed, and the plurality of first convolutional layers C1 and the plurality of second convolutional layers C2 are also symmetrically disposed.
  • at least one first convolutional layer C1 is provided with a first output branch F1, and
  • at least one second convolutional layer C2 is provided with a second output branch F2.
  • the second sub-network SN2 is further provided with a plurality of third convolutional layers C3, and the fusion feature map is output from the plurality of third convolutional layers C3.
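  • The following sketch, assuming PyTorch, mirrors the hourglass arrangement of FIG. 3: first convolutional layers C1 with downsampling P1, symmetric second convolutional layers C2 with upsampling P2, and third convolutional layers C3 that fuse the F1/F2 output branches into fusion feature maps. Layer counts, channel widths, and activation choices are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn

class HourglassBackbone(nn.Module):
    """SN1 (downsampling path) + SN2 (upsampling path) + C3 fusion layers.
    Input height/width must be divisible by 4 in this toy configuration."""
    def __init__(self, c1=64, c2=128, c3=256):
        super().__init__()
        # First subnet SN1: first convolutional layers C1 with downsampling P1
        self.c1_a = nn.Sequential(nn.Conv2d(3, c1, 3, padding=1), nn.ReLU(inplace=True))
        self.p1_a = nn.MaxPool2d(2)
        self.c1_b = nn.Sequential(nn.Conv2d(c1, c2, 3, padding=1), nn.ReLU(inplace=True))
        self.p1_b = nn.MaxPool2d(2)
        self.c1_c = nn.Sequential(nn.Conv2d(c2, c3, 3, padding=1), nn.ReLU(inplace=True))
        # Second subnet SN2: second convolutional layers C2 with upsampling P2
        self.c2_c = nn.Sequential(nn.Conv2d(c3, c2, 3, padding=1), nn.ReLU(inplace=True))
        self.p2_b = nn.Upsample(scale_factor=2, mode="nearest")
        self.c2_b = nn.Sequential(nn.Conv2d(c2, c1, 3, padding=1), nn.ReLU(inplace=True))
        self.p2_a = nn.Upsample(scale_factor=2, mode="nearest")
        self.c2_a = nn.Sequential(nn.Conv2d(c1, c1, 3, padding=1), nn.ReLU(inplace=True))
        # Third convolutional layers C3: fuse the symmetric F1 / F2 branches
        self.c3_b = nn.Conv2d(c2 + c2, c2, 3, padding=1)
        self.c3_a = nn.Conv2d(c1 + c1, c1, 3, padding=1)

    def forward(self, x):
        f1_a = self.c1_a(x)                      # F1 branch, full resolution
        f1_b = self.c1_b(self.p1_a(f1_a))        # F1 branch, 1/2 resolution
        bottom = self.c1_c(self.p1_b(f1_b))      # end of SN1 (smallest map)
        f2_b = self.p2_b(self.c2_c(bottom))      # F2 branch, back to 1/2 resolution
        f2_a = self.c2_a(self.p2_a(self.c2_b(f2_b)))  # F2 branch, full resolution
        fused_half = self.c3_b(torch.cat([f1_b, f2_b], dim=1))  # fusion feature map (1/2)
        fused_full = self.c3_a(torch.cat([f1_a, f2_a], dim=1))  # fusion feature map (1/1)
        return [fused_full, fused_half]

# e.g. HourglassBackbone()(torch.randn(1, 3, 64, 64)) -> two fusion feature maps
```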
  • at least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to a plurality of object detection frames. That is, information of object detection frames used for object area frame detection, such as convolution parameters or feature parameters, is provided at the first convolutional layers and the second convolutional layers, respectively.
  • the information of the object detection frames set at the first convolutional layers and the second convolutional layers of different depths corresponds to two or more object detection frame sets, and these object detection frame sets respectively
  • include object detection frames covering different ranges of detection frame sizes, for acquiring feature data of object area frames of different sizes at different depths of the deep convolutional neural network.
  • the frame fusion detection data of at least one point in the fusion feature map may include, but is not limited to, coordinate data, or position and size data, corresponding to the object detection frames in the object detection frame set, and the prediction accuracy information may be confidence data of the frame fusion detection data,
  • such as a prediction accuracy.
  • for example, each point in the fusion feature map may have coordinate data corresponding to 1, 3, 6, or 9 object detection frames, together with confidence data for that coordinate data.
  • step S220 is performed after step 210.
  • Step 220 Acquire target area frame data corresponding to each of the fusion feature maps according to the frame fusion detection data and the prediction accuracy information in the at least one fusion feature map.
  • step S220 may be performed by a processor invoking a memory stored instruction or by a processor.
  • the target area frame data may be acquired according to the prediction accuracy information of the frame fusion detection data of at least one point in the fusion feature map. For example, if the confidence of certain frame coordinate data of a certain point is greater than a predetermined threshold (e.g., 60% or 70%), the area frame corresponding to that frame coordinate data may be determined as one item of the target area frame data.
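  • A hedged sketch of this selection step: each spatial point of a fusion feature map carries coordinate data for several object detection frames plus a confidence value per frame, and frames whose confidence exceeds a threshold (60% here) are kept as target area frame data. The tensor layout and decoding below are assumptions made for illustration, not the patent's exact parameterization.

```python
import torch

def boxes_from_fused_map(box_data, scores, threshold=0.6):
    """box_data: (A * 4, H, W) frame coordinate data for A object detection
    frames per point; scores: (A, H, W) prediction-accuracy (confidence) data."""
    A, H, W = scores.shape
    boxes = box_data.view(A, 4, H, W).permute(0, 2, 3, 1).reshape(-1, 4)
    conf = scores.reshape(-1)
    keep = conf > threshold            # keep only sufficiently confident frames
    return boxes[keep], conf[keep]
```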
  • steps S230-S240 are performed.
  • Step S230 Acquire primary selection target area frame data corresponding to each of the fusion feature maps.
  • step S230 may be performed by a processor invoking a memory stored instruction or by a processor.
  • a process similar to the foregoing step S220 or S120 is performed to obtain the primary selection target area frame data; that is, the target area frame data acquired as in step S220 or S120 is used in step S230 as the primary selection target area frame data, which will be further adjusted and corrected to improve the accuracy of object area frame detection.
  • In step S240, the following object region frame regression operation is iteratively performed until the iteration satisfies the iterative termination condition: adjusting the fusion feature maps and acquiring new primary selection target area frame data from the adjusted fusion feature maps.
  • step S240 may be performed by a processor invoking a memory stored instruction or by a processor.
  • that is, the fusion feature maps are adjusted so as to adjust the primary selection target area frame data, and new primary selection target area frame data is then obtained from the adjusted fusion feature maps, thereby performing regression on the primary selection target area frames (the object region frame regression operation) to obtain more accurate new primary selection target area frame data.
  • the object region frame regression operation is performed iteratively until the iterative termination condition is satisfied, so as to finally obtain more accurate primary selection target area frame data.
  • the iterative termination condition can be set as needed, for example: a predetermined number of iterations is reached, or the adjustment (i.e., the frame regression) between the new primary selection target area frame data and the unadjusted primary selection target area frame data is smaller than a predetermined frame regression value.
  • after the iteration of step S240 is completed, the primary selection target area frame data obtained through the iteration is taken as the target area frame data of the image to be processed.
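  • The iteration logic can be summarized by the loop skeleton below; regress_once is a hypothetical helper standing in for the object region frame regression operation detailed with reference to FIG. 4, the boxes are assumed to be PyTorch tensors, and the budget of 3 iterations and the 1-pixel shift threshold are illustrative termination conditions.

```python
def iterative_box_regression(fused_maps, boxes, regress_once,
                             max_iters=3, min_shift=1.0):
    """regress_once(fused_maps, boxes) -> (adjusted_maps, new_boxes); hypothetical."""
    for _ in range(max_iters):
        fused_maps, new_boxes = regress_once(fused_maps, boxes)
        shift = (new_boxes - boxes).abs().max().item()   # largest coordinate change
        boxes = new_boxes
        if shift < min_shift:        # frame regression smaller than the preset value
            break
    return boxes
```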
  • the object detecting method of this embodiment constructs a deep convolutional neural network for target area frame detection with a symmetric structure: a plurality of first feature maps are gradually acquired at the plurality of first convolutional layers of the first subnet, a plurality of second feature maps are acquired by upsampling in the second subnet, and the plurality of first feature maps and the corresponding second feature maps are further convolved, thereby obtaining fusion feature maps that better characterize both the high-level semantic features of the image (e.g., layout and foreground/background information)
  • and the low-level detail features (e.g., small object information).
  • in addition, new primary selection target area frame data is obtained from the adjusted fusion feature maps by adjusting the plurality of fusion feature maps, so that the primary selection target area frame data is iteratively regressed.
  • FIG. 4 is a flow chart showing an object detecting method according to still another embodiment of the present application. This embodiment describes an exemplary object region frame regression operation in the aforementioned step S240.
  • the deep convolutional neural network further includes a third subnet having a plurality of groups of fourth convolutional layers and a plurality of pooling layers; the plurality of groups of fourth convolutional layers respectively correspond to the third convolutional layers,
  • the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolutional layers, and
  • the input of each pooling layer comprises the adjusted fusion feature map and the primary selection target area frame data.
  • each set of fourth convolutional layers may include one or more convolutional layers, and each set of fourth convolutional layers may be coupled to the output of the aforementioned third convolutional layer to receive the fused feature map as an input.
  • Each pooling layer is disposed at the end of the corresponding fourth convolution layer, and receives the adjusted fusion feature map and the primary target area frame data as inputs.
  • each group of fourth convolutional layers is used for convolving the fusion feature map acquired from the third convolutional layer to obtain an adjusted fusion feature map, whereby the primary selection target area frame data acquired from the fusion feature map is adjusted.
  • the pooling layer in the third subnet is used for area pooling of the adjusted fusion feature map obtained by the convolution of the fourth convolutional layers, so as to obtain a new fusion feature map, from which new primary selection target area frame data can be obtained.
  • thus, each iteration involves the plurality of fusion feature maps and the primary selection target area frame data at the beginning of the current iteration, as well as the new plurality of fusion feature maps and the new primary selection target area frame data obtained at the end of the current iteration.
  • In step S410, the current fusion feature map is convolved through the fourth convolutional layer to obtain an adjusted fusion feature map, thereby adjusting the current primary selection target area frame; the adjustment includes adjustment of the position and/or size of the primary selection target area frame.
  • the step S410 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a frame adjustment unit 821 that is executed by the processor.
  • In step S420, according to the current primary selection target area frame data, area pooling is performed on the adjusted fusion feature map through the pooling layer to obtain a new fusion feature map.
  • step S420 may be performed by the processor invoking a corresponding instruction stored in the memory or by the area pooling unit 822 being executed by the processor.
  • that is, the current primary selection target area frame is used as the region of attention, and area pooling is performed on the adjusted fusion feature map to obtain a new fusion feature map.
  • the area pooling performed according to the current primary selection target area frame data yields a new fusion feature map that reflects the degree of response to the adjusted target area frame, so that new primary selection target area frame data can be obtained from the new fusion feature map.
  • In step S430, new primary selection target area frame data is acquired from the new fusion feature map, completing one regression of the target area frame and bringing the adjusted target area frame closer to the ground truth of the object region frame.
  • the processing of step S430 can be performed by a process similar to step S120 or S220.
  • the step S430 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a preliminary box acquisition unit 823 executed by the processor.
  • the third subnet further has a fifth convolution layer disposed at the output of the pooling layer.
  • in this case, step S430 specifically includes: performing a normalizing convolution on the new fusion feature map through the fifth convolutional layer, and acquiring the new primary selection target area frame data from the normalized fusion feature map.
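  • A sketch of one such regression iteration (steps S410 to S430), assuming PyTorch and torchvision's roi_align for the area pooling: a fourth convolutional layer adjusts the fusion feature map, pooling is applied around the current primary selection frames, a fifth (normalizing) convolution follows, and a small head predicts the refined frame data. All widths and the linear box head are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class BoxRegressionStep(nn.Module):
    def __init__(self, channels=64, pool_size=7):
        super().__init__()
        self.fourth_conv = nn.Conv2d(channels, channels, 3, padding=1)   # adjusts the fusion feature map
        self.fifth_conv = nn.Sequential(                                 # "normalizing" convolution
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.box_head = nn.Linear(channels * pool_size * pool_size, 4)   # predicts frame adjustments
        self.pool_size = pool_size

    def forward(self, fused_map, boxes):
        """fused_map: (1, C, H, W); boxes: (K, 4) current primary selection frames (x1, y1, x2, y2)."""
        adjusted = self.fourth_conv(fused_map)                           # step S410
        rois = torch.cat([torch.zeros(boxes.size(0), 1), boxes], dim=1)  # prepend batch index 0
        pooled = roi_align(adjusted, rois, output_size=self.pool_size)   # step S420: area pooling
        deltas = self.box_head(self.fifth_conv(pooled).flatten(1))       # step S430
        return adjusted, boxes + deltas                                  # new primary selection frames
```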
  • the third subnet can be constructed using any convolutional neural network having the above structure.
  • optionally, the third subnet is constructed with a residual network (ResNet) structure, which performs well in recently developed object detection techniques, to perform the area pooling and the normalizing convolution.
  • according to this embodiment, the at least one fusion feature map is further convolved to adjust the primary selection target area frame data contained in the fusion feature map; a new fusion feature map is then obtained through area pooling, and new primary selection target area frame data is obtained from the new fusion feature map. The predicted primary selection target area frame data is thereby adjusted and regressed, which helps to improve the accuracy and robustness of object detection.
  • FIG. 5 is a flow chart showing a training method of a neural network according to an embodiment of the present application.
  • In step S510, a sample image containing target area frame labeling information is input to a deep convolutional neural network for target area frame detection, and a plurality of fusion feature maps are acquired by detection.
  • step S510 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the fused feature map detection module 1010 executed by the processor.
  • the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, and the fusion feature map is obtained from a first feature map and a second feature map;
  • the first feature map is obtained from the first subnet, and
  • the second feature map is obtained from the second subnet.
  • that is, a plurality of fusion feature maps can be acquired by detection from the sample image containing the target area frame labeling information.
  • step S510 is typically performed on a plurality of sample images, so that a plurality of fusion feature maps are acquired by detection for at least one sample image.
  • Step S520 acquiring target area frame data of the sample image according to the plurality of fusion feature maps.
  • the step S520 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the target area frame detection module 1020 being executed by the processor.
  • step S520 is similar to the process of step S120, and details are not described herein.
  • Step S530 determining first difference data detected by the object frame according to the target area frame data of the acquired sample image and the target area frame labeling information.
  • the step S530 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the first difference acquisition module 1030 being executed by the processor.
  • for example, a loss value or a deviation value may be calculated from the acquired target area frame data of the sample image and the target area frame labeling information and used as the first difference data, which serves as the basis for subsequently training the deep convolutional neural network.
  • step S540 the network parameters of the deep convolutional neural network are adjusted according to the first difference data.
  • step S540 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the first network training module 1040 being executed by the processor.
  • the determined first difference data is backhauled to the deep convolutional neural network to adjust the network parameters of the deep convolutional neural network.
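  • As an illustration of steps S530-S540, the sketch below computes a loss ("first difference data") from predicted boxes and labels and back-propagates it to adjust the network parameters, assuming PyTorch; the specific loss terms (cross-entropy for the confidence, smooth L1 for the coordinates) and the toy linear head are assumptions, not the exact losses of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def first_difference_data(pred_scores, pred_boxes, gt_labels, gt_boxes):
    """Loss ("first difference data") between predicted and labeled region frames."""
    cls_loss = F.cross_entropy(pred_scores, gt_labels)   # confidence term
    reg_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)    # coordinate term
    return cls_loss + reg_loss

# toy stand-in for the detection head of the deep convolutional neural network
head = nn.Linear(256, 2 + 4)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)

features = torch.randn(8, 256)           # 8 candidate regions
gt_labels = torch.randint(0, 2, (8,))    # object / background labels
gt_boxes = torch.rand(8, 4)              # labeled box coordinates

out = head(features)
loss = first_difference_data(out[:, :2], out[:, 2:], gt_labels, gt_boxes)
optimizer.zero_grad()
loss.backward()    # propagate the first difference data back through the network (step S530)
optimizer.step()   # adjust the network parameters (step S540)
```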
  • The sample image containing the target region frame labeling information is input into the deep convolutional neural network for target region frame detection, and a plurality of fused feature maps are acquired through detection; a plurality of first feature maps are acquired from the first subnet having at least one downsampling layer, a plurality of second feature maps are acquired from the second subnet having at least one upsampling layer, the fused feature maps are obtained by respectively fusing the plurality of first feature maps with the plurality of second feature maps, and the target region frame data is obtained according to the plurality of fused feature maps.
  • The first difference data is determined according to the acquired target region frame data and the target region frame labeling information, and the network parameters of the deep convolutional neural network are then adjusted according to the first difference data. Because the fused feature maps of the deep convolutional neural network obtained through training characterize both the high-level semantic features of the image (such as layout and foreground/background information) and the low-level detail features (such as small-object information) well, the target region frame data of both large and small objects contained in the image can be effectively extracted based on these fused feature maps.
  • the deep convolutional neural network obtained by training can improve the accuracy and robustness of object detection.
  • FIG. 6 is a flowchart illustrating a training method of a neural network according to another embodiment of the present application.
  • the second subnet is disposed at the end of the first subnet;
  • The first subnet has a plurality of first convolutional layers and at least one downsampling layer, and the downsampling layer is disposed between the plurality of first convolutional layers;
  • the second sub-network has a plurality of second convolutional layers and at least one up-sampling layer, and the up-sampling layer is disposed between the plurality of second convolutional layers.
  • the first convolution layer and the second convolution layer are symmetrically disposed, and at least one downsampling layer and at least one upsampling layer are symmetrically disposed, respectively.
  • At least one first convolutional layer is provided with a first output branch for outputting the first feature map
  • the second convolutional layer is provided with a second output branch for outputting the second feature map.
  • the second subnet further has a plurality of third convolutional layers, the input of the third convolutional layer comprising a first output branch and a second output branch.
  • the third convolution layer is configured to convolve the first feature map and the corresponding second feature map from the first output branch and the second output branch to obtain a corresponding merged feature map.
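  • A minimal sketch of this symmetric structure is given below, assuming PyTorch; the three-level depth, the channel counts, and the use of channel concatenation followed by a 1x1 convolution as the third (fusion) convolutional layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBackbone(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        # first subnet: first convolutional layers with downsampling between them
        self.down1 = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.down3 = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        # second subnet: second convolutional layers with upsampling between them
        self.up2 = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        # third convolutional layers: fuse each first feature map with its symmetric second feature map
        self.fuse2 = nn.Conv2d(2 * c, c, 1)
        self.fuse1 = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        f1 = self.down1(x)          # first feature maps (first output branches)
        f2 = self.down2(f1)
        f3 = self.down3(f2)
        s2 = self.up2(f3)           # second feature maps (second output branches)
        s1 = self.up1(s2)
        fused2 = self.fuse2(torch.cat([f2, s2], dim=1))   # fused feature maps
        fused1 = self.fuse1(torch.cat([f1, s1], dim=1))
        return [fused1, fused2]

maps = FusionBackbone()(torch.randn(1, 3, 64, 64))
print([m.shape for m in maps])  # [torch.Size([1, 64, 64, 64]), torch.Size([1, 64, 32, 32])]
```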
  • In step S610, the sample image is scaled such that the true value of at least one object region frame in the sample image is covered by an object detection frame. This ensures that any batch of sample images contains a positive sample.
  • a sufficient number of positive samples are selected, and a certain number of negative samples are selected, so that the first subnet and the second subnet obtained by the training converge better.
  • Here, a positive sample refers to a positive sample region frame, and a negative sample refers to a negative sample region frame.
  • The positive sample region frame and the negative sample region frame may be defined according to the following criteria: the overlap ratio between a positive sample region frame and the true value of an object region frame is not lower than a first overlap ratio value, the overlap ratio between a negative sample region frame and the true value of an object region frame is not higher than a second overlap ratio value, and the first overlap ratio value is greater than the second overlap ratio value.
  • the target area frame label information of the sample image includes the label information of the positive sample area frame and the label information of the negative sample area frame.
  • The first overlap ratio value and the second overlap ratio value may be set according to design requirements; for example, the first overlap ratio value is set to any ratio value in the 70%-95% range, and the second overlap ratio value is set to any ratio value in the 0%-30% or 0%-25% range.
  • A neutral sample, i.e., a neutral sample region frame, may be defined according to the following criterion: the overlap ratio between the neutral sample region frame and the true value of the object region frame is between the first overlap ratio value and the second overlap ratio value, e.g., between 30% and 70%, or between 25% and 80%.
  • The numbers of positive, negative, and neutral samples can be controlled so that, among all the sample images, the proportion of labeled positive sample region frames in the total number of positive, negative, and neutral sample region frames is not less than a predetermined first ratio, the first ratio being greater than 50%; the proportion of labeled negative sample region frames in the total number of frames is not greater than a predetermined second ratio; and the proportion of labeled neutral sample region frames in the total number of frames is not greater than a predetermined third ratio, the third ratio being not greater than half of the sum of the first ratio and the second ratio.
  • Moderate use of neutral sample images helps to better distinguish between positive and negative samples, improving the robustness of the trained third subnet.
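  • The sketch below labels a candidate region frame as positive, negative, or neutral by its overlap (IoU) with the true value of an object region frame, in plain Python; the 0.75 and 0.25 thresholds are example values chosen from the ranges mentioned above.

```python
def iou(a, b):
    """Overlap ratio of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_sample(box, gt_box, first_ratio=0.75, second_ratio=0.25):
    overlap = iou(box, gt_box)
    if overlap >= first_ratio:     # not lower than the first overlap ratio value
        return "positive"
    if overlap <= second_ratio:    # not higher than the second overlap ratio value
        return "negative"
    return "neutral"               # in between the two ratio values

print(label_sample((0, 0, 10, 10), (0, 0, 10, 11)))  # IoU ~ 0.91 -> "positive"
```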
  • step S620 the sample image containing the target area frame label information is input to the deep convolutional neural network for target area frame detection, and the detection acquires a plurality of fusion feature maps.
  • the fusion feature map is acquired from the output ends of the plurality of third convolution layers.
  • step S620 can be performed by a processor invoking a corresponding instruction stored in the memory.
  • The frame fusion detection data of at least one point in the fused feature map may include, but is not limited to, coordinate data, position data, and size data corresponding to an object detection frame in the object detection frame set, and the prediction accuracy information may be confidence data of the frame fusion detection data, for example, a predicted accuracy.
  • both the first subnet and the second subnet are constructed as an Inception-BN network structure with better performance in object detection.
  • step S630 the target region frame data corresponding to each of the fusion feature maps is respectively acquired according to the frame fusion detection data and the prediction accurate information in the at least one of the fusion feature maps.
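  • As an illustration, the sketch below turns per-point frame detection data and prediction accuracy (confidence) data into target region frame data, assuming PyTorch and torchvision; the 0.5 confidence threshold and the use of non-maximum suppression (NMS) are assumptions, since no particular selection rule is prescribed here.

```python
import torch
from torchvision.ops import nms

def decode_fused_map(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    # boxes: (N, 4) frame detection data; scores: (N,) prediction accuracy (confidence) data
    keep = scores >= score_thresh
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, iou_thresh)     # drop heavily overlapping frames
    return boxes[keep], scores[keep]

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.3])
print(decode_fused_map(boxes, scores))  # keeps only the best of the two overlapping frames
```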
  • step S630 can be performed by a processor invoking a corresponding instruction stored in the memory.
  • Step S640 determining first difference data detected by the object frame according to the acquired target area frame data of the sample image and the target area frame labeling information.
  • the step S640 can be performed by a processor invoking a corresponding instruction stored in the memory.
  • the loss value or the deviation value may be calculated according to the target area frame data of the obtained sample image and the target area frame label information as the first difference data, as a basis for the subsequent training depth convolutional neural network.
  • step S650 the network parameters of the deep convolutional neural network are adjusted according to the first difference data.
  • step S650 can be performed by a processor invoking a corresponding instruction stored in the memory.
  • The processing of steps S640-S650 is similar to that of the foregoing steps S530-S540, and details are not described herein.
  • a sample image containing the target area frame labeling information is input into a deep convolutional neural network having a symmetric structure for target area frame detection, and detecting and acquiring a plurality of fusion feature maps; Acquiring, by the first subnet of the at least one downsampling layer, the plurality of first feature maps, and acquiring, by the second subnet having the at least one upsampling layer, the plurality of second feature maps, respectively, by the plurality of first feature maps and The second feature map is fused to obtain the fused feature map, and the target region frame data is obtained according to the plurality of fused feature maps.
  • The first difference data is determined according to the acquired target region frame data and the target region frame labeling information, and the network parameters of the deep convolutional neural network are then adjusted according to the first difference data. Because the fused feature maps of the deep convolutional neural network obtained through training characterize both the high-level semantic features of the image (such as layout and foreground/background information) and the low-level detail features (such as small-object information) well, the target region frame data of both large and small objects contained in the image can be effectively extracted based on these fused feature maps.
  • the deep convolutional neural network obtained by training can improve the accuracy and robustness of object detection.
  • FIG. 7 is a flowchart illustrating a training method of a neural network according to still another embodiment of the present application.
  • The deep convolutional neural network trained according to the above embodiment further includes a third subnet having a plurality of groups of fourth convolutional layers and a plurality of pooling layers; the plurality of groups of fourth convolutional layers respectively correspond to the third convolutional layers, the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolutional layers, and the input of each pooling layer includes the adjusted fused feature map and the data of the preliminary target region frame.
  • each set of fourth convolutional layers may include one or more convolutional layers, and each set of fourth convolutional layers may be coupled to the output of the aforementioned third convolutional layer to receive the fused feature map as an input.
  • Each pooling layer is disposed at an end of the corresponding fourth convolution layer, and receives the adjusted fusion feature map and the primary target area frame data as inputs.
  • the training of the third subnet in the deep convolutional neural network is primarily described.
  • The first subnet and the second subnet may be trained by the method of any of the foregoing embodiments, and then the fused feature maps obtained during the training of the first subnet and the second subnet are used to train the third subnet according to the method of this embodiment.
  • step S710 a plurality of merged feature maps acquired from sample images containing target region frame label information are acquired.
  • the plurality of merged feature maps are acquired from the sample image as described in the previous step S510 or S610.
  • step S710 can be performed by the processor invoking a corresponding instruction stored in the memory, or can be performed by the fused feature map detection module 1010 executed by the processor.
  • step S720 the target area box regression training operation is iteratively performed until the iteration satisfies the iteration termination condition.
  • the step S720 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a boxed regression iterative training module 1050 executed by the processor.
  • step S720 includes steps S721-S726.
  • In step S721, the current fused feature map is convolved by the fourth convolutional layer to obtain an adjusted fused feature map, thereby adjusting the current preliminary target region frame.
  • In step S722, according to the current preliminary target region frame data, the adjusted fused feature map is region-pooled by the pooling layer to obtain a new fused feature map.
  • the new fusion feature map contains adjustments to the primary target area frame and reflects the degree of response to the adjusted target area frame.
  • step S723 new preliminary target area frame data is acquired from the new fusion feature map.
  • The processing of steps S721-S723 is similar to that of the foregoing steps S410-S430, and details are not described herein.
  • the third subnet further has a fifth convolution layer disposed at the output of the pooling layer.
  • Step S723 specifically includes: performing normalized convolution on the new fused feature map through the fifth convolutional layer, and acquiring new preliminary target region frame data from the fused feature map that has undergone the normalized convolution.
  • In step S724, second difference data of object frame detection is determined according to the frame regression data between the unadjusted preliminary target region frame data and the new preliminary target region frame data, the new preliminary target region frame data, and the corresponding target region frame labeling information.
  • step S724 can be performed by a processor invoking a corresponding instruction stored in the memory.
  • A detection offset may be determined from the new preliminary target region frame data and the corresponding target region frame labeling information, and a loss value may be calculated as the second difference data according to the detection offset and the frame regression data (i.e., the frame movement/adjustment data).
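  • A minimal sketch of such a second-difference computation is given below, assuming PyTorch; applying a smooth L1 loss to the frame regression data against the offset implied by the labeled boxes is one plausible formulation and an assumption here, not necessarily the exact loss of this application.

```python
import torch
import torch.nn.functional as F

def second_difference_data(old_boxes, new_boxes, labeled_boxes):
    # frame regression data: how far the boxes were moved by this regression step
    regression = new_boxes - old_boxes
    # detection offset: how far they needed to move to reach the labeled boxes
    target = labeled_boxes - old_boxes
    return F.smooth_l1_loss(regression, target)

old_boxes = torch.tensor([[0., 0., 10., 10.]])    # unadjusted preliminary box
new_boxes = torch.tensor([[1., 1., 11., 11.]])    # new preliminary box
labeled = torch.tensor([[2., 2., 12., 12.]])      # target region frame labeling information
print(second_difference_data(old_boxes, new_boxes, labeled))  # tensor(0.5000)
```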
  • step S725 the network parameters of the third subnet are adjusted according to the second difference data.
  • step S725 can be performed by a processor invoking a corresponding instruction stored in the memory.
  • the determined second difference data is backhauled to the third subnet to adjust the network parameters of the third subnet.
  • step S726 it is determined whether the iteration termination condition is satisfied.
  • step S726 can be performed by a processor invoking a corresponding instruction stored in memory.
  • If it is determined in step S726 that the foregoing iteration satisfies the iteration termination condition (e.g., a predetermined number of iterations has been reached), the training of the third subnet ends; if it is determined in step S726 that the foregoing iteration does not satisfy the iteration termination condition (e.g., the predetermined number of iterations has not been reached), the process returns to step S721 to continue the foregoing training of the third subnet until it is determined that the iteration termination condition is satisfied.
  • In existing training of neural networks for object region frames, only a single target region frame regression is trained over a number of iterations (e.g., N iterations); according to the training method provided by the present application, multiple regressions are performed on the target region frame (e.g., M regressions), and each regression involves training over multiple iterations (e.g., N iterations), that is, M×N training iterations in total.
  • the third subnet thus trained is more accurate in performing the position detection of the object area frame.
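  • The sketch below shows this M regressions × N iterations schedule as a nested loop; `train_one_iteration`, standing for one pass of steps S721-S725, and the counts M and N are assumed placeholders.

```python
M, N = 3, 1000   # example values: 3 box regressions, each trained for 1000 iterations

def train_third_subnet(train_one_iteration):
    for regression_step in range(M):       # each regression of the target region frame...
        for iteration in range(N):         # ...is trained over N iterations
            train_one_iteration(regression_step, iteration)
    # M * N training iterations in total, versus N for a single-regression training

train_third_subnet(lambda m, n: None)      # placeholder training step
```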
  • the third subnet can be constructed using any convolutional neural network having the above structure.
  • The third subnet is constructed as a ResNet structure, which performs well in recently developed object detection techniques, to perform region pooling and normalized convolution.
  • The trained deep convolutional neural network further convolves each fused feature map of the sample image to adjust the preliminary target region frame data contained in the fused feature map, then performs region pooling to obtain a new fused feature map, and obtains new preliminary target region frame data from the new fused feature map; this adjustment and regression of the preliminary target region frame data can further improve the accuracy and robustness of object detection.
  • FIG. 8 is a block diagram showing the structure of an object detecting apparatus according to an embodiment of the present application.
  • the object detecting apparatus of the present embodiment includes a fusion feature map prediction module 810 and a target area frame prediction module 820.
  • the fusion feature map prediction module 810 is configured to obtain a plurality of fusion feature maps from the image to be processed by the depth convolutional neural network for target area frame detection; wherein the deep convolutional neural network comprises a first subnet and a second a subnet, the first subnet has at least one downsampling layer, and the second subnet has at least one upsampling layer; the fused feature map is obtained by the first feature map and the second feature map, and the first feature map is obtained from the first subnet It is obtained that the second feature map is obtained from the second subnet.
  • the target area frame prediction module 820 is configured to acquire target area frame data according to the plurality of fusion feature patterns acquired by the fusion feature map prediction module 810.
  • the object detecting device of the embodiment is used to implement the corresponding object detecting method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, and details are not described herein again.
  • FIG. 9 is a block diagram showing the structure of an object detecting apparatus according to another embodiment of the present application.
  • The second subnet is disposed at the end of the first subnet; the first subnet has a plurality of first convolutional layers and at least one downsampling layer, with the downsampling layer disposed between the plurality of first convolutional layers; the second subnet has a plurality of second convolutional layers and at least one upsampling layer, with the upsampling layer disposed between the plurality of second convolutional layers; the first convolutional layers and the second convolutional layers are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are respectively symmetrically disposed.
  • the first output branch for outputting the first feature map is provided on the at least one first convolution layer, and the second output branch for outputting the second feature map is provided on the second convolution layer .
  • the second subnet further has a plurality of third convolution layers
  • the input of the third convolution layer comprises a first output branch and the second output branch.
  • the fusion feature map prediction module 810 is configured to respectively obtain the fusion feature map from the output ends of the plurality of third convolution layers.
  • At least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to the plurality of object detection frames.
  • the target area frame prediction module 820 is configured to separately acquire target area frame data corresponding to each of the fusion feature patterns according to the frame fusion detection data and the prediction accurate information in the at least one fusion feature map.
  • the target area frame prediction module 820 is configured to respectively obtain the primary target area frame data corresponding to the respective fusion feature maps; and iteratively perform the following object area frame regression operation until the iteration satisfies the iterative termination condition: by adjusting the fusion feature map Obtaining new primary target area frame data from the adjusted fusion feature map; and using the iteratively obtained primary target area frame data as the target area frame data in the image to be processed.
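  • A minimal sketch of this iterative region frame regression at inference time is given below; `refine`, standing for one pass of the third subnet (see the earlier refinement sketch), the additive offset decoding, and the fixed iteration count used as the termination condition are all assumptions.

```python
import torch

def apply_deltas(boxes, deltas):
    # decode predicted offsets by shifting box corners (a simple additive encoding)
    return boxes + deltas

def iterative_regression(fused_map, boxes, refine, num_iters=3):
    for _ in range(num_iters):                # termination condition: fixed number of regressions
        deltas = refine(fused_map, [boxes])   # one pass of the third subnet
        boxes = apply_deltas(boxes, deltas)   # new preliminary target region frames
    return boxes                              # final target region frame data

boxes = torch.tensor([[0., 0., 10., 10.]])
result = iterative_regression(None, boxes, refine=lambda fm, b: torch.full((1, 4), 0.5))
print(result)  # each corner shifted by 1.5 after three passes
```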
  • the deep convolutional neural network further includes a third subnet, the third subnet has a plurality of fourth convolution layers and a plurality of pooling layers, and the plurality of fourth convolution layers respectively correspond to the third convolutional layer
  • the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolution layers, and the input of each pooling layer includes the adjusted fusion feature map and the data of the primary target area frame.
  • the target area frame prediction module 820 includes:
  • the frame adjustment unit 821 is configured to convolve the current fusion feature map by using the fourth convolution layer to obtain an adjustment fusion feature map.
  • the area pooling unit 822 is configured to perform regional pooling on the adjusted fusion feature map according to the current primary selection target area frame data to obtain a new fusion feature map.
  • the primary box obtaining unit 823 is configured to acquire new primary target area frame data from the new merged feature map.
  • the third subnet further has a fifth convolution layer disposed at the output end of the pooling layer; correspondingly, the primary selection frame obtaining unit 823 is configured to perform normalized convolution on the new fusion feature map by the fifth convolutional layer. And obtaining new primary target area frame data from the normalized convolutional fusion feature map.
  • The first subnet and the second subnet are both Inception-BN (Inception with Batch Normalization) network structures
  • the third subnet is a residual network (ResNet) structure.
  • the object detecting device of the embodiment is used to implement the corresponding object detecting method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, and details are not described herein again.
  • FIG. 10 is a block diagram showing the structure of a training apparatus for a neural network according to an embodiment of the present application.
  • the training apparatus of the neural network of the present embodiment further includes a fusion feature map detection module 1010, a target area frame detection module 1020, a first difference acquisition module 1030, and a first network training module 1040.
  • the fusion feature map detecting module 1010 is configured to input a sample image containing the target area frame labeling information into a deep convolutional neural network for target area frame detection, and detect and acquire a plurality of fusion feature maps, wherein the deep convolutional neural network includes the first subnet. And the second subnet, the first subnet has at least one downsampling layer, and the second subnet has at least one upsampling layer; the fusion feature map is obtained by the first feature map and the second feature map, the first feature map is from the first The subnet is obtained, and the second feature map is obtained from the second subnet.
  • the target area frame detecting module 1020 is configured to acquire target area frame data of the sample image according to the multiple fusion feature maps.
  • the first difference obtaining module 1030 is configured to determine, according to the target area frame data of the acquired sample image and the target area frame labeling information, the first difference data detected by the object frame.
  • the first network training module 1040 is configured to adjust network parameters of the deep convolutional neural network according to the first difference data.
  • the training device of the neural network in this embodiment is used to implement the training method of the corresponding neural network in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, and details are not described herein again.
  • FIG. 11 is a block diagram showing the structure of a training apparatus of a neural network according to another embodiment of the present application.
  • the second subnet is disposed at the end of the first subnet, the first subnet having a plurality of first convolutional layers and at least one downsampling a layer, a downsampling layer is disposed between the plurality of first convolutional layers, the second subnet has a plurality of second convolutional layers and at least one upsampling layer, and the upsampling layer is disposed between the plurality of second convolutional layers
  • the first convolution layer and the second convolution layer are symmetrically disposed, and the at least one downsampling layer and the at least one upsampling layer are symmetrically disposed, respectively.
  • the first output branch for outputting the first feature map is provided on the at least one first convolution layer, and the second output branch for outputting the second feature map is provided on the second convolution layer .
  • the second subnet further has a plurality of third convolution layers, and the input of the third convolution layer comprises a first output branch and a second output branch.
  • the fusion feature map detection module 1010 is configured to respectively obtain the fusion feature map from the output ends of the plurality of third convolution layers.
  • At least one point in the fusion feature map has frame fusion detection data and prediction accuracy information corresponding to the plurality of object detection frames.
  • the deep convolutional neural network further includes a third subnet, the third subnet has a plurality of fourth convolution layers and a plurality of pooling layers, and the plurality of fourth convolution layers respectively correspond to the third convolutional layer
  • the plurality of pooling layers respectively correspond to the plurality of groups of fourth convolution layers, and the input of each pooling layer includes the adjusted fusion feature map and the data of the primary target area frame.
  • The apparatus further includes a box regression iterative training module 1050, configured to iteratively perform the following target region frame regression training operation until the iteration satisfies an iteration termination condition: convolving the current fused feature map through the fourth convolutional layer to obtain an adjusted fused feature map; performing region pooling on the adjusted fused feature map through the pooling layer according to the current preliminary target region frame data to obtain a new fused feature map; obtaining new preliminary target region frame data from the new fused feature map; determining second difference data of object frame detection according to the frame regression data between the unadjusted preliminary target region frame data and the new preliminary target region frame data, the new preliminary target region frame data, and the corresponding target region frame labeling information; and adjusting the network parameters of the third subnet according to the second difference data.
  • the third subnet further has a fifth convolution layer disposed at the output of the pooling layer; correspondingly, the box regression iterative training module 1050 is configured to perform normalized convolution on the new fused feature map by the fifth convolutional layer. And acquiring the new primary target area frame data from the normalized convolutional fusion feature map.
  • The apparatus further includes a pre-processing module 1060, configured to scale the sample image, before the sample image containing the target region frame labeling information is input into the deep convolutional neural network for target region frame detection to acquire the plurality of fused feature maps, such that the true value of at least one object region frame is covered by an object detection frame.
  • The target region frame labeling information of the sample image includes labeling information of positive sample region frames and labeling information of negative sample region frames; the overlap ratio between a positive sample region frame and the true value of an object region frame is not lower than a first overlap ratio value, the overlap ratio between a negative sample region frame and the true value of an object region frame is not higher than a second overlap ratio value, and the first overlap ratio value is greater than the second overlap ratio value.
  • the target area frame label information of the sample image further includes label information of the neutral sample area frame, and the overlap ratio of the true value of the neutral sample area frame and the object area frame is at the first overlap ratio value and the second overlap ratio value. between.
  • Among all the sample images, the proportion of labeled positive sample region frames in the total number of positive sample region frames, negative sample region frames, and neutral sample region frames is not less than a predetermined first ratio, the first ratio being greater than 50%; the proportion of labeled negative sample region frames in the total number of frames is not greater than a predetermined second ratio; and the proportion of labeled neutral sample region frames in the total number of frames is not greater than a predetermined third ratio, the third ratio being not greater than half of the sum of the first ratio and the second ratio.
  • The first subnet and the second subnet are both Inception-BN (Inception with Batch Normalization) network structures
  • the third subnet is a residual network structure.
  • the training device of the neural network in this embodiment is used to implement the training method of the corresponding neural network in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, and details are not described herein again.
  • an embodiment of the present application further provides an electronic device, including: a processor and a memory;
  • the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform an operation corresponding to the object detecting method according to any one of the foregoing embodiments of the present application; or
  • the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform an operation corresponding to the training method of the neural network described in any of the above embodiments of the present application.
  • the embodiment of the present application further provides another electronic device, including:
  • the processor and the object detecting device according to any one of the above embodiments of the present application; when the processor runs the object detecting device, the unit in the object detecting device according to any one of the above embodiments of the present application is operated; or
  • the processor and the training device of the neural network according to any one of the above embodiments of the present application; when the processor runs the training device of the neural network, the unit in the training device of the neural network according to any of the above embodiments of the present application Is being run.
  • FIG. 12 is a schematic structural diagram showing a first electronic device according to an embodiment of the present application.
  • the application also provides an electronic device, such as a mobile terminal, a personal computer (PC), a tablet computer, a server.
  • an electronic device such as a mobile terminal, a personal computer (PC), a tablet computer, a server.
  • PC personal computer
  • FIG. 12 a schematic structural diagram of a first electronic device 1200 suitable for implementing a terminal device or a server of an embodiment of the present application is shown.
  • The first electronic device 1200 includes, but is not limited to, one or more first processors and a first communication component, for example, one or more first central processing units (CPUs) 1201 and/or one or more first graphics processing units (GPUs) 1213; the first processor may perform various appropriate actions and processes according to executable instructions stored in a first read-only memory (ROM) 1202 or executable instructions loaded from a first storage portion 1208 into a first random access memory (RAM) 1203.
  • the first communication component includes a first communication component 1212 and a first communication interface 1209.
  • the first communication component 1212 can include, but is not limited to, a network card, and the network card can include, but is not limited to, an IB (Infiniband) network card.
  • The first communication interface 1209 includes a communication interface of a network interface card such as a LAN card or a modem, and the first communication interface 1209 performs communication processing via a network such as the Internet.
  • The first processor may communicate with the first read-only memory 1202 and/or the first random access memory 1203 to execute the executable instructions, is connected to the first communication component 1212 via the first bus 1204, and communicates with other target devices via the first communication component 1212, so as to complete operations corresponding to any method provided by the embodiments of the present application, for example: obtaining a plurality of fused feature maps from an image to be processed through a deep convolutional neural network used for target region frame detection, where the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, and the fused feature map is obtained from a first feature map obtained from the first subnet and a second feature map obtained from the second subnet; and obtaining target region frame data according to the plurality of fused feature maps.
  • In the first RAM 1203, various programs and data required for operation of the apparatus may also be stored.
  • the first CPU 1201, the first ROM 1202, and the first RAM 1203 are connected to each other through the first bus 1204.
  • the first ROM 1202 is an optional module.
  • the first RAM 1203 stores executable instructions, or writes executable instructions to the first ROM 1202 at runtime, the executable instructions causing the first processor 1201 to perform operations corresponding to the above-described communication methods.
  • a first input/output (I/O) interface 1205 is also coupled to the first bus 1204.
  • the first communication component 1212 can be integrated or can be configured to have multiple sub-modules (eg, multiple IB network cards) and be on a bus link.
  • The following components are connected to the first I/O interface 1205: a first input portion 1206 including, but not limited to, a keyboard and a mouse; a first output portion 1207 including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a first storage portion 1208 including, but not limited to, a hard disk; and a first communication interface 1209 including, but not limited to, a network interface card such as a LAN card or a modem.
  • the first driver 1210 is also connected to the first I/O interface 1205 as needed.
  • the first removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, is mounted on the first drive 1210 as needed so that a computer program read therefrom is installed into the first storage portion 1208 as needed.
  • FIG. 12 is only an optional implementation manner.
  • The number and types of the components in FIG. 12 may be selected, deleted, added, or replaced according to actual needs; different functional components may be implemented separately or in an integrated manner, for example, the GPU and the CPU may be provided separately or the GPU may be integrated on the CPU, and the first communication component 1212 may be provided separately or integrated on the CPU or the GPU.
  • These alternative embodiments are all within the scope of the present application.
  • Embodiments of the present application include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: executable code for obtaining a plurality of fused feature maps from an image to be processed through a deep convolutional neural network used for target region frame detection, where the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, and the fused feature map is obtained from a first feature map obtained from the first subnet and a second feature map obtained from the second subnet; and executable code for obtaining target region frame data according to the plurality of fused feature maps.
  • the computer program can be downloaded and installed from the network via a communication component, and/or installed from the first removable media 1211.
  • When the computer program is executed by the first central processing unit (CPU) 1201, the above functions defined in the method of the present application are performed.
  • The electronic device acquires a plurality of fused feature maps from the image to be processed through a deep convolutional neural network used for target region frame detection, where a plurality of first feature maps are acquired from the first subnet having at least one downsampling layer, a plurality of second feature maps are acquired from the second subnet having at least one upsampling layer, and the fused feature maps are obtained by respectively fusing the plurality of first feature maps with the plurality of second feature maps; the target region frame data is then acquired according to the plurality of fused feature maps. Based on these fused feature maps, the target region frame data of both large and small objects contained in the image can be effectively extracted, thereby improving the accuracy and robustness of object detection.
  • FIG. 13 is a schematic structural diagram showing a second electronic device according to another embodiment of the present application.
  • the application also provides an electronic device, such as a mobile terminal, a personal computer (PC), a tablet, a server.
  • an electronic device such as a mobile terminal, a personal computer (PC), a tablet, a server.
  • PC personal computer
  • a tablet a server
  • Referring to FIG. 13, a schematic structural diagram of a second electronic device 1300 suitable for implementing a terminal device or a server of an embodiment of the present application is shown.
  • The second electronic device 1300 includes, but is not limited to, one or more second processors and a second communication component, for example, one or more second central processing units (CPUs) 1301 and/or one or more second graphics processing units (GPUs) 1313; the second processor may perform various appropriate actions and processes according to executable instructions stored in a second read-only memory (ROM) 1302 or executable instructions loaded from a second storage portion 1308 into a second random access memory (RAM) 1303.
  • the second communication component includes a second communication component 1312 and a second communication interface 1309.
  • The second communication component 1312 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card; the second communication interface 1309 includes a communication interface of a network interface card such as a LAN card or a modem, and the second communication interface 1309 performs communication processing via a network such as the Internet.
  • The second processor may communicate with the second read-only memory 1302 and/or the second random access memory 1303 to execute the executable instructions, is connected to the second communication component 1312 via the second bus 1304, and communicates with other target devices via the second communication component 1312, so as to complete operations corresponding to any method provided by the embodiments of the present application, for example: inputting a sample image containing target region frame labeling information into a deep convolutional neural network used for target region frame detection, and acquiring a plurality of fused feature maps through detection, where the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, and the fused feature map is obtained from a first feature map obtained from the first subnet and a second feature map obtained from the second subnet; acquiring target region frame data of the sample image according to the plurality of fused feature maps; determining first difference data of object frame detection according to the acquired target region frame data of the sample image and the target region frame labeling information; and adjusting network parameters of the deep convolutional neural network according to the first difference data.
  • In the second RAM 1303, various programs and data required for operation of the apparatus may also be stored.
  • the second CPU 1301, the second ROM 1302, and the second RAM 1303 are connected to each other through the second bus 1304.
  • the second ROM 1302 is an optional module.
  • the second RAM 1303 stores executable instructions, or writes executable instructions to the second ROM 1302 at runtime, the executable instructions causing the second processor 1301 to perform operations corresponding to the above-described communication methods.
  • a second input/output (I/O) interface 1305 is also coupled to the second bus 1304.
  • the second communication component 1312 can be integrated or can be configured to have multiple sub-modules (eg, multiple IB network cards) and be on the bus link.
  • The following components are connected to the second I/O interface 1305: a second input portion 1306 including, but not limited to, a keyboard and a mouse; a second output portion 1307 including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a second storage portion 1308 including, but not limited to, a hard disk; and a second communication interface 1309 including a network interface card such as a LAN card or a modem.
  • the second driver 1310 is also connected to the second I/O interface 1305 as needed.
  • a second detachable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, is mounted on the second drive 1310 as needed so that a computer program read therefrom is installed into the second storage portion 1308 as needed.
  • FIG. 13 is only an optional implementation manner.
  • The number and types of the components in FIG. 13 may be selected, deleted, added, or replaced according to actual needs; different functional components may be implemented separately or in an integrated manner, for example, the GPU and the CPU may be provided separately or the GPU may be integrated on the CPU, and the second communication component 1312 may be provided separately or integrated on the CPU or the GPU.
  • Embodiments of the present application include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: executable code for inputting a sample image containing target region frame labeling information into a deep convolutional neural network used for target region frame detection and acquiring a plurality of fused feature maps through detection, where the deep convolutional neural network includes a first subnet having at least one downsampling layer and a second subnet having at least one upsampling layer, and the fused feature map is obtained from a first feature map obtained from the first subnet and a second feature map obtained from the second subnet; executable code for acquiring target region frame data of the sample image according to the plurality of fused feature maps; executable code for determining first difference data of object frame detection according to the acquired target region frame data of the sample image and the target region frame labeling information; and executable code for adjusting network parameters of the deep convolutional neural network according to the first difference data.
  • the computer program can be downloaded and installed from the network via a communication component and/or installed from the second removable media 1311.
  • When the computer program is executed by the second central processing unit (CPU) 1301, the above functions defined in the method of the embodiments of the present application are performed.
  • The electronic device inputs a sample image containing target region frame labeling information into a deep convolutional neural network used for target region frame detection and acquires a plurality of fused feature maps through detection, where a plurality of first feature maps are acquired from the first subnet having at least one downsampling layer, a plurality of second feature maps are acquired from the second subnet having at least one upsampling layer, and the fused feature maps are obtained by respectively fusing the plurality of first feature maps with the plurality of second feature maps; the target region frame data is then obtained according to the plurality of fused feature maps.
  • The first difference data is determined according to the acquired target region frame data and the target region frame labeling information, and the network parameters of the deep convolutional neural network are then adjusted according to the first difference data. Because the fused feature maps of the deep convolutional neural network obtained through training characterize both the high-level semantic features of the image (such as layout and foreground/background information) and the low-level detail features (such as small-object information) well, the target region frame data of both large and small objects contained in the image can be effectively extracted based on these fused feature maps.
  • the deep convolutional neural network obtained by training can improve the accuracy and robustness of object detection.
  • An embodiment of the present application further provides a computer program including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the object detection method described in any embodiment of the present application, or executes instructions for implementing the steps of the training method of the neural network described in any embodiment of the present application.
  • An embodiment of the present application further provides a computer-readable storage medium configured to store computer-readable instructions; when the instructions are executed, the operations of the steps of the object detection method described in any embodiment of the present application, or the operations of the steps of the training method of the neural network described in any embodiment of the present application, are implemented.
  • the various embodiments in the specification are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts between the various embodiments may be referred to each other.
  • For the device embodiments, since they basically correspond to the method embodiments, the description is relatively simple, and for relevant parts, reference may be made to the description of the method embodiments.
  • the methods and apparatus of the present application may be implemented in a number of ways.
  • the methods and apparatus of the present application can be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present application are not limited to the order specifically described above unless otherwise specifically stated.
  • the present application can also be implemented as a program recorded in a recording medium, the programs including machine readable instructions for implementing the method according to the present application.
  • the present application also covers a recording medium storing a program for executing the method according to the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

An object detection method, a neural network training method, an apparatus, and an electronic device. The object detection method includes: obtaining a plurality of fused feature maps from an image to be processed through prediction by a deep convolutional neural network used for target region frame detection (S110), where a plurality of first feature maps are obtained from a first subnet having at least one downsampling layer, a plurality of second feature maps are obtained from a second subnet having at least one upsampling layer, and the fused feature maps are obtained by respectively fusing the plurality of first feature maps with the plurality of second feature maps; and obtaining target region frame data according to the plurality of fused feature maps (S120). Because these fused feature maps characterize both the high-level semantic features and the low-level detail features of the image well, the target region frame data of both large and small objects contained in the image can be effectively extracted based on them, thereby improving the accuracy and robustness of object detection.

Description

物体检测方法、神经网络的训练方法、装置和电子设备
本申请要求在2017年2月23日提交中国专利局、申请号为CN201710100676.1、发明名称为“物体检测方法、神经网络的训练方法、装置和电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术,尤其涉及一种物体检测方法和装置、神经网络的训练方法和装置、电子设备。
背景技术
目标区域框检测的目的是从图像检测出若干可能存在物体的矩形框。在目前常规的使用卷积神经网络执行检测的技术中,通过卷积神经网络中的池化层逐渐减小特征图的大小,从而最终确定可能存在物体的矩形框,这种网络结构被称作“缩小网络”(zoom-out structure)。
发明内容
本申请提供一种基于图像进行目标区域框检测的技术。
根据本申请实施例的一方面,提供一种物体检测方法,包括:通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图;其中,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;根据所述多个融合特征图获取目标区域框数据。
在本申请的一种实现方式中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
在本申请的一种实现方式中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
在本申请的一种实现方式中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;所述预测获取多个融合特征图包括:从所述多个第三卷积层的输出端分别获取所述融合特征图。
在本申请的一种实现方式中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息,所述根据所述多个融合特征图获取目标区域框数据包括:根据至少一个所述融合特征图中的框融合检测数据以及预测准确信息分别获取与所述融合特征图各自对应的目标区域框数据。
在本申请的一种实现方式中,所述根据所述多个融合特征图获取目标区域框数据包括:分别获取所述融合特征图各自对应的初选目标区域框数据;迭代地执行以下物体区域框回归操作,直到所述迭代满足迭代终止条件为止:通过调整所述融合特征图,从经过调整的融合特征图获取新的初选目标区域框数据;将经过所述迭代得到的所述初选目标区域框数据作为所述待处理的图像中的目标区域框数据。
在本申请的一种实现方式中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
在本申请的一种实现方式中,所述物体区域框回归操作包括:通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;从所述新的融合特征图获取所述新的初选目标区域框数据。
在本申请的一种实现方式中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,所述从所述新的融合特征图获取所述新的初选目标区域框数据包括:通过所述第五卷积层对所述新的融合特征图进行规范化卷积,从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
在本申请的一种实现方式中,所述第一子网和所述第二子网均为认知―样本归一化(Inception-BN)网络结构,所述第三子网为残差网络(ResNet)结构。
根据本申请的第二方面,提供一种神经网络的训练方法,包括:将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;根据所述多个融合特征图获取所述样本图像的目标区域框数据;根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据;根据所述第一差异数据调整所述深度卷积神经网络的网络参数。
在本申请的一种实现方式中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
在本申请的一种实现方式中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
在本申请的一种实现方式中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;所述检测获取多个融合特征图包括:从所述多个第三卷积层的输出端分别获取所述融合特征图。
在本申请的一种实现方式中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。
在本申请的一种实现方式中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
在本申请的一种实现方式中,所述方法还包括:迭代地执行以下目标区域框回归训练操作,直到所述迭代满足迭代终止条件为止:通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;从所述新的融合特征图获取所述新的初选目标区域框数据;根据未经过调整的初选目标区域框数据和新的初选目标区域框数据之间的框回归数据、新的初选目标区域框数据和相应的目标区域框标注信息确定物体框检测的第二差异数据;根据所述第二差异数据调整所述第三子网的网络参数。
在本申请的一种实现方式中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,所述从所述新的融合特征图获取所述新的初选目标区域框数据包括:通过所述第五卷积层对所述新的融合特征图进行规范化卷积,从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
在本申请的一种实现方式中,在将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图之前,所述方法还包括:缩放所述样本图像,使得至少一个物体区域框的真值被物体探测框覆盖。
在本申请的一种实现方式中,所述样本图像的目标区域框标注信息包括正样本区域框的标注信息和负样本区域框的标注信息;所述正样本区域框与物体区域框的真值的重叠率不低于第一重叠比率值,所述负样本区域框与物体区域框的真值的重叠率不高于第二重叠比率值,所述第一重叠比率值大于所述第二重叠比率值。
在本申请的一种实现方式中,所述样本图像的目标区域框标注信息还包括中性样本区域框的标注信息,所述中性样本区域框与物体区域框的真值的重叠率在所述第一重叠比率值和所述第二重叠比率值之间。
在本申请的一种实现方式中,在全部所述样本图像当中,标注的正样本区域框的总和在所述正样本区域框、负样本区域框以及中性样本区域框的框总数中的占比不小于预定的第一比值,所述第一比值大于50%; 标注的负样本区域框的总和在框总数中的占比不大于预定的第二比值;标注的中性样本区域框的总和在框总数中的占比不大于预定的第三比值,所述第三比例不大于第一比值和第二比值之和的一半。
在本申请的一种实现方式中,所述第一子网和所述第二子网均为认知―样本归一化网络结构,所述第三子网为残差网络结构。
根据本申请的第三方面,提供一种物体检测装置,包括:融合特征图预测模块,用于通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图;其中,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;目标区域框预测模块,用于根据所述融合特征图预测模块获取的多个融合特征图获取目标区域框数据。
在本申请的一种实现方式中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
在本申请的一种实现方式中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
在本申请的一种实现方式中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;所述融合特征图预测模块用于从所述多个第三卷积层的输出端分别获取所述融合特征图。
在本申请的一种实现方式中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息,所述目标区域框预测模块用于根据至少一个所述融合特征图中的框融合检测数据以及预测准确信息分别获取与所述融合特征图各自对应的目标区域框数据。
在本申请的一种实现方式中,所述目标区域框预测模块用于:分别获取所述融合特征图各自对应的初选目标区域框数据;迭代地执行以下物体区域框回归操作,直到所述迭代满足迭代终止条件为止:通过调整所述融合特征图,从经过调整的融合特征图获取新的初选目标区域框数据;将经过所述迭代得到的所述初选目标区域框数据作为所述待处理的图像中的目标区域框数据。
在本申请的一种实现方式中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
在本申请的一种实现方式中,所述目标区域框预测模块包括:框调整单元,用于通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;区域池化单元,用于根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;初选框获取单元,用于从所述新的融合特征图获取所述新的初选目标区域框数据。
在本申请的一种实现方式中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,所述初选框获取单元用于通过所述第五卷积层对所述新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
在本申请的一种实现方式中,所述第一子网和所述第二子网均为认知―样本归一化(Inception-BN)网络结构,所述第三子网为残差网络(ResNet)结构。
根据本申请的第四方面,提供一种神经网络的训练装置,包括:融合特征图检测模块,用于将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;目标区域框检测模块,用于根据所述多个融合特征图获取所述样本图像的目标区域框数据;第一差异获取模块,用于根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据;第一网络训练模块,用于根据所述第一差异数据调整所述深度卷积神经网络的网络参数。
在本申请的一种实现方式中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
在本申请的一种实现方式中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
在本申请的一种实现方式中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;所述融合特征图检测模块用于从所述多个第三卷积层的输出端分别获取所述融合特征图。
在本申请的一种实现方式中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。
在本申请的一种实现方式中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
在本申请的一种实现方式中,所述装置还包括:框回归迭代训练模块,用于迭代地执行以下目标区域框回归训练操作,直到所述迭代满足迭代终止条件为止:通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;从所述新的融合特征图获取所述新的初选目标区域框数据;根据未经过调整的初选目标区域框数据和新的初选目标区域框数据之间的框回归数据、新的初选目标区域框数据和相应的目标区域框标注信息确定物体框检测的第二差异数据;根据所述第二差异数据调整所述第三子网的网络参数。
在本申请的一种实现方式中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,所述框回归迭代训练模块用于通过所述第五卷积层对所述新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
在本申请的一种实现方式中,所述装置还包括:预处理模块,用于在将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图之前,缩放所述样本图像,使得至少一个物体区域框的真值被物体探测框覆盖。
在本申请的一种实现方式中,所述样本图像的目标区域框标注信息包括正样本区域框的标注信息和负样本区域框的标注信息;所述正样本区域框与物体区域框的真值的重叠率不低于第一重叠比率值,所述负样本区域框与物体区域框的真值的重叠率不高于第二重叠比率值,所述第一重叠比率值大于所述第二重叠比率值。
在本申请的一种实现方式中,所述样本图像的目标区域框标注信息还包括中性样本区域框的标注信息,所述中性样本区域框与物体区域框的真值的重叠率在所述第一重叠比率值和所述第二重叠比率值之间。
在本申请的一种实现方式中,在全部所述样本图像当中,标注的正样本区域框的总和在所述正样本区域框、负样本区域框以及中性样本区域框的框总数中的占比不小于预定的第一比值,所述第一比值大于50%;标注的负样本区域框的总和在框总数中的占比不大于预定的第二比值;标注的中性样本区域框的总和在框总数中的占比不大于预定的第三比值,所述第三比值不大于第一比值和第二比值之和的一半。
在本申请的一种实现方式中,所述第一子网和所述第二子网均为认知―样本归一化网络结构,所述第三子网为残差网络结构。
根据本申请的第五方面,提供了一种电子设备,包括:
处理器和存储器;
所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行本申请任一实施例所述的物体检测方法对应的操作;或者,所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行本申请任一实施例所述的神经网络的训练方法对应的操作。
根据本申请的第六方面,提供了另一种电子设备,包括:
处理器和本申请任一实施例所述的物体检测装置;在处理器运行所述物体检测装置时,本申请任一实 施例所述的物体检测装置中的单元被运行;或者
处理器和本申请任一实施例所述的神经网络的训练装置;在处理器运行所述神经网络的训练装置时,本申请任一实施例所述的神经网络的训练装置中的单元被运行。
根据本申请的第七方面,提供了一种计算机程序,包括计算机可读代码,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现本申请任一实施例所述的物体检测方法中各步骤的指令;或者
当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现本申请任一实施例所述的神经网络的训练方法中各步骤的指令。
根据本申请的第八方面,提供了一种计算机可读存储介质,用于存储计算机可读取的指令,所述指令被执行时实现本申请任一实施例所述的物体检测方法中各步骤的操作、或者本申请任一实施例所述的神经网络的训练方法中各步骤的操作。
根据本申请提供的物体检测方案、神经网络的训练技术方案,通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图,其中,从具有至少一个下采样层的第一子网获取多个第一特征图,从具有至少一个上采样层的第二子网获取多个第二特征图,分别由多个第一特征图和多个第二特征图融合得到融合特征图。此后,再根据所述多个融合特征图获取目标区域框数据。由于这些融合特征图较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息),因此根据这些融合特征图能够有效地提取到图像中包含的大小物体的目标区域框数据,从而提高物体检测的准确性和鲁棒性。
下面通过附图和实施例,对本申请的技术方案做进一步的详细描述。
附图说明
构成说明书的一部分的附图描述了本申请的实施例,并且连同描述一起用于解释本申请的原理。
参照附图,根据下面的详细描述,可以更加清楚地理解本申请,其中:
图1是示出根据本申请一实施例的物体检测方法的流程图;
图2是示出根据本申请另一实施例的物体检测方法的流程图;
图3示出根据本申请实施例的深度卷积神经网络的一种示例性结构;
图4是示出根据本申请又一实施例的物体检测方法的流程图;
图5是示出根据本申请一实施例的神经网络的训练方法的流程图;
图6是示出根据本申请另一实施例的神经网络的训练方法的流程图;
图7是示出根据本申请又一实施例的神经网络的训练方法的流程图;
图8是示出根据本申请一实施例的物体检测装置的结构框图;
图9是示出根据本申请另一实施例的物体检测装置的结构框图;
图10是示出根据本申请一实施例的神经网络的训练装置的结构框图;
图11是示出根据本申请另一实施例的神经网络的训练装置的结构框图;
图12是示出根据本申请一实施例的第一电子设备的结构示意图;
图13是示出根据本申请另一实施例的第二电子设备的结构示意图。
具体实施方式
现在将参照附图来详细描述本申请的各种示例性实施例。应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本申请的范围。
同时,应当明白,为了便于描述,附图中所示出的各个部分的尺寸并不是按照实际的比例关系绘制的。
以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本申请及其应用或使用的任何限制。
对于相关领域普通技术人员已知的技术、方法和设备可能不作详细讨论,但在适当情况下,所述技术、方法和设备应当被视为说明书的一部分。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。
本申请可以应用于计算机系统/服务器,其可与众多其它通用或专用计算系统环境或配置一起操作。适于与计算机系统/服务器一起使用的众所周知的计算系统、环境和/或配置的例子包括但不限于:个人计算机系统、服务器计算机系统、瘦客户机、厚客户机、手持或膝上设备、基于微处理器的系统、机顶盒、可编程消费电子产品、网络个人电脑、小型计算机系统、大型计算机系统和包括上述任何系统的分布式云计算技术环境。
计算机系统/服务器可以在由计算机系统执行的计算机系统可执行指令(诸如程序模块)的一般语境下描述。通常,程序模块可以包括但不限于例程、程序、目标程序、组件、逻辑、数据结构,它们执行特定的任务或者实现特定的抽象数据类型。计算机系统/服务器可以在分布式云计算环境中实施,分布式云计算环境中,任务是由通过通信网络链接的远程处理设备执行的。在分布式云计算环境中,程序模块可以位于包括存储设备的本地或远程计算系统存储介质上。
通过现有技术提供的网络结构执行目标区域框检测,不能够有效地利用从卷积神经网络中的高层得到的特征图中的信息协助处理网络底层的信息,使得从网络获取到的特征数据不具有足够的代表性和鲁棒性,不利于小物体的检测。
下面结合图1-图13对本公开提供的物体检测技术方案进行说明。本公开提供的任一种物体检测技术方案可由软件、硬件或者软硬件结合的方式实现。例如,本公开提供的物体检测技术方案可由某一电子设备实施或者由某一处理器实施,本公开并不限制,所述电子设备可包括但不限于终端或服务器,所述处理器可包括但不限于CPU或GPU。以下不再赘述。
图1是示出根据本申请一实施例的物体检测方法的流程图。
参照图1,本实施例的物体检测方法包括以下步骤:
步骤S110,通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图。
在一个可选示例中,步骤S110可以由处理器调用存储器存储的指令执行或者由被处理器运行的融合特征图预测模块810执行。
其中,深度卷积神经网络包括第一子网和第二子网,第一子网具有至少一个下采样层,第二子网具有至少一个上采样层。融合特征图通过第一特征图和第二特征图得到,第一特征图从第一子网获取得到,第二特征图从第二子网获取得到。
本公开上述实施例中待处理的图像是拍摄有一个或多个物体对象的照片或视频帧图像。该图像应满足一定的分辨率要求,至少通过肉眼能够辨别出拍摄到的物体对象。
用于目标区域框检测的深度卷积神经网络中的第一子网通过对待处理的图像进行卷积、池化,可在第一子网不同深度的多个卷积层获取该图像的第一特征图,这些第一特征图表征不同大小程度的区域框的特征。在设置有至少一个下采样层的第一子网中,在较浅的卷积层获得的第一特征图能够较好地表达图像的细节,较浅的卷积层通常指深度卷积神经网络中位置靠前的卷积层,但是难以区分前景和背景;而在较深的卷积层获得的第一特征图能够较好地提取图像中的整体语义特征,较深的卷积层通常指深度卷积神经网络中位置靠后的卷积层,但是将损失图像的细节信息,如小物体信息。
具有至少一个上采样层的第二子网进一步对从第一子网末端获取到的第一特征图执行相反的处理,即反卷积、上采样和池化操作,将从第一子网末端获取到的第一特征图逐步放大,在第二子网不同深度的多个卷积层获取与前述第一特征图相应的第二特征图。由于第二特征图均由经过卷积、下采样的第一特征图进行反卷积和上采样,在此过程中,高层语义特征被逐步反卷积并与低层细节特征结合,可协助识别小物体(小物体的区域框)。
由此,通过第一子网和第二子网执行的图像处理途径形成一个沙漏形的结构,从第一子网的第一卷积层生成的第一特征图通过下采样逐步变小;第一子网末端生成的第一特征图通过第二子网的第二卷积层和上采样层被逐步放大。
在此基础上,将至少一个第一特征图与相应的第二特征图进行融合,得到多个融合特征图,这些融合特征图可较好地表征图像中高层的语义特征和低层的细节特征,以用于识别不同大小的物体区域框。
步骤S120,根据多个融合特征图获取目标区域框数据。
在一个可选示例中,步骤S120可以由处理器调用存储器存储的指令执行或者由被处理器运行的目标区域框预测模块820执行。
具体地,可从至少一个融合特征图提取目标区域框数据,再将从至少一个融合特征图提取的目标区域框数据整合,作为从待处理的图像检测到的目标区域框数据。
根据本申请实施例的物体检测方法,通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图,其中,从具有至少一个下采样层的第一子网获取多个第一特征图,从具有至少一个上采样层的第二子网获取多个第二特征图,分别由多个第一特征图和多个第二特征图融合得到融合特征图。此后,再根据所述多个融合特征图获取目标区域框数据。由于这些融合特征图较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息),因此根据这些融合特征图能够有效地提取到图像中包含的大小物体的目标区域框数据,从而提高物体检测的准确性和鲁棒性。
图2是示出根据本申请另一实施例的物体检测方法的流程图。
参照图2,在步骤S210,通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图。
在一个可选示例中,步骤S210可以由处理器调用存储器存储的指令执行或者由被处理器运行的融合特征图预测模块810执行。
具体地,在该深度卷积神经网络中,第一子网具有多个第一卷积层和至少一个下采样层,下采样层设置在多个第一卷积层之间;第二子网具有多个第二卷积层和至少一个上采样层,上采样层设置在多个第二卷积层之间。第二子网设置在第一子网的末端,第一卷积层和第二卷积层对称设置,至少一个下采样层和至少一个上采样层分别对称地设置。
可在第一子网中不同深度的多个第一卷积层获取该图像的多个第一特征图,在第二子网中与前述多个第一卷积层对称设置的第二卷积层获取该图像的多个第二特征图。
可选地,在至少一个第一卷积层设有用于输出第一特征图的第一输出分支,在第二卷积层设有用于输出第二特征图的第二输出分支。
根据本申请的一种可选实施方式,第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支。相应地,从所述多个第三卷积层的输出端分别获取所述融合特征图。
可使用任何具有上述结构的深度卷积神经网络。可选地,将第一子网和第二子网均构建为在物体检测中性能较佳的认知―样本归一化(Inception-BN)网络结构。Inception-BN网络结构擅长于从图像中提取不同的结构/模式(pattern),适合执行第一子网和第二子网的任务功能。
图3示出根据本公开实施例的深度卷积神经网络的一种示例性结构。
参照图3,根据本实施例的深度卷积神经网络包括第一子网SN1和第二子网SN2。其中,第一子网SN1具有多个第一卷积层C1和设置在多个第一卷积层C1之间的至少一个下采样层P1,第二子网SN2具有多个第二卷积层C2和设置在多个第二卷积层C2之间的至少一个上采样层P2,其中,下采样层P1和上采样层P2对称地设置,多个第一卷积层C1和多个第二卷积层C2也对称地设置。此外,至少一个第一卷积层C1设置有第一输出分支F1,至少一个第二卷积层C2设置有第二输出分支F2。第二子网SN2还设有多个第三卷积层C3,自多个第三卷积层C3输出融合特征图。
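为便于理解图3所示的沙漏形结构,下面给出一段基于PyTorch的示意性代码草图,示出第一子网(第一卷积层C1与下采样层P1)、第二子网(第二卷积层C2与上采样层P2)以及第三卷积层C3产生多个融合特征图的一种可能实现方式;其中的通道数、层数、激活函数以及上/下采样方式均为示例性假设,并非对本申请实现方式的限定:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HourglassFusionNet(nn.Module):
    """示意性的沙漏形网络:第一子网逐步下采样,第二子网逐步上采样,
    第三卷积层C3将对称位置的第一特征图(F1)与第二特征图(F2)融合。"""

    def __init__(self, in_ch=3, ch=64, num_levels=3):
        super().__init__()
        # 第一子网:多个第一卷积层C1,C1之间设置下采样层P1(此处用max_pool2d实现)
        self.c1 = nn.ModuleList(
            [nn.Conv2d(in_ch if i == 0 else ch, ch, 3, padding=1) for i in range(num_levels)])
        # 第二子网:多个第二卷积层C2,C2之间设置上采样层P2(此处用最近邻插值实现)
        self.c2 = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1) for _ in range(num_levels)])
        # 多个第三卷积层C3:输入为第一输出分支F1与第二输出分支F2的拼接,输出融合特征图
        self.c3 = nn.ModuleList([nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(num_levels)])

    def forward(self, x):
        f1_list = []                               # 各深度的第一特征图(第一输出分支F1)
        for conv in self.c1:
            x = F.relu(conv(x))
            f1_list.append(x)
            x = F.max_pool2d(x, kernel_size=2)     # 下采样层P1:特征图逐步变小
        fused_maps = []
        for i, conv in enumerate(self.c2):
            x = F.interpolate(x, scale_factor=2, mode='nearest')  # 上采样层P2:特征图逐步放大
            x = F.relu(conv(x))                    # 第二特征图(第二输出分支F2)
            f1 = f1_list[-(i + 1)]                 # 与之对称的第一特征图
            fused_maps.append(self.c3[i](torch.cat([f1, x], dim=1)))  # C3输出融合特征图
        return fused_maps

# 用法示例(假设输入尺寸可被2的num_levels次方整除):
# net = HourglassFusionNet()
# fused = net(torch.randn(1, 3, 256, 256))   # 得到多个不同分辨率的融合特征图
```

该草图中以拼接(concat)后卷积的方式实现特征融合,实际实现中也可采用逐元素相加等其他融合方式。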
根据本申请的一种可实施方式,融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。也就是说,在第一卷积层和第二卷积层分别设有用于进行物体区域框探测的物体探测框的信息,如:卷积参数或特征参数。在不同深度的第一卷积层和第二卷积层中设置的物体探测框的信息分别与两个或两个以上物体探测框集合各自对应,这两个或两个以上物体探测框集合分别包括不同探测框大小范围的物体探测框,以用于在该深度卷积神经网络的不同深度获取不同大小的物体区域框的特征数据。
融合特征图中的至少一个点的框融合检测数据可包括但不限于例如与物体探测框集合中的物体探测框相应的坐标数据、位置及大小数据,该预测准确信息可以是该框融合检测数据的置信度数据,如:预测准确概率。例如,融合特征图中的每个点可具有1个、3个、6个或9个与物体探测框相应的坐标数据以及该坐标数据的置信度数据。
相应地,根据本申请的一种可选实施方式,在步骤S210之后执行步骤S220。
步骤S220,根据至少一个融合特征图中的框融合检测数据以及预测准确信息分别获取与融合特征图各自对应的目标区域框数据。
在一个可选示例中,步骤S220可以由处理器调用存储器存储的指令执行,或者由被处理器运行的目标区域框预测模块820执行。
具体地,可根据融合特征图中至少一个点的框融合检测数据的预测准确信息来获取目标区域框数据。例如,如果某个点的某个框坐标数据的置信度大于预定的阈值(如:60%、70%),则可将该框坐标数据对应的区域框确定为目标区域框数据之一。
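上述根据框融合检测数据及预测准确信息筛选目标区域框的过程,可以用如下代码草图示意;其中假设融合特征图的每个点对k个物体探测框各给出4个坐标数据和1个置信度,张量布局与阈值取值均为示例性假设:

```python
import torch

def decode_target_boxes(fused_map, score_thresh=0.6):
    """fused_map: 形状为[1, k*5, H, W]的融合特征图,假设每个点对k个物体探测框
    给出(x, y, w, h, score)共5个数值;返回置信度超过阈值的框及其置信度。"""
    _, c, h, w = fused_map.shape
    k = c // 5
    pred = fused_map.view(k, 5, h, w).permute(2, 3, 0, 1).reshape(-1, 5)  # [H*W*k, 5]
    boxes = pred[:, :4]                    # 框坐标数据
    scores = torch.sigmoid(pred[:, 4])     # 预测准确信息(置信度)
    keep = scores > score_thresh           # 置信度大于预定阈值(如60%)的框
    return boxes[keep], scores[keep]
```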
根据本申请的另一种可选实施方式,在执行步骤S210后,执行步骤S230-S240。
步骤S230,分别获取融合特征图各自对应的初选目标区域框数据。
在一个可选示例中,步骤S230可以由处理器调用存储器存储的指令执行,或者由被处理器运行的目标区域框预测模块820执行。
例如,可执行与前述步骤S220或S120类似的处理,获取初选目标区域框数据,即,将前述步骤S220或S120获取到的目标区域框数据作为步骤S230中的初选目标区域框数据,以进行进一步的调整、修正处理,提高物体区域框检测的准确性。
在步骤S240,迭代地执行以下物体区域框回归操作,直到迭代满足迭代终止条件为止,通过调整融合特征图,从经过调整的融合特征图获取新的初选目标区域框数据。
在一个可选示例中,步骤S240可以由处理器调用存储器存储的指令执行,或者由被处理器运行的目标区域框预测模块820执行。
也就是说,通过调整各个融合特征图来分别调整其中的初选目标区域框数据,再从经过调整的融合特征图分别获取新的初选目标区域框数据,从而对初选目标区域框进行回归(物体区域框回归操作),来获取更为准确的新的初选目标区域框数据。
在该步骤,迭代地执行这样的物体区域框回归操作,直到满足迭代终止条件为止,以最终获得更为精确的初选目标区域框数据。可根据需要设置该迭代终止条件,如:预定的迭代次数、新的初选目标区域框数据与未经过调整的初选目标区域框数据之间的调整值(即框回归)小于预定的框回归值。
在完成步骤S240的迭代之后,将经过迭代得到的初选目标区域框数据作为待处理的图像中的目标区域框数据。
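上述迭代过程可以用如下骨架代码示意;其中regress_once表示一次物体区域框回归操作(其一种可能实现参见后文第三子网部分的示例),迭代次数与框回归阈值均为示例性假设:

```python
def iterative_box_regression(fused_map, init_boxes, regress_once,
                             max_iters=3, min_delta=1.0):
    """迭代地调整融合特征图并回归初选目标区域框,直到满足迭代终止条件。"""
    boxes = init_boxes
    for _ in range(max_iters):                   # 终止条件一:预定的迭代次数
        fused_map, new_boxes = regress_once(fused_map, boxes)
        delta = (new_boxes - boxes).abs().max()  # 本次迭代的框调整量(框回归)
        boxes = new_boxes
        if delta < min_delta:                    # 终止条件二:调整值小于预定的框回归值
            break
    return boxes                                 # 作为待处理图像中的目标区域框数据
```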
根据本申请另一实施例的物体检测方法,通过具有对称结构的用于目标区域框检测的深度卷积神经网络,从第一子网的多个第一卷积层获取逐步经过卷积、下采样的待处理的图像的多个第一特征图,再从第二子网的对称的多个第二卷积层获取在第一子网的末端获取的第一特征图逐步经过反卷积、上采样的相应多个第二特征图,将多个第一特征图和相应的第二特征图进一步进行卷积,获得较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息)的融合特征图,从而能够根据这些融合特征图有效地提取到图像中包含的大小物体的目标区域框数据。
在此基础上,通过调整多个融合特征图来从经过调整的融合特征图获取新的初选目标区域框数据,从而对初选目标区域框数据迭代地进行回归。通过对目标区域框数据进行多次的回归调整,能够更准确地检测到更为精准的包含的大小物体的目标区域框数据,进一步提高物体检测的准确性和鲁棒性。
图4是示出根据本申请又一实施例的物体检测方法的流程图。该实施例描述前述步骤S240中的一种示例性物体区域框回归操作。
根据该实施例的深度卷积神经网络还包括第三子网,第三子网具有多组第四卷积层和多个池化层,多组第四卷积层分别与第三卷积层对应,多个池化层分别与多组第四卷积层对应,并且每个池化层的输入包括经过调整的融合特征图和初选目标区域框的数据。
也就是说,每组第四卷积层可以包括一个或多个卷积层,每组第四卷积层可连接在前述第三卷积层的输出端,接收融合特征图作为输入。每个池化层设置在对应的第四卷积层的末端,接收经过调整的融合特征图和初选目标区域框数据作为输入。
其中,每组第四卷积层用于对从第三卷积层获取到的融合特征图进行卷积,获得调整融合特征图。在此过程中,对从该融合特征图获取的初选目标区域框数据进行调整。第三子网中的池化层用于对经过第四卷积层卷积获得的调整融合特征图进行区域池化,获取新的融合特征图。从而,可从新的融合特征图获取到新的初选目标区域框数据。
具体地,在每次迭代处理的物体区域框回归操作中,涉及当前迭代开始时的多个融合特征图以及初选目标区域框数据,还涉及当前迭代结束时获得的新的多个融合特征图以及新的初选目标区域框数据。
在步骤S410,通过第四卷积层分别对当前的融合特征图进行卷积,获取调整融合特征图,从而对当前的初选目标区域框进行调整,该调整包括对初选目标区域框的位置和/或大小的调整。
在一个可选示例中,该步骤S410可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的框调整单元821执行。
在步骤S420,根据当前的初选目标区域框数据,通过池化层对调整融合特征图进行区域池化,获取新的融合特征图。
在一个可选示例中,该步骤S420可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的区域池化单元822执行。
也就是说,将当前的初选目标区域框作为关注区域,对调整融合特征图进行区域池化,获取新的融合特征图。
通过前述根据当前的初选目标区域框数据对调整融合特征图进行区域池化,获得反映对调整的目标区域框的响应程度的新的融合特征图,以便于后续从新的融合特征图获取新的初选目标区域框数据。
在步骤S430,从新的融合特征图获取新的初选目标区域框数据,从而可完成目标区域框的回归,使得调整的目标区域框更趋近物体区域框的真值(ground truth)。可通过与步骤S120或S220类似的处理执行步骤S430的处理。
在一个可选示例中,该步骤S430可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的初选框获取单元823执行。
根据本申请的一种可选实施方式,第三子网还具有设置在池化层输出端的第五卷积层。相应地,步骤S430具体包括:通过第五卷积层对新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
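步骤S410-S430所描述的一次物体区域框回归操作,可以用如下基于PyTorch的代码草图示意;其中区域池化采用torchvision.ops.roi_pool仅作为一种可选实现,卷积层数、通道数、池化输出尺寸以及由池化特征回归坐标调整量的方式均为示例性假设:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class BoxRegressionStep(nn.Module):
    """示意性的一次物体区域框回归操作:第四卷积层→区域池化→第五卷积层(规范化卷积)。"""

    def __init__(self, ch=64, pool_size=7):
        super().__init__()
        self.conv4 = nn.Conv2d(ch, ch, 3, padding=1)              # 第四卷积层
        self.conv5 = nn.Conv2d(ch, ch, 3, padding=1)              # 第五卷积层(规范化卷积)
        self.box_head = nn.Linear(ch * pool_size * pool_size, 4)  # 由池化特征回归框的调整量(假设)
        self.pool_size = pool_size

    def forward(self, fused_map, boxes):
        """fused_map: [1, ch, H, W];boxes: [N, 4],格式为(x1, y1, x2, y2)。"""
        adjusted = torch.relu(self.conv4(fused_map))              # 获取调整融合特征图
        batch_idx = boxes.new_zeros(boxes.shape[0], 1)            # 单幅图像的batch索引
        rois = torch.cat([batch_idx, boxes], dim=1)               # [N, 5]
        pooled = roi_pool(adjusted, rois, output_size=self.pool_size)  # 区域池化,得到新的融合特征图
        pooled = torch.relu(self.conv5(pooled))                   # 规范化卷积
        deltas = self.box_head(pooled.flatten(1))                 # 框回归量
        new_boxes = boxes + deltas                                # 新的初选目标区域框数据
        return adjusted, new_boxes
```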
可使用任何具有上述结构的卷积神经网络来构建第三子网。可选地,将第三子网构建为在新近开发的物体检测技术中性能较佳的残差网络(ResNet)结构,来执行区域池化和规范化卷积。
根据本申请上述实施例的物体检测方法,在前述各个实施例的基础上,通过对至少一个融合特征图进一步进行卷积,来对该融合特征图中包含的初选目标区域框数据进行调整,再经过区域池化来获得新的融合特征图,并从新的融合特征图获取新的初选目标区域框数据,从而对预测得到的初选目标区域框数据进行调整、回归,有助于提高物体检测的准确性和鲁棒性。
图5是示出根据本申请一实施例的神经网络的训练方法的流程图。
参照图5,在步骤S510,将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图。
在一个可选示例中,该步骤S510可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的融合特征图检测模块1010执行。
如前所述,深度卷积神经网络包括第一子网和第二子网,第一子网具有至少一个下采样层,第二子网具有至少一个上采样层;融合特征图通过第一特征图和第二特征图得到,第一特征图从第一子网获取得到,第二特征图从第二子网获取得到。
通过使用该深度卷积神经网络,可从含有目标区域框标注信息的样本图像检测获取到多个融合特征图。
通常对多个样本图像执行步骤S510的处理,为至少一个样本图像检测获取多个融合特征图。
步骤S520,根据多个融合特征图获取样本图像的目标区域框数据。
在一个可选示例中,该步骤S520可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的目标区域框检测模块1020执行。
步骤S520的处理与步骤S120的处理类似,在此不予赘述。
步骤S530,根据获取到的样本图像的目标区域框数据以及目标区域框标注信息确定物体框检测的第一差异数据。
在一个可选示例中,该步骤S530可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一差异获取模块1030执行。
例如,可根据获取到的所述样本图像的目标区域框数据以及目标区域框标注信息计算损失值或偏差值作为该第一差异数据,作为后续训练深度卷积神经网络的依据。
在步骤S540,根据第一差异数据调整深度卷积神经网络的网络参数。
在一个可选示例中,该步骤S540可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一网络训练模块1040执行。
例如,将确定的第一差异数据反传给该深度卷积神经网络,以调整该深度卷积神经网络的网络参数。
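上述根据第一差异数据调整网络参数的过程,可以用如下训练步骤草图示意;其中损失函数、优化器以及model的输出形式均为示例性假设,并非本申请限定的实现:

```python
import torch.nn.functional as F

def train_step(model, optimizer, sample_image, gt_boxes, gt_labels):
    """model假设返回由融合特征图得到的目标区域框数据及其置信度logits;
    gt_boxes为标注框,gt_labels为与logits同形状的0/1浮点张量(均为示例性假设)。"""
    pred_boxes, pred_logits = model(sample_image)
    # 第一差异数据:此处以框回归损失与置信度分类损失之和作为示例
    loss = F.smooth_l1_loss(pred_boxes, gt_boxes) + \
           F.binary_cross_entropy_with_logits(pred_logits, gt_labels)
    optimizer.zero_grad()
    loss.backward()          # 将第一差异数据反传给深度卷积神经网络
    optimizer.step()         # 调整网络参数
    return loss.item()
```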
根据本申请提供的神经网络的训练方法,将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图;其中,从具有至少一个下采样层的第一子网检测获取多个第一特征图,从具有至少一个上采样层的第二子网检测获取多个第二特征图,分别由多个第一特征图和多个第二特征图融合得到融合特征图,再根据多个融合特征图获取目标区域框数据。此后,根据获取到的目标区域框数据以及目标区域框标注信息确定第一差异数据,再根据第一差异数据调整深度卷积神经网络的网络参数。由于从训练获得的深度卷积神经网络的这些融合特征图较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息),因此根据这些融合特征图能够有效地提取到图像中包含的大小物体的目标区域框数据。训练获得的深度卷积神经网络能够提高物体检测的准确性和鲁棒性。
图6是示出根据本申请另一实施例的神经网络的训练方法的流程图。
根据本实施例,在训练的深度卷积神经网络中,第二子网设置在第一子网的末端;第一子网具有多个第一卷积层和至少一个下采样层,下采样层设置在多个第一卷积层之间;第二子网具有多个第二卷积层和至少一个上采样层,上采样层设置在多个第二卷积层之间。第一卷积层和第二卷积层对称设置,至少一个下采样层和至少一个上采样层分别对称地设置。
在此基础上,可选地,在至少一个第一卷积层设有用于输出第一特征图的第一输出分支,在第二卷积层设有用于输出第二特征图的第二输出分支。
为此,可选地,第二子网还具有多个第三卷积层,第三卷积层的输入包括第一输出分支和第二输出分支。相应地,第三卷积层用于对来自第一输出分支和第二输出分支的第一特征图和相应的第二特征图进行卷积,获取相应的融合特征图。
参照图6,在步骤S610,缩放样本图像,使得样本图像中的至少一个物体区域框的真值被物体探测框覆盖。如此,可确保在任何批量的样本图像中具有正样本。
此外,可选地,选取足够数量的正样本,并选取一定数量的负样本,以使得训练得到的第一子网和第二子网较好地收敛。
在此,正样本即为正样本区域框,负样本即为负样本区域框。可按照以下标准定义正样本区域框和负样本区域框:正样本区域框与物体区域框的真值的重叠率不低于第一重叠比率值,负样本区域框与物体区域框的真值的重叠率不高于第二重叠比率值,第一重叠比率值大于第二重叠比率值。
相应地,根据本申请的一种可实施方式,样本图像的目标区域框标注信息包括正样本区域框的标注信息和负样本区域框的标注信息。
这里,可根据设计需要设置第一重叠比率值和第二重叠比率值,例如:将第一重叠比率值设置为70%-95%中的任何比率值,将第二重叠比率值设置为0%-30%或0%-25%范围中的任何比率值。
此外,还可设置中性样本,即中性样本区域框。具体地,可按照以下标准定义中性样本区域框:中性样本区域框与物体区域框的真值的重叠率在第一重叠比率值和第二重叠比率值之间,如:30%-70%之间、25%-80%之间。
进一步地,例如,可按照以下方式控制正样本、负样本和中性样本的数量:在全部样本图像当中,标注的正样本区域框的总和在正样本区域框、负样本区域框以及中性样本区域框的框总数中的占比不小于预定的第一比值,第一比值大于50%;标注的负样本区域框的总和在框总数中的占比不大于预定的第二比值;标注的中性样本区域框的总和在框总数中的占比不大于预定的第三比值,第三比值不大于第一比值和第二比值之和的一半。适度地使用中性样本有助于更好地区分正样本和负样本,提高训练的第三子网的鲁棒性。
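上述正样本、负样本与中性样本区域框的划分,可以用如下代码草图示意;其中0.7与0.3仅为落在上文所述范围内的示例性阈值取值:

```python
import torch
from torchvision.ops import box_iou

def label_sample_boxes(candidate_boxes, gt_boxes, high=0.7, low=0.3):
    """candidate_boxes: [N, 4],gt_boxes: [M, 4],均为(x1, y1, x2, y2)格式;
    返回每个候选框的标注:1为正样本,0为负样本,-1为中性样本。"""
    overlap = box_iou(candidate_boxes, gt_boxes).max(dim=1).values  # 与真值的最大重叠率
    labels = torch.full((candidate_boxes.shape[0],), -1, dtype=torch.long)
    labels[overlap >= high] = 1   # 重叠率不低于第一重叠比率值:正样本区域框
    labels[overlap <= low] = 0    # 重叠率不高于第二重叠比率值:负样本区域框
    return labels
```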
在步骤S620,将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图。其中,从多个第三卷积层的输出端分别获取所述融合特征图。
在一个可选示例中,该步骤S620可以由处理器调用存储器存储的相应指令执行。
可使用任何具有上述结构的深度卷积神经网络。可选地,将第一子网和第二子网均构建为在物体检测中性能较佳的Inception-BN网络结构。
可选地,融合特征图中的至少一个点的框融合检测数据可包括但不限于例如与物体探测框集合中的物体探测框相应的坐标数据、位置及大小数据,该预测准确信息可以是该框融合检测数据的置信度数据,如:预测准确概率。
相应地,步骤S630,根据至少一个所述融合特征图中的框融合检测数据以及预测准确信息分别获取与所述融合特征图各自对应的目标区域框数据。
在一个可选示例中,该步骤S630可以由处理器调用存储器存储的相应指令执行。
步骤S640,根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据。
在一个可选示例中,该步骤S640可以由处理器调用存储器存储的相应指令执行。
例如,可根据获取到的样本图像的目标区域框数据以及目标区域框标注信息计算损失值或偏差值作为该第一差异数据,作为后续训练深度卷积神经网络的依据。
在步骤S650,根据第一差异数据调整深度卷积神经网络的网络参数。
在一个可选示例中,该步骤S650可以由处理器调用存储器存储的相应指令执行。
步骤S640-S650的处理与前述步骤S530-S540的处理类似,在此不予赘述。
根据本申请的神经网络的训练方法,将含有目标区域框标注信息的样本图像输入具有对称结构的用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图;其中,从具有至少一个下采样层的第一子网检测获取多个第一特征图,从具有至少一个上采样层的第二子网检测获取多个第二特征图,分别由多个第一特征图和多个第二特征图融合得到融合特征图,再根据多个融合特征图获取目标区域框数据。此后,根据获取到的目标区域框数据以及目标区域框标注信息确定第一差异数据,再根据第一差异数据调整深度卷积神经网络的网络参数。由于从训练获得的深度卷积神经网络的这些融合特征图较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息),因此根据这些融合特征图能够有效地提取到图像中包含的大小物体的目标区域框数据。训练获得的深度卷积神经网络能够提高物体检测的准确性和鲁棒性。
图7是示出根据本申请又一实施例的神经网络的训练方法的流程图。
如前所述,根据上述实施例训练的该深度卷积神经网络还包括第三子网,第三子网具有多组第四卷积层和多个池化层,多组第四卷积层分别与第三卷积层对应,多个池化层分别与多组第四卷积层对应,并且每个池化层的输入包括经过调整的融合特征图和初选目标区域框的数据。
也就是说,每组第四卷积层可以包括一个或多个卷积层,每组第四卷积层可连接在前述第三卷积层的输出端,接收融合特征图作为输入。每个池化层设置在对应的第四卷积层的末端,接收经过调整的融合特征图和所述初选目标区域框数据作为输入。
在该实施例中,主要描述该深度卷积神经网络中的第三子网的训练。可先通过上述任一实施例的方法训练好第一子网和第二子网,再使用自第一子网和第二子网训练过程中获得的融合特征图,根据该实施例的方法来训练第三子网。
参照图7,在步骤S710,获取从含有目标区域框标注信息的样本图像获取的多个融合特征图。
如前步骤S510或S610所述,从样本图像获取该多个融合特征图。
在一个可选示例中,该步骤S710可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的融合特征图检测模块1010执行。
在步骤S720,迭代地执行目标区域框回归训练操作,直到迭代满足迭代终止条件为止。
在一个可选示例中,该步骤S720可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的框回归迭代训练模块1050执行。
具体地,步骤S720包括步骤S721-S726。
在步骤S721,通过第四卷积层分别对当前的融合特征图进行卷积,获取调整融合特征图,从而达到对当前的初选目标区域框进行调整的目的。
在步骤S722,根据当前的初选目标区域框数据,通过池化层对调整融合特征图进行区域池化,获取新的融合特征图。新的融合特征图包含对初选目标区域框的调整,并反映对调整后的目标区域框的响应程度。
在步骤S723,从新的融合特征图获取新的初选目标区域框数据。
步骤S721-S723的处理与前述步骤S410-S430的处理类似,在此不予赘述。
根据本申请的一种可选实施方式,第三子网还具有设置在所述池化层输出端的第五卷积层。相应地,步骤S723具体包括:通过第五卷积层对新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取新的初选目标区域框数据。
在步骤S724,根据未经过调整的初选目标区域框数据和新的初选目标区域框数据之间的框回归数据、新的初选目标区域框数据和相应的目标区域框标注信息确定物体框检测的第二差异数据。
在一个可选示例中,该步骤S724可以由处理器调用存储器存储的相应指令执行。
例如,可通过新的初选目标区域框数据和相应的目标区域框标注信息确定检测偏移,并且根据检测偏移和框回归数据(即框移动/调整数据)来计算损失值作为第二差异数据。通过综合两个偏移参数(检测偏移和框回归数据)作为物体框检测的第二差异数据,能够提高训练的第三子网的准确性。
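第二差异数据的一种示例性计算方式如下面的代码草图所示;其中将框回归数据对应的损失与检测偏移对应的损失简单相加,具体的组合方式与损失函数均为示例性假设:

```python
import torch.nn.functional as F

def second_difference(old_boxes, new_boxes, gt_boxes):
    """old_boxes: 未经过调整的初选目标区域框;new_boxes: 新的初选目标区域框;
    gt_boxes: 相应的目标区域框标注信息(三者形状均为[N, 4])。"""
    pred_regression = new_boxes - old_boxes       # 框回归数据(框移动/调整量)
    target_regression = gt_boxes - old_boxes      # 由标注信息确定的期望调整量
    regression_loss = F.smooth_l1_loss(pred_regression, target_regression)
    offset_loss = F.smooth_l1_loss(new_boxes, gt_boxes)   # 检测偏移对应的损失
    return regression_loss + offset_loss          # 物体框检测的第二差异数据
```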
在步骤S725,根据第二差异数据调整第三子网的网络参数。
在一个可选示例中,该步骤S725可以由处理器调用存储器存储的相应指令执行。
例如,将确定的第二差异数据反传给第三子网,以调整第三子网的网络参数。
在步骤S726,确定是否满足迭代终止条件。
在一个可选示例中,该步骤S726可以由处理器调用存储器存储的相应指令执行。
如果在步骤S726,确定前述的迭代满足迭代终止条件(如:达到预定的迭代次数),则结束对第三子网的训练;如果在步骤S726,确定前述的迭代不满足迭代终止条件(如:未达到预定的迭代次数),则返回执行步骤S721,继续进行前述对第三子网的训练,直到确定满足迭代终止条件为止。
现有的用于物体区域框回归的神经网络的训练仅针对一次目标区域框回归执行迭代(如迭代次数N)的训练;而根据本申请提供的训练方法,对目标区域框执行多次回归(如回归次数M),每次回归涉及多次迭代(如迭代次数N)的训练,即涉及M×N次迭代训练。由此训练得到的第三子网在进行物体区域框的定位检测上更为准确。
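上述M次回归、每次回归涉及N次迭代训练的过程,可以用如下嵌套循环骨架示意;其中train_one_iteration为假设的单次迭代训练函数,M、N的取值依实际需要设定:

```python
def train_third_subnet(train_one_iteration, num_regressions=3, num_iterations=10000):
    """对目标区域框执行M次回归,每次回归涉及N次迭代训练,共M×N次迭代。"""
    for m in range(num_regressions):        # 第m次回归
        for n in range(num_iterations):     # 该次回归中的第n次迭代训练
            train_one_iteration(m, n)
```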
可使用任何具有上述结构的卷积神经网络来构建第三子网。可选地,将第三子网构建为在新近开发的物体检测技术中性能较佳的ResNet结构,来执行区域池化和规范化卷积。
根据本申请提供的神经网络的训练方法,在前述各实施例的基础上,训练得到的深度卷积神经网络通过对样本图像的每个融合特征图进一步进行卷积,来对该融合特征图中包含的初选目标区域框数据进行调整,再经过区域池化来获得新的融合特征图,并从新的融合特征图获取新的初选目标区域框数据,从而对得到的初选目标区域框数据进行调整、回归,能够进一步提高物体检测的准确性和鲁棒性。
图8是示出根据本申请一实施例的物体检测装置的结构框图。
参照图8,本实施例的物体检测装置包括融合特征图预测模块810和目标区域框预测模块820。
融合特征图预测模块810用于通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图;其中,深度卷积神经网络包括第一子网和第二子网,第一子网具有至少一个下采样层,第二子网具有至少一个上采样层;融合特征图通过第一特征图和第二特征图得到,第一特征图从第一子网获取得到,第二特征图从第二子网获取得到。
目标区域框预测模块820用于根据融合特征图预测模块810获取的多个融合特征图获取目标区域框数据。
本实施例的物体检测装置用于实现前述方法实施例中相应的物体检测方法,并具有相应的方法实施例的有益效果,在此不再赘述。
图9是示出根据本申请另一实施例的物体检测装置的结构框图。
根据本实施例,在用于目标区域框检测的深度卷积神经网络中,第二子网设置在第一子网的末端,第一子网具有多个第一卷积层和至少一个下采样层,下采样层设置在多个第一卷积层之间,第二子网具有多个第二卷积层和所述至少一个上采样层,上采样层设置在多个第二卷积层之间,第一卷积层和第二卷积层对称设置,至少一个下采样层和至少一个上采样层分别对称地设置。
根据一种可选的实施方式,在至少一个第一卷积层设有用于输出第一特征图的第一输出分支,在第二卷积层设有用于输出第二特征图的第二输出分支。
根据一种可选的实施方式,第二子网还具有多个第三卷积层,第三卷积层的输入包括第一输出分支和所述第二输出分支。相应地,融合特征图预测模块810用于从多个第三卷积层的输出端分别获取融合特征图。
可选地,融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。相应地,目标区域框预测模块820用于根据至少一个融合特征图中的框融合检测数据以及预测准确信息分别获取与融合特征图各自对应的目标区域框数据。
可选地,目标区域框预测模块820用于分别获取融合特征图各自对应的初选目标区域框数据;迭代地执行以下物体区域框回归操作,直到迭代满足迭代终止条件为止:通过调整融合特征图,从经过调整的融合特征图获取新的初选目标区域框数据;将经过迭代得到的初选目标区域框数据作为待处理的图像中的目标区域框数据。
可选地,深度卷积神经网络还包括第三子网,第三子网具有多组第四卷积层和多个池化层,多组第四卷积层分别与第三卷积层对应,多个池化层分别与多组第四卷积层对应,并且每个池化层的输入包括经过调整的融合特征图和初选目标区域框的数据。
可选地,目标区域框预测模块820包括:
框调整单元821,用于通过第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;
区域池化单元822,用于根据当前的初选目标区域框数据,通过池化层对调整融合特征图进行区域池化,获取新的融合特征图;
初选框获取单元823,用于从新的融合特征图获取新的初选目标区域框数据。
可选地,第三子网还具有设置在池化层输出端的第五卷积层;相应地,初选框获取单元823用于通过第五卷积层对新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取新的初选目标区域框数据。
可选地,第一子网和第二子网均为认知―样本归一化(Inception-BN)网络结构,第三子网为残差网络(ResNet)结构。
本实施例的物体检测装置用于实现前述方法实施例中相应的物体检测方法,并具有相应的方法实施例的有益效果,在此不再赘述。
图10是示出根据本申请一实施例的神经网络的训练装置的结构框图。
参照图10,本实施例的神经网络的训练装置包括融合特征图检测模块1010、目标区域框检测模块1020、第一差异获取模块1030和第一网络训练模块1040。
融合特征图检测模块1010用于将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图,深度卷积神经网络包括第一子网和第二子网,第一子网具有至少一个下采样层,第二子网具有至少一个上采样层;融合特征图通过第一特征图和第二特征图得到,第一特征图从第一子网获取得到,第二特征图从第二子网获取得到。
目标区域框检测模块1020用于根据多个融合特征图获取样本图像的目标区域框数据。
第一差异获取模块1030用于根据获取到的样本图像的目标区域框数据以及目标区域框标注信息确定物体框检测的第一差异数据。
第一网络训练模块1040用于根据第一差异数据调整深度卷积神经网络的网络参数。
本实施例的神经网络的训练装置用于实现前述方法实施例中相应的神经网络的训练方法,并具有相应的方法实施例的有益效果,在此不再赘述。
图11是示出根据本申请另一实施例的神经网络的训练装置的结构框图。
根据本实施例,在用于目标区域框检测的深度卷积神经网络中,第二子网设置在第一子网的末端,第一子网具有多个第一卷积层和至少一个下采样层,下采样层设置在多个第一卷积层之间,第二子网具有多个第二卷积层和至少一个上采样层,上采样层设置在多个第二卷积层之间,第一卷积层和第二卷积层对称设置,至少一个下采样层和至少一个上采样层分别对称地设置。
根据一种可选的实施方式,在至少一个第一卷积层设有用于输出第一特征图的第一输出分支,在第二卷积层设有用于输出第二特征图的第二输出分支。
根据一种可选的实施方式,第二子网还具有多个第三卷积层,第三卷积层的输入包括第一输出分支和第二输出分支。相应地,融合特征图检测模块1010用于从多个第三卷积层的输出端分别获取融合特征图。
可选地,融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。
可选地,深度卷积神经网络还包括第三子网,第三子网具有多组第四卷积层和多个池化层,多组第四卷积层分别与第三卷积层对应,多个池化层分别与多组第四卷积层对应,并且每个池化层的输入包括经过调整的融合特征图和初选目标区域框的数据。
可选地,上述装置还包括:框回归迭代训练模块1050,用于迭代地执行以下目标区域框回归训练操作,直到迭代满足迭代终止条件为止:通过第四卷积层分别对当前的融合特征图进行卷积,获取调整融合特征图;根据当前的初选目标区域框数据,通过池化层对调整融合特征图进行区域池化,获取新的融合特征图;从新的融合特征图获取新的初选目标区域框数据;根据未经过调整的初选目标区域框数据和新的初选目标区域框数据之间的框回归数据、新的初选目标区域框数据和相应的目标区域框标注信息确定物体框检测的第二差异数据;根据第二差异数据调整第三子网的网络参数。
可选地,第三子网还具有设置在池化层输出端的第五卷积层;相应地,框回归迭代训练模块1050用于通过第五卷积层对新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
可选地,上述装置还包括:预处理模块1060,用于在将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图之前,缩放样本图像,使得至少一个物体区域框的真值被物体探测框覆盖。
可选地,样本图像的目标区域框标注信息包括正样本区域框的标注信息和负样本区域框的标注信息;正样本区域框与物体区域框的真值的重叠率不低于第一重叠比率值,负样本区域框与物体区域框的真值的重叠率不高于第二重叠比率值,第一重叠比率值大于第二重叠比率值。
可选地,样本图像的目标区域框标注信息还包括中性样本区域框的标注信息,中性样本区域框与物体区域框的真值的重叠率在第一重叠比率值和第二重叠比率值之间。
可选地,在全部样本图像当中,标注的正样本区域框的总和在正样本区域框、负样本区域框以及中性样本区域框的框总数中的占比不小于预定的第一比值,该第一比值大于50%;标注的负样本区域框的总和在框总数中的占比不大于预定的第二比值;标注的中性样本区域框的总和在框总数中的占比不大于预定的第三比值,第三比值不大于第一比值和第二比值之和的一半。
可选地,第一子网和第二子网均为认知―样本归一化网络结构,第三子网为残差网络结构。
本实施例的神经网络的训练装置用于实现前述方法实施例中相应的神经网络的训练方法,并具有相应的方法实施例的有益效果,在此不再赘述。
另外,本申请实施例还提供了一种电子设备,包括:处理器和存储器;
所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行本申请上述任一实施例所述的物体检测方法对应的操作;或者,
所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行本申请上述任一实施例所述的神经网络的训练方法对应的操作。
另外,本申请实施例还提供了另一种电子设备,包括:
处理器和本申请上述任一实施例所述的物体检测装置;在处理器运行所述物体检测装置时,本申请上述任一实施例所述的物体检测装置中的单元被运行;或者
处理器和本申请上述任一实施例所述的神经网络的训练装置;在处理器运行所述神经网络的训练装置时,本申请上述任一实施例所述的神经网络的训练装置中的单元被运行。
图12是示出根据本申请一个实施例的第一电子设备的结构示意图。
本申请还提供了一种电子设备,例如:可以是移动终端、个人计算机(PC)、平板电脑、服务器。下面参考图12,其示出了适于用来实现本申请实施例的终端设备或服务器的第一电子设备1200的结构示意图。
如图12所示,第一电子设备1200包括但不限于一个或多个第一处理器、第一通信元件,所述一个或多个第一处理器例如:一个或多个第一中央处理单元(CPU)1201,和/或一个或多个第一图像处理器(GPU)1213,第一处理器可以根据存储在第一只读存储器(ROM)1202中的可执行指令或者从第一存储部分1208加载到第一随机访问存储器(RAM)1203中的可执行指令而执行各种适当的动作和处理。第一通信元件包括第一通信组件1212和第一通信接口1209。其中,第一通信组件1212可包括但不限于网卡,所述网卡可包括但不限于IB(Infiniband)网卡,第一通信接口1209包括诸如LAN卡、调制解调器的网络接口卡的通信接口,第一通信接口1209经由诸如因特网的网络执行通信处理。
第一处理器可与第一只读存储器1202和/或第一随机访问存储器1203通信以执行可执行指令,通过第一总线1204与第一通信组件1212相连、并经第一通信组件1212与其他目标设备通信,从而完成本申请实施例提供的任一项方法对应的操作,例如,通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图;其中,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;根据所述多个融合特征图获取目标区域框数据。
此外,在第一RAM 1203中,还可存储有装置操作所需的各种程序和数据。第一CPU1201、第一ROM1202以及第一RAM1203通过第一总线1204彼此相连。在有第一RAM1203的情况下,第一ROM1202为可选模块。第一RAM1203存储可执行指令,或在运行时向第一ROM1202中写入可执行指令,可执行指令使第一处理器1201执行上述通信方法对应的操作。第一输入/输出(I/O)接口1205也连接至第一总线1204。第一通信组件1212可以集成设置,也可以设置为具有多个子模块(例如多个IB网卡),并在总线链接上。
以下部件连接至第一I/O接口1205:包括键盘、鼠标的第一输入部分1206;包括但不限于诸如阴极射线管(CRT)、液晶显示器(LCD)以及扬声器的第一输出部分1207;包括但不限于硬盘的第一存储部分1208;以及包括但不限于诸如LAN卡、调制解调器的网络接口卡的第一通信接口1209。第一驱动器1210也根据需要连接至第一I/O接口1205。第一可拆卸介质1211,诸如磁盘、光盘、磁光盘、半导体存储器,根据需要安装在第一驱动器1210上,以便于从其上读出的计算机程序根据需要被安装入第一存储部分1208。
需要说明的是,如图12所示的架构仅为一种可选实现方式,在具体实践过程中,可根据实际需要对上述图12的部件数量和类型进行选择、删减、增加或替换;在不同功能部件设置上,也可采用分离设置或集成设置实现方式,例如GPU和CPU可分离设置或者可将GPU集成在CPU上,第一通信组件1212可分离设置,也可集成设置在CPU或GPU上。这些可替换的实施方式均落入本申请的保护范围。
特别地,根据本申请实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请实施例包括一种计算机程序产品,其包括有形地包含在机器可读介质上的计算机程序,计算机程序包含用于执行流程图所示的方法的程序代码,程序代码可包括对应执行本申请实施例提供的方法步骤对应的指令,例如,用于通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图的可执行代码;其中,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;用于根据所述多个融合特征图获取目标区域框数据的可执行代码。在这样的实施例中,该计算机程序可以通过通信元件从网络上被下载和安装,和/或从第一可拆卸介质1211被安装。在该计算机程序被第一中央处理单元(CPU)1201执行时,执行本申请的方法中限定的上述功能。
本申请该实施例提供的电子设备,通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图,其中,从具有至少一个下采样层的第一子网获取多个第一特征图,从具有至少一个上采样层的第二子网获取多个第二特征图,分别由多个第一特征图和多个第二特征图融合得到融合特征图。此后,再根据所述多个融合特征图获取目标区域框数据。由于这些融合特征图较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息),因此根据这些融合特征图能够有效地提取到图像中包含的大小物体的目标区域框数据,从而提高物体检测的准确性和鲁棒性。
图13是示出根据本申请另一实施例的第二电子设备的结构示意图。
本申请还提供了一种电子设备,例如可以是移动终端、个人计算机(PC)、平板电脑、服务器。下面参考图13,其示出了适于用来实现本申请实施例的终端设备或服务器的第二电子设备1300的结构示意图。
如图13所示,第二电子设备1300包括但不限于一个或多个第二处理器、第二通信元件,所述一个或多个第二处理器例如:一个或多个第二中央处理单元(CPU)1301,和/或一个或多个第二图像处理器(GPU)1313,第二处理器可以根据存储在第二只读存储器(ROM)1302中的可执行指令或者从第二存储部分1308加载到第二随机访问存储器(RAM)1303中的可执行指令而执行各种适当的动作和处理。第二通信元件包括第二通信组件1312和第二通信接口1309。其中,第二通信组件1312可包括但不限于网卡,所述网卡可包括但不限于IB(Infiniband)网卡,第二通信接口1309包括诸如LAN卡、调制解调器的网络接口卡的通信接口,第二通信接口1309经由诸如因特网的网络执行通信处理。
第二处理器可与第二只读存储器1302和/或第二随机访问存储器1303通信以执行可执行指令,通过第二总线1304与第二通信组件1312相连、并经第二通信组件1312与其他目标设备通信,从而完成本申请实施例提供的任一项方法对应的操作,例如,将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;根据所述多个融合特征图获取所述样本图像的目标区域框数据;根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据;根据所述第一差异数据调整所述深度卷积神经网络的网络参数。
此外,在第二RAM 1303中,还可存储有装置操作所需的各种程序和数据。第二CPU1301、第二ROM1302以及第二RAM1303通过第二总线1304彼此相连。在有第二RAM1303的情况下,第二ROM1302为可选模块。第二RAM1303存储可执行指令,或在运行时向第二ROM1302中写入可执行指令,可执行指令使第二处理器1301执行上述通信方法对应的操作。第二输入/输出(I/O)接口1305也连接至第二总线1304。第二通信组件1312可以集成设置,也可以设置为具有多个子模块(例如多个IB网卡),并在总线链接上。
以下部件连接至第二I/O接口1305:包括但不限于键盘、鼠标的第二输入部分1306;包括但不限于诸如阴极射线管(CRT)、液晶显示器(LCD)以及扬声器的第二输出部分1307;包括但不限于硬盘的第二存储部分1308;以及包括诸如LAN卡、调制解调器的网络接口卡的第二通信接口1309。第二驱动器1310也根据需要连接至第二I/O接口1305。第二可拆卸介质1311,诸如磁盘、光盘、磁光盘、半导体存储器,根据需要安装在第二驱动器1310上,以便于从其上读出的计算机程序根据需要被安装入第二存储部分1308。
需要说明的是,如图13所示的架构仅为一种可选实现方式,在具体实践过程中,可根据实际需要对上述图13的部件数量和类型进行选择、删减、增加或替换;在不同功能部件设置上,也可采用分离设置或集成设置实现方式,例如GPU和CPU可分离设置或者可将GPU集成在CPU上,第二通信组件1312可分离设置,也可集成设置在CPU或GPU上。这些可替换的实施方式均落入本申请的保护范围。
特别地,根据本申请实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请实施例包括一种计算机程序产品,其包括有形地包含在机器可读介质上的计算机程序,计算机程序包含用于执行流程图所示的方法的程序代码,程序代码可包括对应执行本申请实施例提供的方法步骤对应的指令,例如,用于将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图的可执行代码,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;用于根据所述多个融合特征图获取所述样本图像的目标区域框数据的可执行代码;用于根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据的可执行代码;用于根据所述第一差异数据调整所述深度卷积神经网络的网络参数的可执行代码。在这样的实施例中,该计算机程序可以通过通信元件从网络上被下载和安装,和/或从第二可拆卸介质1311被安装。在该计算机程序被第二中央处理单元(CPU)1301执行时,执行本申请实施例的方法中限定的上述功能。
本申请该实施例提供的电子设备,将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图;其中,从具有至少一个下采样层的第一子网检测获取多个第一特征图,从具有至少一个上采样层的第二子网检测获取多个第二特征图,分别由多个第一特征图和多个第二特征图融合得到融合特征图,再根据所述多个融合特征图获取目标区域框数据。此后,根据获取到 的目标区域框数据以及所述目标区域框标注信息确定第一差异数据,再根据所述第一差异数据调整所述深度卷积神经网络的网络参数。由于从训练获得的深度卷积神经网络的这些融合特征图较好地表征了图像中高层的语义特征(如:布局、前背景信息)和低层的细节特征(如:小物体信息),因此根据这些融合特征图能够有效地提取到图像中包含的大小物体的目标区域框数据。训练获得的深度卷积神经网络能够提高物体检测的准确性和鲁棒性。
另外,本申请实施例还提供了一种计算机程序,包括计算机可读代码,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现本申请任一实施例所述的物体检测方法中各步骤的指令;或者
当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现本申请任一实施例所述的神经网络的训练方法中各步骤的指令。
另外,本申请实施例还提供了一种计算机可读存储介质,用于存储计算机可读取的指令,所述指令被执行时实现本申请任一实施例所述的物体检测方法中各步骤的操作、或者本申请任一实施例所述的神经网络的训练方法中各步骤的操作。
本说明书中各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其它实施例的不同之处,各个实施例之间相同或相似的部分相互参见即可。对于系统实施例而言,由于其与方法实施例基本对应,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
可能以许多方式来实现本申请的方法和装置。例如,可通过软件、硬件、固件或者软件、硬件、固件的任何组合来实现本申请的方法和装置。用于所述方法的步骤的上述顺序仅是为了进行说明,本申请的方法的步骤不限于以上具体描述的顺序,除非以其它方式特别说明。此外,在一些实施例中,还可将本申请实施为记录在记录介质中的程序,这些程序包括用于实现根据本申请的方法的机器可读指令。因而,本申请还覆盖存储用于执行根据本申请的方法的程序的记录介质。
本申请的描述是为了示例和描述起见而给出的,而并不是无遗漏的或者将本申请限于所公开的形式。很多修改和变化对于本领域的普通技术人员而言是显然的。选择和描述实施例是为了更好说明本申请的原理和实际应用,并且使本领域的普通技术人员能够理解本申请,从而设计适于特定用途的带有各种修改的各种实施例。

Claims (50)

  1. 一种物体检测方法,包括:
    通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图;其中,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;
    根据所述多个融合特征图获取目标区域框数据。
  2. 根据权利要求1所述的方法,其中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
  3. 根据权利要求2所述的方法,其中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
  4. 根据权利要求3所述的方法,其中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;
    所述预测获取多个融合特征图包括:
    从所述多个第三卷积层的输出端分别获取所述融合特征图。
  5. 根据权利要求1-4中任一项所述的方法,其中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息,
    所述根据所述多个融合特征图获取目标区域框数据包括:
    根据至少一个所述融合特征图中的框融合检测数据以及预测准确信息分别获取与所述融合特征图各自对应的目标区域框数据。
  6. 根据权利要求1-5中任一项所述的方法,其中,所述根据所述多个融合特征图获取目标区域框数据包括:
    分别获取所述融合特征图各自对应的初选目标区域框数据;
    迭代地执行以下物体区域框回归操作,直到所述迭代满足迭代终止条件为止:通过调整所述融合特征图,从经过调整的融合特征图获取新的初选目标区域框数据;
    将经过所述迭代得到的所述初选目标区域框数据作为所述待处理的图像中的目标区域框数据。
  7. 根据权利要求6所述的方法,其中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
  8. 根据权利要求7所述的方法,其中,所述物体区域框回归操作包括:
    通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;
    根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;
    从所述新的融合特征图获取所述新的初选目标区域框数据。
  9. 根据权利要求8所述的方法,其中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,所述从所述新的融合特征图获取所述新的初选目标区域框数据包括:
    通过所述第五卷积层对所述新的融合特征图进行规范化卷积,
    从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
  10. 根据权利要求7-9中任一项所述的方法,其中,所述第一子网和所述第二子网均为认知―样本归一化(Inception-BN)网络结构,所述第三子网为残差网络(ResNet)结构。
  11. 一种神经网络的训练方法,包括:
    将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;
    根据所述多个融合特征图获取所述样本图像的目标区域框数据;
    根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据;
    根据所述第一差异数据调整所述深度卷积神经网络的网络参数。
  12. 根据权利要求11所述的方法,其中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
  13. 根据权利要求12所述的方法,其中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
  14. 根据权利要求13所述的方法,其中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;
    所述检测获取多个融合特征图包括:
    从所述多个第三卷积层的输出端分别获取所述融合特征图。
  15. 根据权利要求11-14中任一项所述的方法,其中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。
  16. 根据权利要求11-15中任一项所述的方法,其中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
  17. 根据权利要求16所述的方法,其中,所述方法还包括:
    迭代地执行以下目标区域框回归训练操作,直到所述迭代满足迭代终止条件为止:
    通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;
    根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;
    从所述新的融合特征图获取所述新的初选目标区域框数据;
    根据未经过调整的初选目标区域框数据和新的初选目标区域框数据之间的框回归数据、新的初选目标区域框数据和相应的目标区域框标注信息确定物体框检测的第二差异数据;
    根据所述第二差异数据调整所述第三子网的网络参数。
  18. 根据权利要求17所述的方法,其中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,
    所述从所述新的融合特征图获取所述新的初选目标区域框数据包括:
    通过所述第五卷积层对所述新的融合特征图进行规范化卷积,
    从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
  19. 根据权利要求11-18中任一项所述的方法,其中,在将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图之前,所述方法还包括:
    缩放所述样本图像,使得至少一个物体区域框的真值被物体探测框覆盖。
  20. 根据权利要求16-19中任一项所述的方法,其中,所述样本图像的目标区域框标注信息包括正样本区域框的标注信息和负样本区域框的标注信息;
    所述正样本区域框与物体区域框的真值的重叠率不低于第一重叠比率值,所述负样本区域框与物体区域框的真值的重叠率不高于第二重叠比率值,所述第一重叠比率值大于所述第二重叠比率值。
  21. 根据权利要求20所述的方法,其中,所述样本图像的目标区域框标注信息还包括中性样本区域框的标注信息,所述中性样本区域框与物体区域框的真值的重叠率在所述第一重叠比率值和所述第二重叠比率值之间。
  22. 根据权利要求21所述的方法,其中,在全部所述样本图像当中,
    标注的正样本区域框的总和在所述正样本区域框、负样本区域框以及中性样本区域框的框总数中的占比不小于预定的第一比值,所述第一比值大于50%;
    标注的负样本区域框的总和在框总数中的占比不大于预定的第二比值;
    标注的中性样本区域框的总和在框总数中的占比不大于预定的第三比值,所述第三比值不大于第一比值和第二比值之和的一半。
  23. 根据权利要求16-22中任一项所述的方法,其中,所述第一子网和所述第二子网均为认知―样本归一化网络结构,所述第三子网为残差网络结构。
  24. 一种物体检测装置,包括:
    融合特征图预测模块,用于通过用于目标区域框检测的深度卷积神经网络,从待处理的图像预测获取多个融合特征图;其中,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;
    目标区域框预测模块,用于根据所述融合特征图预测模块获取的多个融合特征图获取目标区域框数据。
  25. 根据权利要求24所述的装置,其中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
  26. 根据权利要求25所述的装置,其中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
  27. 根据权利要求26所述的装置,其中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;
    所述融合特征图预测模块用于从所述多个第三卷积层的输出端分别获取所述融合特征图。
  28. 根据权利要求24-27中任一项所述的装置,其中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息,
    所述目标区域框预测模块用于根据至少一个所述融合特征图中的框融合检测数据以及预测准确信息分别获取与所述融合特征图各自对应的目标区域框数据。
  29. 根据权利要求24-28中任一项所述的装置,其中,所述目标区域框预测模块用于:
    分别获取所述融合特征图各自对应的初选目标区域框数据;
    迭代地执行以下物体区域框回归操作,直到所述迭代满足迭代终止条件为止:通过调整所述融合特征图,从经过调整的融合特征图获取新的初选目标区域框数据;
    将经过所述迭代得到的所述初选目标区域框数据作为所述待处理的图像中的目标区域框数据。
  30. 根据权利要求29所述的装置,其中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
  31. 根据权利要求30所述的装置,其中,所述目标区域框预测模块包括:
    框调整单元,用于通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;
    区域池化单元,用于根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;
    初选框获取单元,用于从所述新的融合特征图获取所述新的初选目标区域框数据。
  32. 根据权利要求31所述的装置,其中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,
    所述初选框获取单元用于通过所述第五卷积层对所述新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
  33. 根据权利要求30-32中任一项所述的装置,其中,所述第一子网和所述第二子网均为认知―样本归一化(Inception-BN)网络结构,所述第三子网为残差网络(ResNet)结构。
  34. 一种神经网络的训练装置,包括:
    融合特征图检测模块,用于将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图,所述深度卷积神经网络包括第一子网和第二子网,所述第一子网具有至少一个下采样层,所述第二子网具有至少一个上采样层;所述融合特征图通过第一特征图和第二特征图得到,所述第一特征图从第一子网获取得到,所述第二特征图从第二子网获取得到;
    目标区域框检测模块,用于根据所述多个融合特征图获取所述样本图像的目标区域框数据;
    第一差异获取模块,用于根据获取到的所述样本图像的目标区域框数据以及所述目标区域框标注信息确定物体框检测的第一差异数据;
    第一网络训练模块,用于根据所述第一差异数据调整所述深度卷积神经网络的网络参数。
  35. 根据权利要求34所述的装置,其中,所述第二子网设置在所述第一子网的末端,所述第一子网具有多个第一卷积层和所述至少一个下采样层,所述下采样层设置在所述多个第一卷积层之间,所述第二子网具有多个第二卷积层和所述至少一个上采样层,所述上采样层设置在所述多个第二卷积层之间,所述第一卷积层和所述第二卷积层对称设置,所述至少一个下采样层和所述至少一个上采样层分别对称地设置。
  36. 根据权利要求35所述的装置,其中,在至少一个所述第一卷积层设有用于输出所述第一特征图的第一输出分支,在第二卷积层设有用于输出所述第二特征图的第二输出分支。
  37. 根据权利要求36所述的装置,其中,所述第二子网还具有多个第三卷积层,所述第三卷积层的输入包括所述第一输出分支和所述第二输出分支;
    所述融合特征图检测模块用于从所述多个第三卷积层的输出端分别获取所述融合特征图。
  38. 根据权利要求34-37中任一项所述的装置,其中,所述融合特征图中的至少一个点具有与多个物体探测框对应的框融合检测数据以及预测准确信息。
  39. 根据权利要求34-38中任一项所述的装置,其中,所述深度卷积神经网络还包括第三子网,所述第三子网具有多组第四卷积层和多个池化层,所述多组第四卷积层分别与所述第三卷积层对应,所述多个池化层分别与所述多组第四卷积层对应,并且每个所述池化层的输入包括所述经过调整的融合特征图和所述初选目标区域框的数据。
  40. 根据权利要求39所述的装置,其中,所述装置还包括:
    框回归迭代训练模块,用于迭代地执行以下目标区域框回归训练操作,直到所述迭代满足迭代终止条件为止:
    通过所述第四卷积层分别对当前的所述融合特征图进行卷积,获取调整融合特征图;
    根据当前的初选目标区域框数据,通过所述池化层对所述调整融合特征图进行区域池化,获取新的融合特征图;
    从所述新的融合特征图获取所述新的初选目标区域框数据;
    根据未经过调整的初选目标区域框数据和新的初选目标区域框数据之间的框回归数据、新的初选目标区域框数据和相应的目标区域框标注信息确定物体框检测的第二差异数据;
    根据所述第二差异数据调整所述第三子网的网络参数。
  41. 根据权利要求40所述的装置,其中,所述第三子网还具有设置在所述池化层输出端的第五卷积层,
    所述框回归迭代训练模块用于通过所述第五卷积层对所述新的融合特征图进行规范化卷积,并且从经过规范化卷积的融合特征图获取所述新的初选目标区域框数据。
  42. 根据权利要求39-41中任一项所述的装置,其中,所述装置还包括:
    预处理模块,用于在将含有目标区域框标注信息的样本图像输入用于目标区域框检测的深度卷积神经网络,检测获取多个融合特征图之前,缩放所述样本图像,使得至少一个物体区域框的真值被物体探测框覆盖。
  43. 根据权利要求39-42中任一项所述的装置,其中,所述样本图像的目标区域框标注信息包括正样本区域框的标注信息和负样本区域框的标注信息;
    所述正样本区域框与物体区域框的真值的重叠率不低于第一重叠比率值,所述负样本区域框与物体区域框的真值的重叠率不高于第二重叠比率值,所述第一重叠比率值大于所述第二重叠比率值。
  44. 根据权利要求43所述的装置,其中,所述样本图像的目标区域框标注信息还包括中性样本区域框的标注信息,所述中性样本区域框与物体区域框的真值的重叠率在所述第一重叠比率值和所述第二重叠比率值之间。
  45. 根据权利要求44所述的装置,其中,在全部所述样本图像当中,
    标注的正样本区域框的总和在所述正样本区域框、负样本区域框以及中性样本区域框的框总数中的占比不小于预定的第一比值,所述第一比值大于50%;
    标注的负样本区域框的总和在框总数中的占比不大于预定的第二比值;
    标注的中性样本区域框的总和在框总数中的占比不大于预定的第三比值,所述第三比值不大于第一比值和第二比值之和的一半。
  46. 根据权利要求39-45中任一项所述的装置,其中,所述第一子网和所述第二子网均为认知―样本归一化网络结构,所述第三子网为残差网络结构。
  47. 一种电子设备,包括:处理器和存储器;
    所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行如权利要求1-10任一项所述的物体检测方法对应的操作;或者,所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行如权利要求11-23任一项所述的神经网络的训练方法对应的操作。
  48. 一种电子设备,包括:
    处理器和权利要求24-33任一项所述的物体检测装置;在处理器运行所述物体检测装置时,权利要求24-33任一项所述的物体检测装置中的单元被运行;或者
    处理器和权利要求34-46任一项所述的神经网络的训练装置;在处理器运行所述神经网络的训练装置时,权利要求34-46任一项所述的神经网络的训练装置中的单元被运行。
  49. 一种计算机程序,包括计算机可读代码,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现如权利要求1-10任一项所述的物体检测方法中各步骤的指令;或者
    当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现如权利要求11-23任一项所述的神经网络的训练方法中各步骤的指令。
  50. 一种计算机可读存储介质,用于存储计算机可读取的指令,其特征在于,所述指令被执行时实现如权利要求1-10任一项所述的物体检测方法中各步骤的操作、或者如权利要求11-23任一项所述的神经网络的训练方法中各步骤的操作。
PCT/CN2018/076653 2017-02-23 2018-02-13 物体检测方法、神经网络的训练方法、装置和电子设备 WO2018153319A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SG11201907355XA SG11201907355XA (en) 2017-02-23 2018-02-13 Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
JP2019545345A JP6902611B2 (ja) 2017-02-23 2018-02-13 物体検出方法、ニューラルネットワークの訓練方法、装置および電子機器
US16/314,406 US11321593B2 (en) 2017-02-23 2018-02-13 Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710100676.1 2017-02-23
CN201710100676.1A CN108229455B (zh) 2017-02-23 2017-02-23 物体检测方法、神经网络的训练方法、装置和电子设备

Publications (1)

Publication Number Publication Date
WO2018153319A1 true WO2018153319A1 (zh) 2018-08-30

Family

ID=62657296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076653 WO2018153319A1 (zh) 2017-02-23 2018-02-13 物体检测方法、神经网络的训练方法、装置和电子设备

Country Status (5)

Country Link
US (1) US11321593B2 (zh)
JP (1) JP6902611B2 (zh)
CN (1) CN108229455B (zh)
SG (1) SG11201907355XA (zh)
WO (1) WO2018153319A1 (zh)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097108A (zh) * 2019-04-24 2019-08-06 佳都新太科技股份有限公司 非机动车的识别方法、装置、设备及存储介质
CN110503063A (zh) * 2019-08-28 2019-11-26 东北大学秦皇岛分校 基于沙漏卷积自动编码神经网络的跌倒检测方法
CN111080528A (zh) * 2019-12-20 2020-04-28 北京金山云网络技术有限公司 图像超分辨率和模型训练方法、装置、电子设备及介质
CN111079620A (zh) * 2019-12-10 2020-04-28 北京小蝇科技有限责任公司 基于迁移学习的白细胞图像检测识别模型构建方法及应用
CN111091089A (zh) * 2019-12-12 2020-05-01 新华三大数据技术有限公司 一种人脸图像处理方法、装置、电子设备及存储介质
CN111126421A (zh) * 2018-10-31 2020-05-08 浙江宇视科技有限公司 目标检测方法、装置及可读存储介质
JP2020077393A (ja) * 2018-10-04 2020-05-21 株式会社ストラドビジョン 自動車の車線変更に対する危険を警報する方法及びこれを利用した警報装置
CN111339884A (zh) * 2020-02-19 2020-06-26 浙江大华技术股份有限公司 图像识别方法以及相关设备、装置
CN111353597A (zh) * 2018-12-24 2020-06-30 杭州海康威视数字技术股份有限公司 一种目标检测神经网络训练方法和装置
CN111401396A (zh) * 2019-01-03 2020-07-10 阿里巴巴集团控股有限公司 图像识别方法及装置
CN111881744A (zh) * 2020-06-23 2020-11-03 安徽清新互联信息科技有限公司 一种基于空间位置信息的人脸特征点定位方法及系统
CN112001211A (zh) * 2019-05-27 2020-11-27 商汤集团有限公司 对象检测方法、装置、设备及计算机可读存储介质
CN112101345A (zh) * 2020-08-26 2020-12-18 贵州优特云科技有限公司 一种水表读数识别的方法以及相关装置
CN112686329A (zh) * 2021-01-06 2021-04-20 西安邮电大学 基于双核卷积特征提取的电子喉镜图像分类方法
CN113191235A (zh) * 2021-04-22 2021-07-30 上海东普信息科技有限公司 杂物检测方法、装置、设备及存储介质
CN113496150A (zh) * 2020-03-20 2021-10-12 长沙智能驾驶研究院有限公司 密集目标检测方法、装置、存储介质及计算机设备
US20210365717A1 (en) * 2019-04-22 2021-11-25 Tencent Technology (Shenzhen) Company Limited Method and apparatus for segmenting a medical image, and storage medium
JP2022516398A (ja) * 2019-11-27 2022-02-28 深▲セン▼市商▲湯▼科技有限公司 画像処理方法及び画像処理装置、プロセッサ、電子機器並びに記憶媒体

Families Citing this family (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701210B (zh) * 2016-02-02 2021-08-17 北京市商汤科技开发有限公司 用于cnn网络适配和对象在线追踪的方法和系统
US10496895B2 (en) * 2017-03-28 2019-12-03 Facebook, Inc. Generating refined object proposals using deep-learning models
CN108230294B (zh) * 2017-06-14 2020-09-29 北京市商汤科技开发有限公司 图像检测方法、装置、电子设备和存储介质
US10692243B2 (en) * 2017-12-03 2020-06-23 Facebook, Inc. Optimizations for dynamic object instance detection, segmentation, and structure mapping
CN108985206B (zh) * 2018-07-04 2020-07-28 百度在线网络技术(北京)有限公司 模型训练方法、人体识别方法、装置、设备及存储介质
CN108986891A (zh) * 2018-07-24 2018-12-11 北京市商汤科技开发有限公司 医疗影像处理方法及装置、电子设备及存储介质
CN110163197B (zh) * 2018-08-24 2023-03-10 腾讯科技(深圳)有限公司 目标检测方法、装置、计算机可读存储介质及计算机设备
CN109360633B (zh) * 2018-09-04 2022-08-30 北京市商汤科技开发有限公司 医疗影像处理方法及装置、处理设备及存储介质
CN109376767B (zh) * 2018-09-20 2021-07-13 中国科学技术大学 基于深度学习的视网膜oct图像分类方法
WO2020062191A1 (zh) * 2018-09-29 2020-04-02 华为技术有限公司 图像处理方法、装置及设备
CN109461177B (zh) * 2018-09-29 2021-12-10 浙江科技学院 一种基于神经网络的单目图像深度预测方法
CN109410240A (zh) * 2018-10-09 2019-03-01 电子科技大学中山学院 一种量体特征点定位方法、装置及其存储介质
CN109522966B (zh) * 2018-11-28 2022-09-27 中山大学 一种基于密集连接卷积神经网络的目标检测方法
CN111260548B (zh) * 2018-11-30 2023-07-21 浙江宇视科技有限公司 基于深度学习的贴图方法及装置
CN109543662B (zh) * 2018-12-28 2023-04-21 广州海昇计算机科技有限公司 基于区域提议的目标检测方法、系统、装置和存储介质
CN109800793B (zh) * 2018-12-28 2023-12-22 广州海昇教育科技有限责任公司 一种基于深度学习的目标检测方法和系统
CN111382647B (zh) * 2018-12-29 2021-07-30 广州市百果园信息技术有限公司 一种图片处理方法、装置、设备及存储介质
CN111445020B (zh) * 2019-01-16 2023-05-23 阿里巴巴集团控股有限公司 一种基于图的卷积网络训练方法、装置及系统
US10402977B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Learning method and learning device for improving segmentation performance in road obstacle detection required to satisfy level 4 and level 5 of autonomous vehicles using laplacian pyramid network and testing method and testing device using the same
US10410352B1 (en) * 2019-01-25 2019-09-10 StradVision, Inc. Learning method and learning device for improving segmentation performance to be used for detecting events including pedestrian event, vehicle event, falling event and fallen event using edge loss and test method and test device using the same
US10824947B2 (en) * 2019-01-31 2020-11-03 StradVision, Inc. Learning method for supporting safer autonomous driving without danger of accident by estimating motions of surrounding objects through fusion of information from multiple sources, learning device, testing method and testing device using the same
CN109902634A (zh) * 2019-03-04 2019-06-18 上海七牛信息技术有限公司 一种基于神经网络的视频分类方法以及系统
CN111666960B (zh) * 2019-03-06 2024-01-19 南京地平线机器人技术有限公司 图像识别方法、装置、电子设备及可读存储介质
CN110111299A (zh) * 2019-03-18 2019-08-09 国网浙江省电力有限公司信息通信分公司 锈斑识别方法及装置
CN109978863B (zh) * 2019-03-27 2021-10-08 北京青燕祥云科技有限公司 基于x射线图像的目标检测方法及计算机设备
CN110210474B (zh) 2019-04-30 2021-06-01 北京市商汤科技开发有限公司 目标检测方法及装置、设备及存储介质
CN110084309B (zh) * 2019-04-30 2022-06-21 北京市商汤科技开发有限公司 特征图放大方法、装置和设备及计算机可读存储介质
CN110148157B (zh) * 2019-05-10 2021-02-02 腾讯科技(深圳)有限公司 画面目标跟踪方法、装置、存储介质及电子设备
JP7350515B2 (ja) * 2019-05-22 2023-09-26 キヤノン株式会社 情報処理装置、情報処理方法およびプログラム
CN110163864B (zh) * 2019-05-28 2020-12-04 北京迈格威科技有限公司 图像分割方法、装置、计算机设备和存储介质
CN110288082B (zh) * 2019-06-05 2022-04-05 北京字节跳动网络技术有限公司 卷积神经网络模型训练方法、装置和计算机可读存储介质
CN110263797B (zh) * 2019-06-21 2022-07-12 北京字节跳动网络技术有限公司 骨架的关键点估计方法、装置、设备及可读存储介质
CN110378398B (zh) * 2019-06-27 2023-08-25 东南大学 一种基于多尺度特征图跳跃融合的深度学习网络改进方法
CN112241665A (zh) * 2019-07-18 2021-01-19 顺丰科技有限公司 一种暴力分拣识别方法、装置、设备及存储介质
CN110826403B (zh) * 2019-09-27 2020-11-24 深圳云天励飞技术有限公司 跟踪目标确定方法及相关设备
CN110705479A (zh) * 2019-09-30 2020-01-17 北京猎户星空科技有限公司 模型训练方法和目标识别方法、装置、设备及介质
KR102287947B1 (ko) * 2019-10-28 2021-08-09 삼성전자주식회사 영상의 ai 부호화 및 ai 복호화 방법, 및 장치
CN110826457B (zh) * 2019-10-31 2022-08-19 上海融军科技有限公司 一种复杂场景下的车辆检测方法及装置
CN110852325B (zh) * 2019-10-31 2023-03-31 上海商汤智能科技有限公司 图像的分割方法及装置、电子设备和存储介质
CN111767934B (zh) * 2019-10-31 2023-11-03 杭州海康威视数字技术股份有限公司 一种图像识别方法、装置及电子设备
CN111767935B (zh) * 2019-10-31 2023-09-05 杭州海康威视数字技术股份有限公司 一种目标检测方法、装置及电子设备
CN110796115B (zh) * 2019-11-08 2022-12-23 厦门美图宜肤科技有限公司 图像检测方法、装置、电子设备及可读存储介质
CN111222534B (zh) * 2019-11-15 2022-10-11 重庆邮电大学 一种基于双向特征融合和更平衡l1损失的单发多框检测器优化方法
CN112825248A (zh) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 语音处理方法、模型训练方法、界面显示方法及设备
CN111046917B (zh) * 2019-11-20 2022-08-09 南京理工大学 基于深度神经网络的对象性增强目标检测方法
CN110956119B (zh) * 2019-11-26 2023-05-26 大连理工大学 一种图像中目标检测的方法
CN111104906A (zh) * 2019-12-19 2020-05-05 南京工程学院 一种基于yolo的输电塔鸟巢故障检测方法
CN110751134B (zh) * 2019-12-23 2020-05-12 长沙智能驾驶研究院有限公司 目标检测方法、装置、存储介质及计算机设备
CN111210417B (zh) * 2020-01-07 2023-04-07 创新奇智(北京)科技有限公司 基于卷积神经网络的布匹缺陷检测方法
CN111310633B (zh) * 2020-02-10 2023-05-05 江南大学 基于视频的并行时空注意力行人重识别方法
CN111260019B (zh) * 2020-02-18 2023-04-11 深圳鲲云信息科技有限公司 神经网络模型的数据处理方法、装置、设备及存储介质
CN111340048B (zh) * 2020-02-28 2022-02-22 深圳市商汤科技有限公司 图像处理方法及装置、电子设备和存储介质
CN111767919B (zh) * 2020-04-10 2024-02-06 福建电子口岸股份有限公司 一种多层双向特征提取与融合的目标检测方法
CN111914774A (zh) * 2020-05-07 2020-11-10 清华大学 基于稀疏卷积神经网络的3d物体检测方法及装置
CN111881912A (zh) * 2020-08-19 2020-11-03 Oppo广东移动通信有限公司 数据处理方法、装置以及电子设备
EP4113382A4 (en) 2020-09-15 2023-08-30 Samsung Electronics Co., Ltd. ELECTRONIC DEVICE, ITS CONTROL METHOD AND SYSTEM
KR20220036061A (ko) * 2020-09-15 2022-03-22 삼성전자주식회사 전자 장치, 그 제어 방법 및 전자 시스템
CN112288031A (zh) * 2020-11-18 2021-01-29 北京航空航天大学杭州创新研究院 交通信号灯检测方法、装置、电子设备和存储介质
CN112465226B (zh) * 2020-11-27 2023-01-20 上海交通大学 一种基于特征交互和图神经网络的用户行为预测方法
CN112419292B (zh) * 2020-11-30 2024-03-26 深圳云天励飞技术股份有限公司 病理图像的处理方法、装置、电子设备及存储介质
CN112446378B (zh) * 2020-11-30 2022-09-16 展讯通信(上海)有限公司 目标检测方法及装置、存储介质、终端
CN112418165B (zh) * 2020-12-07 2023-04-07 武汉工程大学 基于改进型级联神经网络的小尺寸目标检测方法与装置
CN112633352B (zh) * 2020-12-18 2023-08-29 浙江大华技术股份有限公司 一种目标检测方法、装置、电子设备及存储介质
CN112801266B (zh) * 2020-12-24 2023-10-31 武汉旷视金智科技有限公司 神经网络构建方法、装置、设备及介质
CN112989919B (zh) * 2020-12-25 2024-04-19 首都师范大学 一种从影像中提取目标对象的方法及系统
CN112766137B (zh) * 2021-01-14 2023-02-10 华南理工大学 一种基于深度学习的动态场景异物入侵检测方法
CN112784742A (zh) * 2021-01-21 2021-05-11 宠爱王国(北京)网络科技有限公司 鼻纹特征的提取方法、装置及非易失性存储介质
CN112906485B (zh) * 2021-01-25 2023-01-31 杭州易享优智能科技有限公司 基于改进的yolo模型的视障人士辅助障碍物感知方法
CN113052165A (zh) * 2021-01-28 2021-06-29 北京迈格威科技有限公司 目标检测方法、装置、电子设备及存储介质
CN112906621A (zh) * 2021-03-10 2021-06-04 北京华捷艾米科技有限公司 一种手部检测方法、装置、存储介质和设备
CN112990317B (zh) * 2021-03-18 2022-08-30 中国科学院长春光学精密机械与物理研究所 一种弱小目标检测方法
US20240161461A1 (en) * 2021-04-01 2024-05-16 Boe Technology Group Co., Ltd. Object detection method, object detection apparatus, and object detection system
CN113139543B (zh) * 2021-04-28 2023-09-01 北京百度网讯科技有限公司 目标对象检测模型的训练方法、目标对象检测方法和设备
CN113298130B (zh) * 2021-05-14 2023-05-09 嘉洋智慧安全科技(北京)股份有限公司 目标图像的检测、目标对象检测模型的生成方法
US11823490B2 (en) * 2021-06-08 2023-11-21 Adobe, Inc. Non-linear latent to latent model for multi-attribute face editing
CN113538351B (zh) * 2021-06-30 2024-01-19 国网山东省电力公司电力科学研究院 一种融合多参数电信号的外绝缘设备缺陷程度评估方法
CN113673578A (zh) * 2021-07-27 2021-11-19 浙江大华技术股份有限公司 图像检测方法、图像检测设备及计算机可读存储介质
CN114005178B (zh) * 2021-10-29 2023-09-01 北京百度网讯科技有限公司 人物交互检测方法、神经网络及其训练方法、设备和介质
CN114549883B (zh) * 2022-02-24 2023-09-05 北京百度网讯科技有限公司 图像处理方法、深度学习模型的训练方法、装置和设备
CN114871115A (zh) * 2022-04-28 2022-08-09 五邑大学 一种物体分拣方法、装置、设备及存储介质
CN115578624A (zh) * 2022-10-28 2023-01-06 北京市农林科学院 农业病虫害模型构建方法、检测方法及装置
CN116994231A (zh) * 2023-08-01 2023-11-03 无锡车联天下信息技术有限公司 一种车内遗留物体的确定方法、装置及电子设备
CN117237746B (zh) * 2023-11-13 2024-03-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413120A (zh) * 2013-07-25 2013-11-27 华南农业大学 基于物体整体性和局部性识别的跟踪方法

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016004330A1 (en) * 2014-07-03 2016-01-07 Oim Squared Inc. Interactive content generation
CN105120130B (zh) 2015-09-17 2018-06-29 京东方科技集团股份有限公司 一种图像升频系统、其训练方法及图像升频方法
US9424494B1 (en) * 2016-01-28 2016-08-23 International Business Machines Corporation Pure convolutional neural network localization
CN106126579B (zh) * 2016-06-17 2020-04-28 北京市商汤科技开发有限公司 物体识别方法和装置、数据处理装置和终端设备
CN106296728B (zh) 2016-07-27 2019-05-14 昆明理工大学 一种基于全卷积网络的非限制场景中运动目标快速分割方法
CN106295678B (zh) * 2016-07-27 2020-03-06 北京旷视科技有限公司 神经网络训练与构建方法和装置以及目标检测方法和装置
CN106355573B (zh) * 2016-08-24 2019-10-25 北京小米移动软件有限公司 图片中目标物的定位方法及装置
CN106447658B (zh) * 2016-09-26 2019-06-21 西北工业大学 基于全局和局部卷积网络的显著性目标检测方法
CN106709532B (zh) 2017-01-25 2020-03-10 京东方科技集团股份有限公司 图像处理方法和装置
CN110647834B (zh) * 2019-09-18 2021-06-25 北京市商汤科技开发有限公司 人脸和人手关联检测方法及装置、电子设备和存储介质
US11367271B2 (en) * 2020-06-19 2022-06-21 Adobe Inc. Similarity propagation for one-shot and few-shot image segmentation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413120A (zh) * 2013-07-25 2013-11-27 华南农业大学 基于物体整体性和局部性识别的跟踪方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GAO, KAI ET AL.: "Depth Map Filtering and Upsampling Method for Virtual View Rendering", JOURNAL OF IMAGE AND GRAPHICS, vol. 18, no. 9, 30 September 2013 (2013-09-30), pages 1085 - 1092 *
JIANG, YINGFENG ET AL.: "A New Multi-scale Image Semantic Understanding Method Based on Deep Learning", JOURNAL OF OPTOELECTRONICS . LASER, vol. 27, no. 2, 29 February 2016 (2016-02-29), pages 224 - 230 *
ZHANG, WENDA ET AL.: "Image Target Recognition Method Based on Multi-scale Block Convolutional Neural Network", JOURNAL OF COMPUTER APPLICATIONS, vol. 36, no. 4, 10 April 2016 (2016-04-10), pages 1033 - 1038 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020077393A (ja) * 2018-10-04 2020-05-21 株式会社ストラドビジョン 自動車の車線変更に対する危険を警報する方法及びこれを利用した警報装置
CN111126421A (zh) * 2018-10-31 2020-05-08 浙江宇视科技有限公司 目标检测方法、装置及可读存储介质
CN111353597B (zh) * 2018-12-24 2023-12-05 杭州海康威视数字技术股份有限公司 一种目标检测神经网络训练方法和装置
CN111353597A (zh) * 2018-12-24 2020-06-30 杭州海康威视数字技术股份有限公司 一种目标检测神经网络训练方法和装置
CN111401396A (zh) * 2019-01-03 2020-07-10 阿里巴巴集团控股有限公司 图像识别方法及装置
CN111401396B (zh) * 2019-01-03 2023-04-18 阿里巴巴集团控股有限公司 图像识别方法及装置
US11887311B2 (en) 2019-04-22 2024-01-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for segmenting a medical image, and storage medium
EP3961484A4 (en) * 2019-04-22 2022-08-03 Tencent Technology (Shenzhen) Company Limited METHOD AND DEVICE FOR SEGMENTING MEDICAL IMAGES, ELECTRONIC DEVICE AND STORAGE MEDIA
US20210365717A1 (en) * 2019-04-22 2021-11-25 Tencent Technology (Shenzhen) Company Limited Method and apparatus for segmenting a medical image, and storage medium
CN110097108B (zh) * 2019-04-24 2021-03-02 佳都新太科技股份有限公司 非机动车的识别方法、装置、设备及存储介质
CN110097108A (zh) * 2019-04-24 2019-08-06 佳都新太科技股份有限公司 非机动车的识别方法、装置、设备及存储介质
CN112001211B (zh) * 2019-05-27 2024-04-19 商汤集团有限公司 对象检测方法、装置、设备及计算机可读存储介质
CN112001211A (zh) * 2019-05-27 2020-11-27 商汤集团有限公司 对象检测方法、装置、设备及计算机可读存储介质
CN110503063B (zh) * 2019-08-28 2021-12-17 东北大学秦皇岛分校 基于沙漏卷积自动编码神经网络的跌倒检测方法
CN110503063A (zh) * 2019-08-28 2019-11-26 东北大学秦皇岛分校 基于沙漏卷积自动编码神经网络的跌倒检测方法
JP2022516398A (ja) * 2019-11-27 2022-02-28 深▲セン▼市商▲湯▼科技有限公司 画像処理方法及び画像処理装置、プロセッサ、電子機器並びに記憶媒体
CN111079620A (zh) * 2019-12-10 2020-04-28 北京小蝇科技有限责任公司 基于迁移学习的白细胞图像检测识别模型构建方法及应用
CN111079620B (zh) * 2019-12-10 2023-10-17 北京小蝇科技有限责任公司 基于迁移学习的白细胞图像检测识别模型构建方法及应用
CN111091089A (zh) * 2019-12-12 2020-05-01 新华三大数据技术有限公司 一种人脸图像处理方法、装置、电子设备及存储介质
CN111091089B (zh) * 2019-12-12 2022-07-29 新华三大数据技术有限公司 一种人脸图像处理方法、装置、电子设备及存储介质
CN111080528B (zh) * 2019-12-20 2023-11-07 北京金山云网络技术有限公司 图像超分辨率和模型训练方法、装置、电子设备及介质
CN111080528A (zh) * 2019-12-20 2020-04-28 北京金山云网络技术有限公司 图像超分辨率和模型训练方法、装置、电子设备及介质
CN111339884A (zh) * 2020-02-19 2020-06-26 浙江大华技术股份有限公司 图像识别方法以及相关设备、装置
CN111339884B (zh) * 2020-02-19 2023-06-06 浙江大华技术股份有限公司 图像识别方法以及相关设备、装置
CN113496150A (zh) * 2020-03-20 2021-10-12 长沙智能驾驶研究院有限公司 密集目标检测方法、装置、存储介质及计算机设备
CN111881744A (zh) * 2020-06-23 2020-11-03 安徽清新互联信息科技有限公司 一种基于空间位置信息的人脸特征点定位方法及系统
CN112101345A (zh) * 2020-08-26 2020-12-18 贵州优特云科技有限公司 一种水表读数识别的方法以及相关装置
CN112686329A (zh) * 2021-01-06 2021-04-20 西安邮电大学 基于双核卷积特征提取的电子喉镜图像分类方法
CN113191235A (zh) * 2021-04-22 2021-07-30 上海东普信息科技有限公司 杂物检测方法、装置、设备及存储介质
CN113191235B (zh) * 2021-04-22 2024-05-17 上海东普信息科技有限公司 杂物检测方法、装置、设备及存储介质

Also Published As

Publication number Publication date
US20190156144A1 (en) 2019-05-23
JP6902611B2 (ja) 2021-07-14
SG11201907355XA (en) 2019-09-27
CN108229455B (zh) 2020-10-16
CN108229455A (zh) 2018-06-29
JP2020509488A (ja) 2020-03-26
US11321593B2 (en) 2022-05-03

Similar Documents

Publication Publication Date Title
WO2018153319A1 (zh) 物体检测方法、神经网络的训练方法、装置和电子设备
US11481869B2 (en) Cross-domain image translation
CN108256479B (zh) 人脸跟踪方法和装置
CN108446698B (zh) 在图像中检测文本的方法、装置、介质及电子设备
WO2019129032A1 (zh) 遥感图像识别方法、装置、存储介质以及电子设备
US10846870B2 (en) Joint training technique for depth map generation
US11270158B2 (en) Instance segmentation methods and apparatuses, electronic devices, programs, and media
WO2018010657A1 (zh) 结构化文本检测方法和系统、计算设备
CN108229591B (zh) 神经网络自适应训练方法和装置、设备、程序和存储介质
US20200334449A1 (en) Object detection based on neural network
EP2660753B1 (en) Image processing method and apparatus
WO2019080747A1 (zh) 目标跟踪及神经网络训练方法、装置、存储介质、电子设备
US20210124928A1 (en) Object tracking methods and apparatuses, electronic devices and storage media
AU2019201358B2 (en) Real time overlay placement in videos for augmented reality applications
CN113674146A (zh) 图像超分辨率
WO2020062494A1 (zh) 图像处理方法和装置
CN116109824A (zh) 基于扩散模型的医学影像及像素级标注生成方法及装置
Yang et al. Robust and real-time pose tracking for augmented reality on mobile devices
CN113570608B (zh) 目标分割的方法、装置及电子设备
CN113516697B (zh) 图像配准的方法、装置、电子设备及计算机可读存储介质
CN112508005B (zh) 用于处理图像的方法、装置、设备以及存储介质
CN113837194A (zh) 图像处理方法、图像处理装置、电子设备以及存储介质
CN111914850B (zh) 图片特征提取方法、装置、服务器和介质
WO2022226744A1 (en) Texture completion
CN111819567A (zh) 使用语义特征来匹配图像的方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18757814

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019545345

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 05.12.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 18757814

Country of ref document: EP

Kind code of ref document: A1