WO2019091464A1 - Target detection method and apparatus, training method, electronic device and medium - Google Patents

Target detection method and apparatus, training method, electronic device and medium

Info

Publication number
WO2019091464A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
detection frame
feature
detection
regression
Prior art date
Application number
PCT/CN2018/114884
Other languages
English (en)
French (fr)
Inventor
李搏
武伟
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to JP2020526040A (granted as JP7165731B2)
Priority to SG11202004324WA
Priority to KR1020207016026A (published as KR20200087784A)
Publication of WO2019091464A1
Priority to US16/868,427 (granted as US11455782B2)
Priority to PH12020550588A

Classifications

    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06V2201/07 Target detection

Definitions

  • the present disclosure relates to computer vision technology, and more particularly to an object detection method and apparatus, training method, electronic device, and medium.
  • Single-target tracking is an important problem in the field of artificial intelligence and can be used in a series of tasks such as autonomous driving and multi-target tracking.
  • the main task of single-target tracking is to specify a target to be tracked in one frame of a video sequence and to keep tracking the specified target in the subsequent frames.
  • Embodiments of the present disclosure provide a technical solution for performing target tracking.
  • a target tracking method including:
  • a training method for a target detection network including:
  • obtaining, from the detection boxes obtained, the detection box of the target object as a predicted detection box, and training the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box.
  • a target detecting apparatus including:
  • a neural network configured to respectively extract features of a template frame and a detection frame, where the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame;
  • a first convolution layer configured to increase the channels of the features of the template frame, using the obtained first feature as the classification weight of the local area detector;
  • a second convolution layer configured to increase the channels of the features of the template frame, using the obtained second feature as the regression weight of the local area detector;
  • a local area detector configured to output classification results and regression results of multiple candidate boxes according to the features of the detection frame;
  • an acquisition unit configured to obtain the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local area detector.
  • an electronic device including the object detecting device of any one of the embodiments of the present disclosure.
  • another electronic device including:
  • a memory for storing executable instructions
  • a processor for communicating with the memory to execute the executable instructions to perform the operations of the method of any of the embodiments of the present disclosure.
  • a computer storage medium for storing computer readable instructions that, when executed, implement the operations of the method of any of the embodiments of the present disclosure.
  • a computer program including computer-readable instructions that, when run in a device, cause a processor in the device to execute executable instructions implementing the steps of the method of any of the embodiments of the present disclosure.
  • the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the features of the detection frame are input into the local area detector to obtain the classification results and regression results of multiple candidate boxes, from which the detection box of the target object in the detection frame is obtained. Since the same neural network, or neural networks with the same structure, extract the features, similar features of the same target object are extracted consistently, so the features of the target object extracted in different frames change little, which helps improve the accuracy of the target object detection result in the detection frame. Since the classification weight and the regression weight of the local area detector are obtained from the features of the template frame, and the local area detector yields the classification results and regression results of the multiple candidate boxes of the detection frame from which the detection box of the target object is obtained, the position and size changes of the target object can be estimated well and the target object can be located in the detection frame more accurately, improving both the speed and the accuracy of target tracking.
  • FIG. 1 is a flow chart of an embodiment of an object detection method of the present disclosure.
  • FIG. 2 is a flow chart of another embodiment of the object detection method of the present disclosure.
  • FIG. 3 is a flow chart of an embodiment of a training method for a target detection network of the present disclosure.
  • FIG. 4 is a flow chart of another embodiment of a training method for a target detection network of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an embodiment of an object detecting apparatus according to the present disclosure.
  • FIG. 6 is a schematic structural view of another embodiment of an object detecting apparatus according to the present disclosure.
  • FIG. 7 is a schematic structural view of still another embodiment of the object detecting device of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an application embodiment of an object detecting apparatus according to the present disclosure.
  • FIG. 9 is a schematic structural diagram of another application embodiment of the object detecting apparatus of the present disclosure.
  • FIG. 10 is a schematic structural diagram of an application embodiment of an electronic device according to the present disclosure.
  • "a plurality" may mean two or more, and "at least one" may mean one, two, or more.
  • Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above, and the like.
  • Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • the target detection method of this embodiment includes:
  • the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame in which the target object needs to be detected, or an area image of the current frame that may contain the target object.
  • when the detection frame is an area image of the current frame that may contain the target object, the area image is larger than the template frame image; for example, the area image may be centered on the center point of the template frame image, with a size 2 to 4 times that of the template frame image (a crop sketch is given below).
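As an illustrative sketch only (not part of the patent text), the search-region crop described above could look as follows; the function name, the `scale` parameter, and the boundary handling are assumptions:

```python
import numpy as np

def crop_search_region(frame, center_xy, template_size, scale=2.0):
    """Crop a search region around the template's center point.

    frame:         H x W x C image array (the current frame)
    center_xy:     (x, y) center point of the template frame image
    template_size: side length of the template frame image, in pixels
    scale:         region is `scale` times the template size (2-4 per the text)
    """
    half = int(template_size * scale) // 2
    x, y = int(center_xy[0]), int(center_xy[1])
    h, w = frame.shape[:2]
    # Clamp the crop to the frame boundaries; the padding strategy for crops
    # that fall outside the frame is not specified by the text.
    x0, x1 = max(0, x - half), min(w, x + half)
    y0, y1 = max(0, y - half), min(h, y + half)
    return frame[y0:y1, x0:x1]
```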
  • the template frame is a frame in the video sequence that precedes the detection frame in time and in which the detection box of the target object has been determined; it may be the start frame of the video sequence on which target tracking is to be performed.
  • the position of the start frame in the video sequence is very flexible; for example, it can be the first frame or any intermediate frame of the sequence.
  • the detection frame is a frame on which target tracking needs to be performed. After the detection box of the target object is determined in the detection frame image, the image corresponding to that detection box can be used as the template frame image for the next detection frame.
  • the features of the template frame and the detection frame may be extracted through the same neural network, or extracted respectively by different neural networks having the same structure (see the sketch below).
  • the operation 102 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
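The shared feature extraction described above can be sketched as follows; this is a minimal PyTorch illustration, and the layer sizes are assumptions since the text does not fix a particular backbone architecture:

```python
import torch.nn as nn

class SiameseBackbone(nn.Module):
    """Feature extractor applied to both the template frame and the detection
    frame with identical weights ("the same neural network, or different
    neural networks having the same structure")."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3), nn.ReLU(),
        )

    def forward(self, template, detection):
        # One set of weights, two inputs: features of the same target change
        # little across frames, which is the point made in the text.
        return self.features(template), self.features(detection)
```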
  • the features of the template frame may be convolved by the first convolution layer, and the first feature obtained by the convolution operation is used as the classification weight of the local area detector.
  • the classification weight of the local area detector may be obtained by increasing the number of channels of the features of the template frame through the first convolution layer to obtain the first feature, where the number of channels of the first feature is 2k times the number of channels of the features of the template frame, and k is an integer greater than zero.
  • the features of the template frame may be convolved by the second convolution layer, and the second feature obtained by the convolution operation is used as the regression weight of the local area detector.
  • the regression weight of the local area detector may be obtained by increasing the number of channels of the features of the template frame through the second convolution layer to obtain the second feature, where the number of channels of the second feature is 4k times the number of channels of the features of the template frame, and k is an integer greater than zero (a sketch follows below).
  • the operation 104 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
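The 2k/4k channel expansion can be sketched as follows; the 3×3 kernel size is an assumption, while the channel ratios follow the text:

```python
import torch.nn as nn

class KernelHeads(nn.Module):
    """First and second convolution layers: lift the template features into the
    classification weight (2k x channels) and the regression weight
    (4k x channels) of the local area detector, k candidate boxes per position."""

    def __init__(self, channels=256, k=5):
        super().__init__()
        self.k = k
        self.cls_conv = nn.Conv2d(channels, channels * 2 * k, kernel_size=3)
        self.reg_conv = nn.Conv2d(channels, channels * 4 * k, kernel_size=3)

    def forward(self, template_feat):
        cls_weight = self.cls_conv(template_feat)  # classification weight
        reg_weight = self.reg_conv(template_feat)  # regression weight
        return cls_weight, reg_weight
```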
  • the classification result includes, for each candidate box, the probability that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
  • the multiple candidate boxes may include K candidate boxes at each position in the detection frame.
  • K is a preset integer greater than one.
  • the K candidate boxes differ in their length-to-width ratios.
  • for example, the length-to-width ratios of the K candidate boxes may include 1:1, 2:1, 1:2, 3:1, 1:3, etc.
  • the classification result indicates, for the K candidate boxes at each position, the probability that each one is the detection box of the target object.
  • after obtaining the classification results of the multiple candidate boxes, the method may further include: normalizing the classification result so that, for each candidate box, the probabilities of it being and not being the detection box of the target object sum to 1, which helps determine whether each candidate box is the detection box of the target object.
  • the regression result includes the offsets of the K candidate boxes at each position in the detection frame image relative to the detection box of the target object in the template frame, where an offset may include changes in position and size; the position may be, for example, the center point position or the positions of the four vertices of the box.
  • for example, the offset of each candidate box relative to the detection box of the target object in the template frame may include the changes in the abscissa and ordinate of the center point position and the changes in size. An anchor-layout sketch follows.
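As a sketch of how the K candidate boxes can be laid out at every position of the detection-frame feature map, the following is illustrative only; the stride, base size, and exact ratio set are assumptions beyond the ratio scheme given above:

```python
import numpy as np

def generate_candidate_boxes(score_size, stride=8, base_size=64,
                             ratios=(1/3, 1/2, 1, 2, 3)):
    """Return (score_size * score_size * K, 4) boxes as (cx, cy, w, h):
    K = len(ratios) candidate boxes with different length-to-width ratios
    at each position of the feature map."""
    area = base_size * base_size
    shapes = []
    for r in ratios:
        w = int(np.sqrt(area / r))   # keep the area fixed while varying
        h = int(np.sqrt(area * r))   # the height-to-width ratio r
        shapes.append((w, h))
    boxes = []
    for i in range(score_size):
        for j in range(score_size):
            cx, cy = j * stride, i * stride  # center of this position
            boxes.extend((cx, cy, w, h) for (w, h) in shapes)
    return np.array(boxes, dtype=np.float32)
```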
  • the operation 106 may include: performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the multiple candidate boxes (see the correlation sketch below).
  • the operation 106 may be performed by a processor invoking a corresponding instruction stored in a memory or by a local area detector being executed by the processor.
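Operation 106 amounts to using the template-derived features as convolution kernels over the detection-frame features. A minimal sketch (batch size 1; the reshape convention is an assumption):

```python
import torch.nn.functional as F

def local_area_detector(det_feat, cls_weight, reg_weight, k=5):
    """det_feat: (1, c, H, W) detection-frame features.
    cls_weight/reg_weight: template-derived first/second features with
    2k*c and 4k*c channels, used here as convolution kernels."""
    c = det_feat.shape[1]
    cls_kernel = cls_weight.reshape(2 * k, c, *cls_weight.shape[-2:])
    reg_kernel = reg_weight.reshape(4 * k, c, *reg_weight.shape[-2:])
    cls_out = F.conv2d(det_feat, cls_kernel)  # (1, 2k, H', W'): per-box scores
    reg_out = F.conv2d(det_feat, reg_kernel)  # (1, 4k, H', W'): per-box offsets
    return cls_out, reg_out
```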
  • the operation 108 may be performed by a processor invoking a corresponding instruction stored in a memory or by an acquisition unit executed by the processor.
  • the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the features of the detection frame are input into the local area detector to obtain the classification results and regression results of multiple candidate boxes, from which the detection box of the target object in the detection frame is obtained. Since the same neural network, or neural networks with the same structure, extract the features, similar features of the same target object are extracted consistently, so the features of the target object extracted in different frames change little, which helps improve the accuracy of the target object detection result in the detection frame. Since the classification weight and the regression weight of the local area detector are obtained from the features of the template frame, the position and size changes of the target object can be estimated well and the target object can be located in the detection frame more accurately, improving both the speed and the accuracy of target tracking.
  • based on the template frame, the local area detector can quickly generate a large number of candidate boxes in the detection frame and obtain the offsets of the K candidate boxes at each position in the detection frame relative to the detection box of the target object in the template frame, so the position and size of the target object can be estimated well and the target object can be located in the detection frame more accurately, improving the speed and accuracy of target tracking.
  • in addition, the method may further include: extracting, via the neural network, features of at least one other detection frame that follows the current detection frame in the video sequence, and performing operation 108 according to the classification results and regression results of the candidate boxes of that frame.
  • when the detection frame is an area image of the current frame that may contain the target object, the method may further include: taking, from the current frame, an area image that is centered on the center point of the template frame and whose length and/or width is correspondingly larger than that of the template frame image, as the detection frame.
  • the target detection method of this embodiment includes:
  • the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame in which the target object needs to be detected, or an area image of the current frame that may contain the target object.
  • the template frame is a frame in the video sequence that precedes the detection frame in time and in which the detection box of the target object has been determined.
  • the features of the template frame and the detection frame may be extracted through the same neural network, or extracted respectively by different neural networks having the same structure.
  • the operation 202 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
  • 204: perform a convolution operation on the features of the detection frame through a third convolution layer to obtain a third feature, where the number of channels of the third feature is the same as that of the features of the detection frame; and perform a convolution operation on the features of the detection frame through a fourth convolution layer to obtain a fourth feature with the same number of channels as the features of the detection frame (a sketch follows below).
  • the operation 204 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a third convolutional layer and a fourth convolutional layer, respectively, executed by the processor.
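A sketch of the third and fourth convolution layers, which keep the channel count of the detection-frame features unchanged (the kernel size is an assumption):

```python
import torch.nn as nn

class AdjustLayers(nn.Module):
    """Channel-preserving convolutions applied to the detection-frame features;
    the third feature feeds the classification correlation and the fourth
    feature feeds the regression correlation."""

    def __init__(self, channels=256):
        super().__init__()
        self.third_conv = nn.Conv2d(channels, channels, kernel_size=3)
        self.fourth_conv = nn.Conv2d(channels, channels, kernel_size=3)

    def forward(self, det_feat):
        return self.third_conv(det_feat), self.fourth_conv(det_feat)
```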
  • the features of the template frame may be convolved by the first convolution layer, and the first feature obtained by the convolution operation is used as the classification weight of the local area detector.
  • the features of the template frame may be convolved by the second convolution layer, and the second feature obtained by the convolution operation is used as the regression weight of the local area detector.
  • the operation 206 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
  • the classification result includes, for each candidate box, the probability that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
  • the operation 208 may be performed by a processor invoking a corresponding instruction stored in a memory, or by a local area detector being executed by the processor.
  • the operation 210 may be performed by a processor invoking a corresponding instruction stored in a memory, or by an acquisition unit executed by the processor.
  • the operation 108 or 210 may include: selecting one candidate box from the multiple candidate boxes according to the classification results and regression results, and regressing the selected candidate box according to its offset to obtain the detection box of the target object in the detection frame.
  • when selecting a candidate box from the multiple candidate boxes according to the classification results and regression results, this may be implemented as follows: select a candidate box from the multiple candidate boxes according to weight coefficients of the classification result and the regression result; for example, for each candidate box compute a composite score as the sum of the product of its probability value with the weight coefficient of the classification result and the product of its offset with the weight coefficient of the regression result, and select a candidate box from the multiple candidate boxes according to the composite scores.
  • in addition, before selecting, the method may further include: adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results; for example, a candidate box with a large change in position (i.e., a large positional movement) or a large change in size (i.e., a large change in shape) is penalized by lowering its probability value.
  • accordingly, a candidate box may then be selected from the multiple candidate boxes according to the adjusted classification results; for example, the candidate box with the highest adjusted probability value is selected. A selection sketch is given below.
  • the above operation of adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results may be performed by a processor invoking corresponding instructions stored in a memory, or by an adjustment unit executed by the processor.
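The selection step above can be sketched as follows. The exponential penalty form and the coefficient values are assumptions; the text only fixes that probabilities are weighted against offsets and that large position/size changes are penalized:

```python
import numpy as np

def select_detection_box(boxes, cls_prob, offsets,
                         w_cls=1.0, w_reg=0.5, penalty_k=0.1):
    """boxes: (N, 4) candidate boxes as (cx, cy, w, h); cls_prob: (N,)
    probabilities; offsets: (N, 4) regression results (dx, dy, dw, dh)."""
    dx, dy, dw, dh = offsets.T
    change = np.abs(dx) + np.abs(dy) + np.abs(dw) + np.abs(dh)
    # Penalize large position/size changes by lowering the probability value.
    adjusted = cls_prob * np.exp(-penalty_k * change)
    # Composite score: weighted probability, with small offsets preferred.
    score = w_cls * adjusted - w_reg * change
    best = int(np.argmax(score))
    # Regress the selected candidate box by its offset to get the detection box.
    cx, cy, w, h = boxes[best]
    return (cx + dx[best] * w, cy + dy[best] * h,
            w * np.exp(dw[best]), h * np.exp(dh[best]))
```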
  • the target detection network of the embodiment of the present disclosure includes the neural network, the first convolutional layer, and the second convolutional layer of the embodiments of the present disclosure.
  • the training method of this embodiment includes:
  • the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame in which the target object needs to be detected, or an area image of the current frame that may contain the target object.
  • the template frame is a frame in the video sequence that precedes the detection frame in time and in which the detection box of the target object has been determined.
  • the features of the template frame and the detection frame may be extracted through the same neural network, or extracted respectively by different neural networks having the same structure.
  • the operation 302 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
  • the operation 304 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
  • the classification result includes, for each candidate box, the probability that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
  • the operation 306 may include: performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the multiple candidate boxes.
  • the operation 306 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a local area detector operated by the processor.
  • the operation 308 may be performed by a processor invoking a corresponding instruction stored in a memory or by an acquisition unit executed by the processor.
  • the detection box of the target object obtained in the detection frame is used as a predicted detection box, and the neural network, the first convolution layer, and the second convolution layer are trained based on the annotation information of the detection frame and the predicted detection box.
  • the operation 310 may be performed by a processor invoking a corresponding instruction stored in a memory or by a training unit executed by the processor.
  • the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, the features of the detection frame are input into the local area detector to obtain the classification results and regression results of the multiple candidate boxes output by the local area detector, the detection box of the target object in the detection frame is obtained according to those classification results and regression results, and the target detection network is trained based on the annotation information of the detection frame and the predicted detection box. Since the same neural network, or neural networks with the same structure, extract the features, similar features of the same target object are extracted consistently, so the features of the target object extracted in different frames change little, which helps improve the accuracy of the target object detection result in the detection frame; since the classification weight and regression weight of the local area detector are obtained based on the features of the template frame, and the local area detector can obtain the classification results and regression results of the multiple candidate boxes of the detection frame from which the detection box of the target object is obtained, the position and size changes of the target object can be estimated well and the target object can be located in the detection frame more accurately, improving the speed and accuracy of target tracking.
  • in addition, the method further includes: extracting, via the neural network, features of at least one other detection frame that follows the current detection frame in the video sequence;
  • the method may further include: taking the center point of the template frame as the center, and intercepting from the current frame an area image whose length and/or width is correspondingly larger than that of the template frame image, as the detection frame.
  • the target detection network of the embodiment of the present disclosure includes the neural network, the first convolutional layer, the second convolutional layer, the third convolutional layer, and the fourth convolutional layer of the embodiments of the present disclosure.
  • the training method of this embodiment includes:
  • the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame in which the target object needs to be detected, or an area image of the current frame that may contain the target object.
  • the template frame is a frame in the video sequence that precedes the detection frame in time and in which the detection box of the target object has been determined.
  • the features of the template frame and the detection frame may be extracted through the same neural network, or extracted respectively by different neural networks having the same structure.
  • the operation 402 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
  • the operation 404 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a third convolutional layer and a fourth convolutional layer, respectively, executed by the processor.
  • the operation 406 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
  • the classification result includes, for each candidate box, the probability that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
  • the operation 408 may be performed by a processor invoking a corresponding instruction stored in a memory or by a local area detector being executed by the processor.
  • the operation 410 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first feature extraction unit 701 that is executed by the processor.
  • the detection box of the target object obtained in the detection frame is used as a predicted detection box, and the weight values of the neural network, the first convolution layer, and the second convolution layer are adjusted according to the differences between the position and size of the annotated detection box of the target object in the detection frame and the position and size of the predicted detection box. A training-step sketch follows.
  • the operation 412 may be performed by a processor invoking a corresponding instruction stored in a memory or by a training unit executed by the processor.
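One training step might look as follows. The particular losses (cross-entropy for classification, smooth L1 for regression) are conventional stand-ins, not fixed by the text, which only requires adjusting the weights by the position/size differences between the annotated and predicted boxes:

```python
import torch.nn.functional as F

def training_step(cls_out, reg_out, labels, reg_targets, optimizer,
                  reg_loss_weight=1.0):
    """cls_out: (N, 2) per-candidate-box logits; labels: (N,) 0/1 annotations;
    reg_out/reg_targets: (N, 4) predicted and annotated offsets.
    The optimizer holds the parameters of the neural network and of the first
    and second convolution layers, so all of them are adjusted end to end."""
    cls_loss = F.cross_entropy(cls_out, labels)
    reg_loss = F.smooth_l1_loss(reg_out, reg_targets)
    loss = cls_loss + reg_loss_weight * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```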
  • the operation 308 or 410 may include: selecting one candidate box from the multiple candidate boxes according to the classification results and regression results, and regressing the selected candidate box according to its offset to obtain the detection box of the target object in the detection frame.
  • when selecting a candidate box from the multiple candidate boxes according to the classification results and regression results, this may be implemented as follows: select a candidate box according to weight coefficients of the classification result and the regression result; for example, for each candidate box compute a composite score as the sum of the product of its probability value with the weight coefficient of the classification result and the product of its offset with the weight coefficient of the regression result, and select, according to the composite scores of the multiple candidate boxes, a candidate box with a high probability value and a small offset.
  • in addition, before selecting, the method may further include: adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results.
  • accordingly, a candidate box may then be selected from the multiple candidate boxes according to the adjusted classification results; for example, the candidate box with the highest adjusted probability value is selected.
  • the above operation of adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results may be performed by a processor invoking corresponding instructions stored in a memory, or by an adjustment unit executed by the processor.
  • the local area detector may include a third convolutional layer, a fourth convolutional layer, and two convolution operation units.
  • the network formed by combining the local area detector with the first convolution layer and the second convolution layer may also be referred to as a region proposal network.
  • the target detection method and the training method of the target detection network provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to: a terminal device, a server, and the like.
  • any target detection method or target detection network training method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any of the target detection methods or target detection network training methods mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. This is not repeated below.
  • the foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • FIG. 5 is a schematic structural diagram of an embodiment of an object detecting apparatus according to the present disclosure.
  • the object detecting device of each embodiment of the present disclosure can be used to implement the above-described various object detecting method embodiments of the present disclosure.
  • the object detecting apparatus of this embodiment includes a neural network, a first convolution layer, a second convolution layer, a local area detector, and an acquisition unit, wherein:
  • the neural network is configured to respectively extract features of the template frame and the detection frame, where the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame.
  • the detection frame is the current frame in which the target object needs to be detected, or an area image of the current frame that may contain the target object.
  • the template frame is a frame in the video sequence that precedes the detection frame in time and in which the detection box of the target object has been determined.
  • the neural networks used to extract the features of the template frame and the detection frame may be the same neural network, or different neural networks having the same structure.
  • the first convolution layer is configured to perform a convolution operation on the features of the template frame, using the first feature obtained by the convolution operation as the classification weight of the local area detector.
  • the second convolution layer is configured to perform a convolution operation on the features of the template frame, using the second feature obtained by the convolution operation as the regression weight of the local area detector.
  • the local area detector is configured to output the classification results and regression results of the multiple candidate boxes according to the features of the detection frame, where the classification result includes, for each candidate box, the probability that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
  • the acquisition unit is configured to obtain the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local area detector.
  • the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the features of the detection frame are input into the local area detector to obtain the classification results and regression results of multiple candidate boxes, from which the detection box of the target object in the detection frame is obtained. Since the same neural network, or neural networks with the same structure, extract the features, similar features of the same target object are extracted consistently, so the features of the target object extracted in different frames change little, which helps improve the accuracy of the target object detection result in the detection frame. Since the classification weight and the regression weight of the local area detector are obtained from the features of the template frame, the position and size changes of the target object can be estimated well and the target object can be located in the detection frame more accurately, improving both the speed and the accuracy of target tracking.
  • the local area detector is configured to: perform a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the multiple candidate boxes; and perform a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the multiple candidate boxes.
  • when the detection frame is an area image of the current frame that may contain the target object, the apparatus may further include: a preprocessing unit configured to take the center point of the template frame as the center and intercept, from the current frame, an area image whose length and/or width is correspondingly larger than that of the template frame image, as the detection frame.
  • FIG. 6 is a schematic structural view of another embodiment of the target detecting device of the present disclosure.
  • the apparatus further includes: a third convolution layer configured to perform a convolution operation on the features of the detection frame to obtain a third feature, where the number of channels of the third feature is the same as that of the features of the detection frame; accordingly, the local area detector is configured to perform the convolution operation on the third feature using the classification weight.
  • the apparatus further includes: a fourth convolution layer configured to perform a convolution operation on the features of the detection frame to obtain a fourth feature, where the number of channels of the fourth feature is the same as that of the features of the detection frame; accordingly, the local area detector is configured to perform the convolution operation on the fourth feature using the regression weight.
  • the acquisition unit is configured to: select one candidate box from the multiple candidate boxes according to the classification results and regression results, and regress the selected candidate box according to its offset to obtain the detection box of the target object in the detection frame.
  • when selecting a candidate box from the multiple candidate boxes according to the classification results and regression results, the acquisition unit is configured to select a candidate box from the multiple candidate boxes according to the weight coefficients of the classification result and the regression result.
  • the apparatus further includes: an adjustment unit configured to adjust the classification results according to the regression results; accordingly, the acquisition unit selects a candidate box from the multiple candidate boxes according to the adjusted classification results.
  • FIG. 7 is a schematic structural diagram of still another embodiment of the object detecting apparatus of the present disclosure.
  • the object detecting apparatus of this embodiment can be used to implement the training method embodiment of any of the target detecting networks of FIGS. 3 to 4 of the present disclosure.
  • the object detecting apparatus of this embodiment further includes: a training unit configured to take the detection box of the target object obtained in the detection frame as a predicted detection box, and to train the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box.
  • the annotation information of the detection frame includes: the position and size of the annotated detection box of the target object in the detection frame.
  • the training unit is configured to adjust the weight values of the neural network, the first convolution layer, and the second convolution layer according to the differences between the position and size of the annotated detection box and the position and size of the predicted detection box.
  • the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, the features of the detection frame are input into the local area detector to obtain the classification results and regression results of the multiple candidate boxes, the detection box of the target object in the detection frame is obtained accordingly, and the target detection network is trained based on the annotation information and the predicted detection box. Since the same neural network, or neural networks with the same structure, extract the features, similar features of the same target object are extracted consistently, so the features of the target object extracted in different frames change little, which helps improve the accuracy of the target object detection result in the detection frame; since the classification weight and regression weight of the local area detector are obtained based on the features of the template frame, and the local area detector can obtain the classification results and regression results of the multiple candidate boxes of the detection frame from which the detection box of the target object is obtained, the position and size changes of the target object can be estimated well and the target object can be located in the detection frame more accurately, improving the speed and accuracy of target tracking.
  • FIG. 8 is a schematic structural diagram of an application embodiment of the target detecting device of the present disclosure.
  • FIG. 9 is a schematic structural diagram of another application embodiment of the target detecting device of the present disclosure.
  • in the application embodiments of FIGS. 8 and 9, a feature of size L×M×N (for example, 256×20×20) has L channels, where M and N denote the height (i.e., length) and width, respectively.
  • An embodiment of the present disclosure further provides an electronic device comprising the object detecting device of any of the above embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides another electronic device, including: a memory for storing executable instructions; and a processor for communicating with the memory to execute the executable instructions so as to perform the operations of the target detection method or the target detection network training method of any of the above embodiments of the present disclosure.
  • FIG. 10 is a schematic structural diagram of an application embodiment of an electronic device according to the present disclosure.
  • the electronic device includes one or more processors, a communication portion, and the like; the processors are, for example, one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs); the processor can perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage portion into a random access memory (RAM).
  • the communication portion may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory and/or the random access memory to execute the executable instructions, connect to the communication portion through a bus, and communicate with other target devices via the communication portion, so as to complete the operations corresponding to any method provided by the embodiments of the present application, for example: extracting features of a template frame and a detection frame respectively via a neural network, where the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame; obtaining the classification weight and regression weight of a local area detector based on the features of the template frame; inputting the features of the detection frame into the local area detector to obtain the classification results and regression results of multiple candidate boxes output by the local area detector; and obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local area detector.
  • for another example: extracting features of the template frame and the detection frame respectively via the neural network, where the template frame is the detection box image of the target object, and the image size of the template frame is smaller than that of the detection frame; increasing the channels of the features of the template frame through a first convolution layer, using the obtained first feature as the classification weight of the local area detector; increasing the channels of the features of the template frame through a second convolution layer, using the obtained second feature as the regression weight of the local area detector; inputting the features of the detection frame into the local area detector to obtain the classification results and regression results of the multiple candidate boxes output by the local area detector; obtaining the detection box of the target object in the detection frame according to those classification results and regression results; and, taking the obtained detection box of the target object in the detection frame as a predicted detection box, training the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box.
  • the CPU, ROM, and RAM are connected to each other through a bus.
  • the ROM is an optional module.
  • the RAM stores executable instructions, or executable instructions are written into the RAM at runtime; the executable instructions cause the processor to perform the operations corresponding to any of the methods described above.
  • An input/output (I/O) interface is also connected to the bus.
  • the communication unit may be integrated, or may be provided with multiple sub-modules (e.g., multiple IB network cards) attached to the bus link.
  • the following components are connected to the I/O interface: an input portion including a keyboard, a mouse, and the like; an output portion including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion including a hard disk and the like; and a communication portion including a network interface card such as a LAN card or a modem.
  • the communication section performs communication processing via a network such as the Internet.
  • the drive is also connected to the I/O interface as needed.
  • a removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive as needed so that a computer program read therefrom is installed into the storage portion as needed.
  • FIG. 10 is only an optional implementation manner.
  • the number and types of the components shown in FIG. 10 may be selected, reduced, added, or replaced according to actual needs;
  • different functional components may also be implemented separately or in an integrated manner; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication portion may be provided separately, or integrated on the CPU or GPU, and so on.
  • in addition, an embodiment of the present disclosure further provides a computer storage medium for storing computer-readable instructions that, when executed, implement the operations of the target detection method or the target detection network training method of the foregoing embodiments of the present disclosure.
  • embodiments of the present disclosure also provide a computer program including computer-readable instructions that, when run in a device, cause a processor in the device to execute executable instructions implementing the steps of the target detection method or the target detection network training method of the above embodiments.
  • Embodiments of the present disclosure may perform single-target tracking. For example, in a multi-target tracking system, full target detection need not be performed on every frame; instead, a fixed detection interval may be used, for example every 10 frames, with the 9 intermediate frames handled by single-target tracking to determine the target positions in those frames. Since the algorithm of the embodiments of the present disclosure is fast, the multi-target tracking system as a whole can complete tracking faster and achieve better results. A scheduling sketch follows.
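A sketch of that detection/tracking interleaving; `detect` and `track` are assumed callables, not APIs defined by the patent:

```python
def track_video(frames, detect, track, interval=10):
    """Run full detection every `interval` frames and single-target tracking
    on the frames in between, as in the multi-target example above."""
    results, template = [], None
    for i, frame in enumerate(frames):
        if i % interval == 0:
            box = detect(frame)            # full detection on key frames
        else:
            box = track(template, frame)   # single-target tracking in between
        template = (frame, box)            # tracked box seeds the next frame
        results.append(box)
    return results
```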
  • the foregoing program may be stored in a computer readable storage medium, and the program is executed when executed.
  • the foregoing steps include the steps of the foregoing method embodiments; and the foregoing storage medium includes: a medium that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • the methods and apparatuses of the present disclosure may be implemented in many ways.
  • the methods and apparatuses of the present disclosure may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order of the steps of the method is for illustration only; the steps of the method of the present disclosure are not limited to the order described above unless otherwise specifically stated.
  • the present disclosure may also be implemented as programs recorded in a recording medium, the programs comprising machine-readable instructions for implementing a method according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure disclose a target detection method and apparatus, a training method, an electronic device, and a medium. The target detection method includes: separately extracting features of a template frame and a detection frame via a neural network, where the template frame is a detection-box image of a target object and the image size of the template frame is smaller than that of the detection frame; obtaining classification weights and regression weights of a local region detector based on the features of the template frame; inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector; and obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector. Embodiments of the present disclosure can improve the speed and accuracy of target tracking.

Description

Target detection method and apparatus, training method, electronic device, and medium
This application claims priority to the Chinese patent application No. CN201711110587.1, entitled "Target detection method and apparatus, training method, electronic device, program, and medium", filed with the Chinese Patent Office on November 12, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to computer vision technology, and in particular to a target detection method and apparatus, a training method, an electronic device, and a medium.
Background
Single-target tracking is an important problem in the field of artificial intelligence and can be used in a range of tasks such as autonomous driving and multi-target tracking. The main task of single-target tracking is: given a target specified in one frame image of a video sequence, keep tracking that specified target in the subsequent frame images.
Summary
Embodiments of the present disclosure provide technical solutions for target tracking.
According to one aspect of the embodiments of the present disclosure, a target tracking method is provided, including:
separately extracting features of a template frame and a detection frame via a neural network, where the template frame is a detection-box image of a target object, and the image size of the template frame is smaller than that of the detection frame;
obtaining classification weights and regression weights of a local region detector based on the features of the template frame;
inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector;
obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
According to another aspect of the embodiments of the present disclosure, a training method of a target detection network is provided, including:
separately extracting features of a template frame and a detection frame via a neural network, where the template frame is a detection-box image of a target object, and the image size of the template frame is smaller than that of the detection frame;
increasing the channels of the features of the template frame through a first convolution layer, the obtained first feature being used as the classification weight of a local region detector; and increasing the channels of the features of the template frame through a second convolution layer, the obtained second feature being used as the regression weight of the local region detector;
inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector;
obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector;
using the obtained detection box of the target object in the detection frame as a predicted detection box, and training the neural network, the first convolution layer, and the second convolution layer based on annotation information of the detection frame and the predicted detection box.
According to yet another aspect of the embodiments of the present disclosure, a target detection apparatus is provided, including:
a neural network configured to separately extract features of a template frame and a detection frame, where the template frame is a detection-box image of a target object, and the image size of the template frame is smaller than that of the detection frame;
a first convolution layer configured to increase the channels of the features of the template frame, the obtained first feature being used as the classification weight of a local region detector;
a second convolution layer configured to increase the channels of the features of the template frame, the obtained second feature being used as the regression weight of the local region detector;
a local region detector configured to output classification results and regression results of multiple candidate boxes according to the features of the detection frame;
an obtaining unit configured to obtain a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
According to still another aspect of the embodiments of the present disclosure, an electronic device is provided, including the target detection apparatus according to any of the embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, another electronic device is provided, including:
a memory configured to store executable instructions; and
a processor configured to communicate with the memory to execute the executable instructions so as to complete the operations of the method according to any of the embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, a computer storage medium is provided for storing computer-readable instructions which, when executed, implement the operations of the method according to any of the embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, a computer program is provided, comprising computer-readable instructions; when the computer-readable instructions run in a device, a processor in the device executes executable instructions for implementing the steps of the method according to any of the embodiments of the present disclosure.
Based on the above embodiments of the present disclosure, features of a template frame and a detection frame are separately extracted via a neural network; classification weights and regression weights of a local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector; and a detection box of the target object in the detection frame is obtained according to these classification and regression results. In the embodiments of the present disclosure, the same neural network, or neural networks having the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames vary less, which helps improve the accuracy of the detection result of the target object in the detection frame. Since the classification and regression weights of the local region detector are obtained based on the features of the template frame, the local region detector can obtain the classification and regression results of multiple candidate boxes of the detection frame and then the detection box of the target object in the detection frame; changes in the position and size of the target object can thus be better estimated, and the position of the target object in the detection frame can be found more precisely, which improves the speed and accuracy of target tracking, with good tracking performance and high speed.
The technical solutions of the present disclosure are further described in detail below through the accompanying drawings and embodiments.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
With reference to the accompanying drawings, the present disclosure can be understood more clearly from the following detailed description, in which:
FIG. 1 is a flowchart of one embodiment of the target detection method of the present disclosure.
FIG. 2 is a flowchart of another embodiment of the target detection method of the present disclosure.
FIG. 3 is a flowchart of one embodiment of the training method of the target detection network of the present disclosure.
FIG. 4 is a flowchart of another embodiment of the training method of the target detection network of the present disclosure.
FIG. 5 is a schematic structural diagram of one embodiment of the target detection apparatus of the present disclosure.
FIG. 6 is a schematic structural diagram of another embodiment of the target detection apparatus of the present disclosure.
FIG. 7 is a schematic structural diagram of yet another embodiment of the target detection apparatus of the present disclosure.
FIG. 8 is a schematic structural diagram of one application embodiment of the target detection apparatus of the present disclosure.
FIG. 9 is a schematic structural diagram of another application embodiment of the target detection apparatus of the present disclosure.
FIG. 10 is a schematic structural diagram of one application embodiment of the electronic device of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
It should also be understood that, in the embodiments of the present disclosure, "multiple" may refer to two or more, and "at least one" may refer to one, two, or more.
Those skilled in the art will appreciate that terms such as "first" and "second" in the embodiments of the present application are merely used to distinguish different steps, devices, or modules, and represent neither any specific technical meaning nor any necessary logical order among them.
It should also be understood that any component, data, or structure mentioned in the embodiments of the present disclosure may generally be understood as one or more, unless explicitly defined otherwise or the context indicates otherwise.
It should also be understood that the description of the embodiments of the present disclosure emphasizes the differences between the embodiments; for their identical or similar parts, the embodiments may refer to one another, and these parts are not repeated for brevity.
Meanwhile, it should be understood that, for ease of description, the dimensions of the parts shown in the accompanying drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like, which perform specific tasks or implement specific abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing-system storage media including storage devices.
FIG. 1 is a flowchart of one embodiment of the target detection method of the present disclosure. As shown in FIG. 1, the target detection method of this embodiment includes:
102: separately extracting features of the template frame and the detection frame via a neural network.
Here, the template frame is a detection-box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target-object detection is to be performed, or a region image of the current frame that may contain the target object. When the detection frame is such a region image, in one implementation of the embodiments of the present disclosure, the region image is larger than the image size of the template frame; for example, the region image may take the center point of the template frame image as its center point and may be 2-4 times the size of the template frame image.
In one implementation of the embodiments of the present disclosure, the template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined; it may be the starting frame for target tracking in the video sequence, and the position of this starting frame in the video frame sequence is very flexible; for example, it may be the first frame or any intermediate frame of the video frame sequence. The detection frame is a frame on which target tracking is to be performed; once the detection box of the target object is determined in the detection frame image, the image corresponding to that detection box can serve as the template frame image for the next detection frame.
In one implementation of the embodiments of the present disclosure, in operation 102, the features of the template frame and the detection frame may be extracted via the same neural network, or via different neural networks having the same structure.
In an optional example, operation 102 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.
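For illustration only, the following is a minimal PyTorch sketch of this shared feature extraction. The backbone architecture and the input sizes (a 127x127 template, a 255x255 detection region) are assumptions made for the sketch and are not prescribed by the present disclosure.

    import torch
    import torch.nn as nn

    # Illustrative stand-in backbone; the disclosure does not specify the
    # actual network architecture.
    backbone = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
        nn.Conv2d(64, 256, kernel_size=3, stride=2), nn.ReLU(),
    )

    template = torch.randn(1, 3, 127, 127)   # template frame (detection-box image)
    detection = torch.randn(1, 3, 255, 255)  # detection frame (larger search region)

    # The same network (shared weights) extracts both features, so the same
    # target object yields similar features in different frames.
    template_feat = backbone(template)    # e.g. (1, 256, 30, 30)
    detection_feat = backbone(detection)  # e.g. (1, 256, 62, 62)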
104: obtaining classification weights and regression weights of the local region detector based on the features of the template frame.
In one implementation of the embodiments of the present disclosure, a convolution operation may be performed on the features of the template frame through a first convolution layer, and the first feature obtained by the convolution operation is used as the classification weight of the local region detector.
For example, in one optional example, the classification weight of the local region detector may be obtained as follows: the number of channels of the features of the template frame is increased through the first convolution layer to obtain the first feature, the number of channels of the first feature being 2k times the number of channels of the features of the template frame, where k is an integer greater than 0.
In one implementation of the embodiments of the present disclosure, a convolution operation may be performed on the features of the template frame through a second convolution layer, and the second feature obtained by the convolution operation is used as the regression weight of the local region detector.
For example, in one optional example, the regression weight of the local region detector may be obtained as follows: the number of channels of the features of the template frame is increased through the second convolution layer to obtain the second feature, the number of channels of the second feature being 4k times the number of channels of the features of the template frame, where k is an integer greater than 0.
In an optional example, operation 104 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer run by the processor, respectively.
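Continuing the sketch above (the value of k, i.e. the number of candidate boxes per position, and the kernel size are assumptions of the sketch), the first and second convolution layers raise the channel count of the template features by factors of 2k and 4k, respectively:

    C, k = 256, 5  # illustrative channel count and candidate boxes per position

    # First convolution layer: C -> 2k*C channels; its output (the "first
    # feature") serves as the classification weight of the detector.
    conv_cls = nn.Conv2d(C, 2 * k * C, kernel_size=3)

    # Second convolution layer: C -> 4k*C channels; its output (the "second
    # feature") serves as the regression weight of the detector.
    conv_reg = nn.Conv2d(C, 4 * k * C, kernel_size=3)

    first_feature = conv_cls(template_feat)   # (1, 2k*C, h, w)
    second_feature = conv_reg(template_feat)  # (1, 4k*C, h, w)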
106: inputting the features of the detection frame into the local region detector to obtain the classification results and regression results of multiple candidate boxes output by the local region detector.
Here, the classification results include the probability value of each candidate box being the detection box of the target object, and the regression results include the offset of each candidate box relative to the detection box corresponding to the template frame.
In one optional example of the embodiments of the present disclosure, the multiple candidate boxes may include K candidate boxes at each position of the detection frame, where K is a preset integer greater than 1. The K candidate boxes have different length-to-width ratios; for example, the length-to-width ratios of the K candidate boxes may include 1:1, 2:1, 1:2, 3:1, 1:3, and so on. The classification results represent the probability values of the K candidate boxes at each position being or not being the detection box of the target object.
In one optional embodiment of the target detection method of the present disclosure, after the probability values of the multiple candidate boxes being the detection box of the target object are obtained through operation 106, the method may further include: normalizing the classification results so that, for each candidate box, the probability values of being and of not being the detection box of the target object sum to 1, which helps determine whether each candidate box is the detection box of the target object.
In one optional example of the embodiments of the present disclosure, the regression results include the offsets of the K candidate boxes at each position of the detection frame image relative to the detection box of the target object in the template frame; these offsets may include changes in position and size, where the position may be the position of the center point, or the positions of the four vertices of the reference box, and so on.
When the number of channels of the second feature is 4k times the number of channels of the features of the template frame, the offset of each candidate box relative to the detection box of the target object in the template frame may include, for example, the offset (dx) of the abscissa of the center-point position, the offset (dy) of the ordinate of the center-point position, the change (dh) in height, and the change (dw) in width.
In one implementation of the embodiments of the present disclosure, operation 106 may include: performing a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes.
In an optional example, operation 106 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.
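As one plausible realization of "using a feature as a convolution weight" (the exact reshaping below is an assumption of this sketch, not mandated by the disclosure), the template-derived features can be reshaped into convolution kernels and slid over the detection-frame features:

    import torch.nn.functional as F

    def local_region_detector(det_feat, first_feature, second_feature, k, C):
        # Reshape the template-derived features into kernels:
        # (1, 2k*C, h, w) -> (2k, C, h, w); (1, 4k*C, h, w) -> (4k, C, h, w).
        _, _, h, w = first_feature.shape
        cls_weight = first_feature.view(2 * k, C, h, w)
        reg_weight = second_feature.view(4 * k, C, h, w)

        # Convolve the detection-frame features with the template-derived
        # weights to score every candidate box at every position.
        cls_out = F.conv2d(det_feat, cls_weight)  # (1, 2k, H, W) scores
        reg_out = F.conv2d(det_feat, reg_weight)  # (1, 4k, H, W) offsets

        # Normalize so that, per candidate box, the two probabilities
        # (is / is not the target's detection box) sum to 1.
        n, _, H, W = cls_out.shape
        cls_prob = F.softmax(cls_out.view(n, 2, k, H, W), dim=1)
        return cls_prob, reg_out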
108: obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
In an optional example, operation 108 may be performed by a processor invoking corresponding instructions stored in a memory, or by an obtaining unit run by the processor.
Based on the target detection method of the above embodiments of the present disclosure, features of the template frame and the detection frame are separately extracted via a neural network; classification and regression weights of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain the classification results and regression results of multiple candidate boxes output by the local region detector; and the detection box of the target object in the detection frame is obtained according to these classification and regression results. In the embodiments of the present disclosure, the same neural network, or neural networks having the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames vary less, which helps improve the accuracy of the detection result of the target object in the detection frame. Since the classification and regression weights of the local region detector are obtained based on the features of the template frame, the local region detector can obtain the classification and regression results of multiple candidate boxes of the detection frame and then the detection box of the target object; changes in the position and size of the target object can thus be better estimated, and the position of the target object in the detection frame can be found more precisely, which improves the speed and accuracy of target tracking, with good tracking performance and high speed.
Based on the template frame, the local region detector of the embodiments of the present disclosure can quickly generate a large number of candidate boxes from the detection frame and obtain the offsets of the K candidate boxes at each position of the detection frame relative to the detection box of the target object in the template frame; changes in the position and size of the target object can thus be better estimated, and the position of the target object in the detection frame can be found more precisely, improving the speed and accuracy of target tracking, with good tracking performance and high speed.
Another embodiment of the target detection method of the present disclosure may further include:
extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame;
inputting the features of the at least one other detection frame into the local region detector in sequence, to successively obtain the multiple candidate boxes in the at least one other detection frame output by the local region detector, as well as the classification results and regression results of each candidate box, i.e., performing operation 106 on the features of the at least one other detection frame in sequence;
obtaining the detection box of the target object in the at least one other detection frame in sequence according to the classification results and regression results of the multiple candidate boxes of the at least one other detection frame, i.e., performing operation 108 on the classification and regression results of the multiple candidate boxes of the at least one other detection frame in sequence.
In yet another embodiment of the target detection method of the present disclosure, when the detection frame is a region image of the current frame that may contain the target object, the method may further include: taking the center point of the template frame as the center point in advance, and cropping from the current frame a region image whose length and/or width correspondingly exceeds the image length and/or width of the template frame, as the detection frame.
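An illustrative crop of such a region is sketched below; the boundary handling and the scale factor are assumptions of the sketch:

    def crop_search_region(frame, cx, cy, tpl_w, tpl_h, scale=2.0):
        """Crop a region centered at the template's center point (cx, cy)
        that is `scale` times the template box, clamped to frame bounds;
        `frame` is an H x W x 3 array."""
        H, W = frame.shape[:2]
        w, h = int(tpl_w * scale), int(tpl_h * scale)
        x0, y0 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
        x1, y1 = min(W, x0 + w), min(H, y0 + h)
        return frame[y0:y1, x0:x1]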
FIG. 2 is a flowchart of another embodiment of the target detection method of the present disclosure. As shown in FIG. 2, the target detection method of this embodiment includes:
202: separately extracting features of the template frame and the detection frame via a neural network.
Here, the template frame is a detection-box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target-object detection is to be performed, or a region image of the current frame that may contain the target object. The template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined.
In one implementation of the embodiments of the present disclosure, in operation 202, the features of the template frame and the detection frame may be extracted via the same neural network, or via different neural networks having the same structure.
In an optional example, operation 202 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.
204: performing a convolution operation on the features of the detection frame through a third convolution layer to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the features of the detection frame; and performing a convolution operation on the features of the detection frame through a fourth convolution layer to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the features of the detection frame.
In an optional example, operation 204 may be performed by a processor invoking corresponding instructions stored in a memory, or by the third convolution layer and the fourth convolution layer run by the processor, respectively.
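Continuing the running sketch, the third and fourth convolution layers keep the channel count unchanged (the kernel size is an assumption):

    # Third convolution layer (classification branch) and fourth convolution
    # layer (regression branch); both preserve the C channels of the
    # detection-frame features.
    conv3 = nn.Conv2d(C, C, kernel_size=3)
    conv4 = nn.Conv2d(C, C, kernel_size=3)

    third_feature = conv3(detection_feat)   # same channel count as input
    fourth_feature = conv4(detection_feat)  # same channel count as input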
206: obtaining classification weights and regression weights of the local region detector based on the features of the template frame.
In one implementation of the embodiments of the present disclosure, a convolution operation may be performed on the features of the template frame through the first convolution layer, and the first feature obtained by the convolution operation is used as the classification weight of the local region detector.
In one implementation of the embodiments of the present disclosure, a convolution operation may be performed on the features of the template frame through the second convolution layer, and the second feature obtained by the convolution operation is used as the regression weight of the local region detector.
There is no restriction on the execution order of operations 206 and 204; they may be performed simultaneously or in any order.
In an optional example, operation 206 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer run by the processor, respectively.
208: performing a convolution operation on the third feature with the classification weight to obtain the classification results of multiple candidate boxes; and performing a convolution operation on the fourth feature with the regression weight to obtain the regression results of the multiple candidate boxes.
Here, the classification results include the probability value of each candidate box being the detection box of the target object, and the regression results include the offset of each candidate box relative to the detection box corresponding to the template frame.
In an optional example, operation 208 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.
210: obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
In an optional example, operation 210 may be performed by a processor invoking corresponding instructions stored in a memory, or by the obtaining unit run by the processor.
In one implementation of the embodiments of the present disclosure, operation 108 or 210 may include: selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to its offsets to obtain the detection box of the target object in the detection frame.
In one optional example, selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results may be achieved as follows: selecting one candidate box according to weight coefficients of the classification results and the regression results; for example, computing a composite score for each candidate box as the sum of the product of its probability value and the weight coefficient of the classification results and the product of its offset and the weight coefficient of the regression results, and selecting one candidate box from the multiple candidate boxes according to these composite scores.
In another optional example, after the regression results are obtained through the above embodiments, the method may further include: adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results; for example, penalizing the probability values of candidate boxes with a large change in position (i.e., large positional movement) or a large change in size (i.e., large shape change), thereby lowering their probability values. Correspondingly, in this example, selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results may be achieved as follows: selecting one candidate box from the multiple candidate boxes according to the adjusted classification results; for example, selecting the candidate box with the highest adjusted probability value.
In an optional example, the above operation of adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results may be performed by a processor invoking corresponding instructions stored in a memory, or by an adjusting unit run by the processor.
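One plausible form of this penalized selection rule is sketched below; the exponential penalty and the anchor-offset parameterization (dx, dy scaled by the box size; dw, dh as log-scale changes) are assumptions of the sketch, not requirements of the disclosure:

    import numpy as np

    def select_box(probs, offsets, anchors, penalty_coef=0.1):
        """probs: (N,) probability of each candidate box being the target;
        offsets: (N, 4) dx, dy, dw, dh per candidate box;
        anchors: (N, 4) cx, cy, w, h of each candidate box."""
        # Penalize candidate boxes whose regression implies a large change
        # in position or size, lowering their probability values.
        change = np.abs(offsets).sum(axis=1)
        adjusted = probs * np.exp(-penalty_coef * change)
        i = int(np.argmax(adjusted))  # highest adjusted probability

        # Regress the selected candidate box by its offsets.
        cx = anchors[i, 0] + offsets[i, 0] * anchors[i, 2]
        cy = anchors[i, 1] + offsets[i, 1] * anchors[i, 3]
        w = anchors[i, 2] * np.exp(offsets[i, 2])
        h = anchors[i, 3] * np.exp(offsets[i, 3])
        return cx, cy, w, h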
FIG. 3 is a flowchart of one embodiment of the training method of the target detection network of the present disclosure. The target detection network of the embodiments of the present disclosure includes the neural network, the first convolution layer, and the second convolution layer of the embodiments of the present disclosure. As shown in FIG. 3, the training method of this embodiment includes:
302: separately extracting features of the template frame and the detection frame via the neural network.
Here, the template frame is a detection-box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target-object detection is to be performed, or a region image of the current frame that may contain the target object. The template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined.
In one implementation of the embodiments of the present disclosure, in operation 302, the features of the template frame and the detection frame may be extracted via the same neural network, or via different neural networks having the same structure.
In an optional example, operation 302 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.
304: performing a convolution operation on the features of the template frame through the first convolution layer, the first feature obtained by the convolution operation being used as the classification weight of the local region detector; and performing a convolution operation on the features of the template frame through the second convolution layer, the second feature obtained by the convolution operation being used as the regression weight of the local region detector.
In an optional example, operation 304 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer run by the processor, respectively.
306: inputting the features of the detection frame into the local region detector to obtain the classification results and regression results of multiple candidate boxes output by the local region detector.
Here, the classification results include the probability value of each candidate box being the detection box of the target object, and the regression results include the offset of each candidate box relative to the detection box corresponding to the template frame.
In one implementation of the embodiments of the present disclosure, operation 306 may include: performing a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes.
In an optional example, operation 306 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.
308: obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
In an optional example, operation 308 may be performed by a processor invoking corresponding instructions stored in a memory, or by the obtaining unit run by the processor.
310: using the obtained detection box of the target object in the detection frame as the predicted detection box, and training the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box.
In an optional example, operation 310 may be performed by a processor invoking corresponding instructions stored in a memory, or by a training unit run by the processor.
Based on the training method of the target detection network of the above embodiments of the present disclosure, features of the template frame and the detection frame are separately extracted via the neural network; classification and regression weights of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain the classification results and regression results of multiple candidate boxes output by the local region detector; the detection box of the target object in the detection frame is obtained according to these results; and the target detection network is trained based on the annotation information of the detection frame and the predicted detection box. With a target detection network trained according to the embodiments of the present disclosure, the same neural network, or neural networks having the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames vary less, which helps improve the accuracy of the detection result of the target object in the detection frame; since the classification and regression weights of the local region detector are obtained based on the features of the template frame, the local region detector can obtain the classification and regression results of multiple candidate boxes of the detection frame and then the detection box of the target object, so that changes in the position and size of the target object can be better estimated and the position of the target object in the detection frame can be found more precisely, which improves the speed and accuracy of target tracking, with good tracking performance and high speed.
Another embodiment of the training method of the present disclosure may further include: extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame;
inputting the features of the at least one other detection frame into the local region detector in sequence, to successively obtain the multiple candidate boxes in the at least one other detection frame output by the local region detector, as well as the classification results and regression results of each candidate box, i.e., performing operation 306 on the features of the at least one other detection frame in sequence;
obtaining the detection box of the target object in the at least one other detection frame in sequence according to the classification results and regression results of the multiple candidate boxes of the at least one other detection frame, i.e., performing operation 308 on the classification and regression results of the multiple candidate boxes of the at least one other detection frame in sequence.
In yet another embodiment of the training method of the present disclosure, when the detection frame is a region image of the current frame that may contain the target object, the method may further include: taking the center point of the template frame as the center point in advance, and cropping from the current frame a region image whose length and/or width correspondingly exceeds the image length and/or width of the template frame, as the detection frame.
FIG. 4 is a flowchart of another embodiment of the training method of the target detection network of the present disclosure. The target detection network of the embodiments of the present disclosure includes the neural network, the first convolution layer, the second convolution layer, the third convolution layer, and the fourth convolution layer of the embodiments of the present disclosure. As shown in FIG. 4, the training method of this embodiment includes:
402: separately extracting features of the template frame and the detection frame via the neural network.
Here, the template frame is a detection-box image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target-object detection is to be performed, or a region image of the current frame that may contain the target object. The template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined.
In one implementation of the embodiments of the present disclosure, in operation 402, the features of the template frame and the detection frame may be extracted via the same neural network, or via different neural networks having the same structure.
In an optional example, operation 402 may be performed by a processor invoking corresponding instructions stored in a memory, or by a neural network run by the processor.
404: performing a convolution operation on the features of the detection frame through the third convolution layer to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the features of the detection frame; and performing a convolution operation on the features of the detection frame through the fourth convolution layer to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the features of the detection frame.
In an optional example, operation 404 may be performed by a processor invoking corresponding instructions stored in a memory, or by the third convolution layer and the fourth convolution layer run by the processor, respectively.
406: performing a convolution operation on the features of the template frame through the first convolution layer, the first feature obtained by the convolution operation being used as the classification weight of the local region detector; and performing a convolution operation on the features of the template frame through the second convolution layer, the second feature obtained by the convolution operation being used as the regression weight of the local region detector.
There is no restriction on the execution order of operations 406 and 404; they may be performed simultaneously or in any order.
In an optional example, operation 406 may be performed by a processor invoking corresponding instructions stored in a memory, or by the first convolution layer and the second convolution layer run by the processor, respectively.
408: performing a convolution operation on the third feature with the classification weight to obtain the classification results of multiple candidate boxes; and performing a convolution operation on the fourth feature with the regression weight to obtain the regression results of the multiple candidate boxes.
Here, the classification results include the probability value of each candidate box being the detection box of the target object, and the regression results include the offset of each candidate box relative to the detection box corresponding to the template frame.
In an optional example, operation 408 may be performed by a processor invoking corresponding instructions stored in a memory, or by the local region detector run by the processor.
410: obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
In an optional example, operation 410 may be performed by a processor invoking corresponding instructions stored in a memory, or by the obtaining unit run by the processor.
412: using the obtained detection box of the target object in the detection frame as the predicted detection box, and adjusting the weight values of the neural network, the first convolution layer, and the second convolution layer according to the difference between the position and size of the annotated detection box of the target object in the detection frame and the position and size of the predicted detection box.
In an optional example, operation 412 may be performed by a processor invoking corresponding instructions stored in a memory, or by the training unit run by the processor.
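For illustration, one common way to realize such a training step is sketched below; the loss functions (cross-entropy for classification, smooth-L1 for regression) and the label layout are assumptions, since the disclosure only requires training on the difference between the annotated and predicted boxes:

    import torch.nn.functional as F

    def training_step(cls_logits, reg_out, cls_label, reg_target, optimizer):
        """cls_logits: (1, 2, k, H, W) per-candidate scores before softmax;
        cls_label: (1, k, H, W) long tensor, 1 where a candidate box matches
        the annotated detection box, else 0; reg_target: (1, 4k, H, W)
        offsets derived from the annotated box."""
        cls_loss = F.cross_entropy(cls_logits, cls_label)
        reg_loss = F.smooth_l1_loss(reg_out, reg_target)
        loss = cls_loss + reg_loss

        optimizer.zero_grad()
        loss.backward()
        # Adjusts the weight values of the backbone and of the first and
        # second convolution layers, since all are part of the graph.
        optimizer.step()
        return loss.item()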
In one implementation of the embodiments of the present disclosure, operation 308 or 410 may include: selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to its offsets to obtain the detection box of the target object in the detection frame.
In one optional example, selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results may be achieved as follows: selecting one candidate box according to weight coefficients of the classification results and the regression results; for example, computing a composite score for each candidate box from its probability value and its offset (e.g., as the sum of the product of the probability value and the weight coefficient of the classification results and the product of the offset and the weight coefficient of the regression results), and selecting from the multiple candidate boxes, according to these composite scores, a candidate box with a high probability value and a small offset.
In another optional example, after the regression results are obtained through the above embodiments, the method may further include: adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results; for example, penalizing the probability values of candidate boxes with a large change in position (i.e., large positional movement) or a large change in size (i.e., large shape change), thereby lowering their probability values. Correspondingly, in this example, selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results may be achieved as follows: selecting one candidate box from the multiple candidate boxes according to the adjusted classification results; for example, selecting the candidate box with the highest adjusted probability value.
In an optional example, the above operation of adjusting the probability values of the candidate boxes according to the changes in position and size in the regression results may be performed by a processor invoking corresponding instructions stored in a memory, or by the adjusting unit run by the processor.
In the embodiments of the present disclosure, the local region detector may include the third convolution layer, the fourth convolution layer, and two convolution operation units. After the local region detector is combined with the first convolution layer and the second convolution layer, the resulting local region detector may also be called a region proposal network.
Any target detection method or any training method of the target detection network provided by the embodiments of the present disclosure may be performed by any appropriate device with data processing capability, including but not limited to terminal devices and servers. Alternatively, any target detection method or any training method of the target detection network provided by the embodiments of the present disclosure may be performed by a processor; for example, the processor performs any target detection method or training method of the target detection network mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. This is not repeated below.
Those of ordinary skill in the art will appreciate that all or some of the steps implementing the above method embodiments may be completed by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and, when executed, performs the steps of the above method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
FIG. 5 is a schematic structural diagram of one embodiment of the target detection apparatus of the present disclosure. The target detection apparatus of the embodiments of the present disclosure can be used to implement the above target detection method embodiments of the present disclosure. As shown in FIG. 5, the target detection apparatus of this embodiment includes a neural network, a first convolution layer, a second convolution layer, a local region detector, and an obtaining unit, wherein:
the neural network is configured to separately extract features of the template frame and the detection frame, the template frame being a detection-box image of the target object and the image size of the template frame being smaller than that of the detection frame; the detection frame is the current frame on which target-object detection is to be performed, or a region image of the current frame that may contain the target object; the template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined. The neural networks extracting the features of the template frame and of the detection frame may be the same neural network, or different neural networks having the same structure;
the first convolution layer is configured to perform a convolution operation on the features of the template frame, the first feature obtained by the convolution operation being used as the classification weight of the local region detector;
the second convolution layer is configured to perform a convolution operation on the features of the template frame, the second feature obtained by the convolution operation being used as the regression weight of the local region detector;
the local region detector is configured to output the classification results and regression results of multiple candidate boxes according to the features of the detection frame; the classification results include the probability value of each candidate box being the detection box of the target object, and the regression results include the offset of each candidate box relative to the detection box corresponding to the template frame;
the obtaining unit is configured to obtain the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
Based on the target detection apparatus of the above embodiments of the present disclosure, features of the template frame and the detection frame are separately extracted via a neural network; classification and regression weights of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain the classification results and regression results of multiple candidate boxes output by the local region detector; and the detection box of the target object in the detection frame is obtained according to these results. In the embodiments of the present disclosure, the same neural network, or neural networks having the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames vary less, which helps improve the accuracy of the detection result of the target object in the detection frame; since the classification and regression weights of the local region detector are obtained based on the features of the template frame, the local region detector can obtain the classification and regression results of multiple candidate boxes of the detection frame and then the detection box of the target object, so that changes in the position and size of the target object can be better estimated and the position of the target object in the detection frame can be found more precisely, which improves the speed and accuracy of target tracking, with good tracking performance and high speed.
In one implementation of the embodiments of the target detection apparatus of the present disclosure, the local region detector is configured to: perform a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes; and perform a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes.
When the detection frame is a region image of the current frame that may contain the target object, another embodiment of the target detection apparatus of the present disclosure may further include: a preprocessing unit configured to take the center point of the template frame as the center point and crop from the current frame a region image whose length and/or width correspondingly exceeds the image length and/or width of the template frame, as the detection frame. FIG. 6 is a schematic structural diagram of another embodiment of the target detection apparatus of the present disclosure.
In addition, referring again to FIG. 6, yet another embodiment of the target detection apparatus of the present disclosure may further include: a third convolution layer configured to perform a convolution operation on the features of the detection frame to obtain a third feature whose number of channels is the same as that of the features of the detection frame. Correspondingly, in this embodiment, the local region detector is configured to perform a convolution operation on the third feature with the classification weight.
In addition, referring again to FIG. 6, still another embodiment of the target detection apparatus of the present disclosure may further include: a fourth convolution layer configured to perform a convolution operation on the features of the detection frame to obtain a fourth feature whose number of channels is the same as that of the features of the detection frame. Correspondingly, in this embodiment, the local region detector is configured to perform a convolution operation on the fourth feature with the regression weight.
In another implementation of the embodiments of the target detection apparatus of the present disclosure, the obtaining unit is configured to: select one candidate box from the multiple candidate boxes according to the classification results and the regression results, and regress the selected candidate box according to its offsets to obtain the detection box of the target object in the detection frame.
Exemplarily, when selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, the obtaining unit is configured to: select one candidate box from the multiple candidate boxes according to weight coefficients of the classification results and the regression results.
In addition, referring again to FIG. 6, a further embodiment of the target detection apparatus of the present disclosure may also include: an adjusting unit configured to adjust the classification results according to the regression results. Correspondingly, when selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, the obtaining unit is configured to select one candidate box from the multiple candidate boxes according to the adjusted classification results.
FIG. 7 is a schematic structural diagram of still another embodiment of the target detection apparatus of the present disclosure. The target detection apparatus of this embodiment can be used to implement any of the embodiments of the training method of the target detection network of FIG. 3 or FIG. 4 of the present disclosure. As shown in FIG. 7, compared with the embodiments shown in FIG. 5 or FIG. 6, the target detection apparatus of this embodiment further includes: a training unit configured to use the obtained detection box of the target object in the detection frame as the predicted detection box, and to train the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box.
In one implementation, the annotation information of the detection frame includes: the position and size of the annotated detection box of the target object in the detection frame. Correspondingly, in this implementation, the training unit is configured to adjust the weight values of the neural network, the first convolution layer, and the second convolution layer according to the difference between the position and size of the annotated detection box and the position and size of the predicted detection box.
Based on the above embodiments of the present disclosure, features of the template frame and the detection frame are separately extracted via the neural network; classification and regression weights of the local region detector are obtained based on the features of the template frame; the features of the detection frame are input into the local region detector to obtain the classification results and regression results of multiple candidate boxes; the detection box of the target object in the detection frame is obtained according to these results; and the target detection network is trained based on the annotation information of the detection frame and the predicted detection box. With a target detection network trained according to the embodiments of the present disclosure, the same neural network, or neural networks having the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames vary less, which helps improve the accuracy of the detection result of the target object in the detection frame; since the classification and regression weights of the local region detector are obtained based on the features of the template frame, the local region detector can obtain the classification and regression results of multiple candidate boxes of the detection frame and then the detection box of the target object, so that changes in the position and size of the target object can be better estimated and the position of the target object in the detection frame can be found more precisely, which improves the speed and accuracy of target tracking, with good tracking performance and high speed.
FIG. 8 is a schematic structural diagram of one application embodiment of the target detection apparatus of the present disclosure, and FIG. 9 is a schematic structural diagram of another application embodiment of the target detection apparatus of the present disclosure. In FIG. 8 and FIG. 9, in LxMxN (for example, 256x20x20), L denotes the number of channels, and M and N denote the height (i.e., length) and the width, respectively.
An embodiment of the present disclosure further provides an electronic device including the target detection apparatus of any of the above embodiments of the present disclosure.
An embodiment of the present disclosure further provides another electronic device, including: a memory configured to store executable instructions; and a processor configured to communicate with the memory to execute the executable instructions so as to complete the operations of the target detection method or of the training method of the target detection network of any of the above embodiments of the present disclosure.
FIG. 10 is a schematic structural diagram of one application embodiment of the electronic device of the present disclosure. Referring to FIG. 10, it shows a schematic structural diagram of an electronic device suitable for implementing the terminal device or server of the embodiments of the present application. As shown in FIG. 10, the electronic device includes one or more processors, a communication unit, and the like. The one or more processors are, for example, one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs); the processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage portion into a random access memory (RAM). The communication unit may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (Infiniband) network card. The processor may communicate with the ROM and/or the RAM to execute the executable instructions, is connected to the communication unit through a bus, and communicates with other target devices via the communication unit, thereby completing the operations corresponding to any method provided by the embodiments of the present application; for example: separately extracting features of a template frame and a detection frame via a neural network, where the template frame is a detection-box image of a target object and the image size of the template frame is smaller than that of the detection frame; obtaining classification weights and regression weights of a local region detector based on the features of the template frame; inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector; and obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector. As another example: separately extracting features of a template frame and a detection frame via a neural network, where the template frame is a detection-box image of a target object and the image size of the template frame is smaller than that of the detection frame; increasing the channels of the features of the template frame through a first convolution layer, the obtained first feature being used as the classification weight of the local region detector, and increasing the channels of the features of the template frame through a second convolution layer, the obtained second feature being used as the regression weight of the local region detector; inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector; obtaining the detection box of the target object in the detection frame according to these classification and regression results; and, using the obtained detection box of the target object in the detection frame as the predicted detection box, training the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box.
In addition, the RAM may also store various programs and data required for the operation of the apparatus. The CPU, the ROM, and the RAM are connected to one another through a bus. Where a RAM is present, the ROM is an optional module. The RAM stores executable instructions, or writes executable instructions into the ROM at runtime, and the executable instructions cause the processor to perform the operations corresponding to any of the above methods of the present disclosure. An input/output (I/O) interface is also connected to the bus. The communication unit may be integrated, or may be configured with multiple sub-modules (e.g., multiple IB network cards) linked on the bus.
The following components are connected to the I/O interface: an input portion including a keyboard, a mouse, and the like; an output portion including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage portion including a hard disk and the like; and a communication portion including a network interface card such as a LAN card or a modem. The communication portion performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read from it can be installed into the storage portion as needed.
It should be noted that the architecture shown in FIG. 10 is only an optional implementation; in specific practice, the number and types of the components in the above FIG. 10 may be selected, reduced, added, or replaced according to actual needs; in the configuration of different functional components, implementations such as separate configuration or integrated configuration may also be adopted; for example, the GPU and the CPU may be configured separately, or the GPU may be integrated on the CPU, and the communication unit may be configured separately, or integrated on the CPU or the GPU, and so on. These alternative implementations all fall within the protection scope of the present disclosure.
In addition, an embodiment of the present disclosure further provides a computer storage medium for storing computer-readable instructions which, when executed, implement the operations of the target detection method or of the training method of the target detection network of any of the above embodiments of the present disclosure.
In addition, an embodiment of the present disclosure further provides a computer program comprising computer-readable instructions; when the computer-readable instructions run in a device, a processor in the device executes executable instructions for implementing the steps of the target detection method or of the training method of the target detection network of any of the above embodiments of the present disclosure.
Embodiments of the present disclosure can perform single-target tracking. For example, in a multi-target tracking system, target detection need not be performed on every frame; instead, a fixed detection interval may be used, for example detecting once every 10 frames, with the 9 intermediate frames handled by single-target tracking to determine the target positions in the intermediate frames. Since the algorithm of the embodiments of the present disclosure is fast, the multi-target tracking system as a whole can complete tracking faster and achieve better results.
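A schematic of this interleaving is given below; detect_all_targets and track_single_target are hypothetical helpers standing in for a full multi-target detector and for the single-target tracker described above, and are not part of the disclosure:

    DETECT_EVERY = 10  # run full detection once every 10 frames

    def process_video(frames):
        tracks = []
        for i, frame in enumerate(frames):
            if i % DETECT_EVERY == 0:
                tracks = detect_all_targets(frame)  # hypothetical detector
            else:
                # The 9 intermediate frames are handled by single-target
                # tracking, one tracker per target.
                tracks = [track_single_target(frame, t) for t in tracks]
            yield tracks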
Those of ordinary skill in the art will appreciate that all or some of the steps implementing the above method embodiments may be completed by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and, when executed, performs the steps of the above method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may refer to one another. Since the system embodiments substantially correspond to the method embodiments, their description is relatively simple, and for related parts, reference may be made to the description of the method embodiments.
The methods and apparatuses of the present disclosure may be implemented in many ways, for example, by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is merely for illustration, and the steps of the methods of the present disclosure are not limited to the order described above unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, these programs comprising machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing programs for executing the methods according to the present disclosure.
The description of the present disclosure is given for the sake of example and description, and is not exhaustive or limiting of the present disclosure to the disclosed form. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were selected and described to better illustrate the principles and practical applications of the present disclosure, and to enable those of ordinary skill in the art to understand the present disclosure and thereby design various embodiments, with various modifications, suited to particular uses.

Claims (43)

  1. A target detection method, characterized by comprising:
    separately extracting features of a template frame and a detection frame via a neural network, wherein the template frame is a detection-box image of a target object, and the image size of the template frame is smaller than that of the detection frame;
    obtaining classification weights and regression weights of a local region detector based on the features of the template frame;
    inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector;
    obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
  2. The method according to claim 1, characterized by further comprising:
    extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame;
    inputting the features of the at least one other detection frame into the local region detector in sequence, to successively obtain the multiple candidate boxes in the at least one other detection frame output by the local region detector, as well as the classification results and regression results of each candidate box;
    obtaining the detection box of the target object in the at least one other detection frame in sequence according to the classification results and regression results of the multiple candidate boxes of the at least one other detection frame.
  3. The method according to claim 1 or 2, characterized in that separately extracting the features of the template frame and the detection frame via the neural network comprises:
    separately extracting the features of the template frame and the detection frame via the same neural network; or,
    separately extracting the features of the template frame and the detection frame via different neural networks having the same structure.
  4. The method according to any one of claims 1-3, characterized in that the template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined.
  5. The method according to any one of claims 1-4, characterized in that the detection frame is the current frame on which the target-object detection is to be performed, or a region image of the current frame that may contain the target object.
  6. The method according to claim 5, characterized in that, when the detection frame is a region image of the current frame that may contain the target object, the method further comprises:
    taking the center point of the template frame as the center point, cropping from the current frame a region image whose length and/or width correspondingly exceeds the image length and/or width of the template frame, as the detection frame.
  7. The method according to any one of claims 1-6, characterized in that obtaining the classification weight of the local region detector based on the features of the template frame comprises:
    performing a convolution operation on the features of the template frame through a first convolution layer, the first feature obtained by the convolution operation being used as the classification weight of the local region detector.
  8. The method according to any one of claims 1-7, characterized in that obtaining the regression weight of the local region detector based on the features of the template frame comprises:
    performing a convolution operation on the features of the template frame through a second convolution layer, the second feature obtained by the convolution operation being used as the regression weight of the local region detector.
  9. The method according to any one of claims 1-8, characterized in that inputting the features of the detection frame into the local region detector to obtain the classification results and regression results of the multiple candidate boxes output by the local region detector comprises:
    performing a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes.
  10. The method according to claim 9, characterized in that, after the features of the detection frame are extracted, the method further comprises: performing a convolution operation on the features of the detection frame through a third convolution layer to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the features of the detection frame;
    wherein performing a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes comprises: performing a convolution operation on the third feature with the classification weight to obtain the classification results of the multiple candidate boxes.
  11. The method according to claim 9 or 10, characterized in that, after the features of the detection frame are extracted, the method further comprises: performing a convolution operation on the features of the detection frame through a fourth convolution layer to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the features of the detection frame;
    wherein performing a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes comprises: performing a convolution operation on the fourth feature with the regression weight to obtain the regression results of the multiple candidate boxes.
  12. The method according to any one of claims 1-11, characterized in that obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector comprises:
    selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to its offsets to obtain the detection box of the target object in the detection frame.
  13. The method according to claim 12, characterized in that selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results comprises:
    selecting one candidate box from the multiple candidate boxes according to weight coefficients of the classification results and the regression results.
  14. The method according to claim 12, characterized in that, after the regression results are obtained, the method further comprises: adjusting the classification results according to the regression results;
    wherein selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the multiple candidate boxes according to the adjusted classification results.
  15. A training method of a target detection network, characterized by comprising:
    separately extracting features of a template frame and a detection frame via a neural network, wherein the template frame is a detection-box image of a target object, and the image size of the template frame is smaller than that of the detection frame;
    performing a convolution operation on the features of the template frame through a first convolution layer, the first feature obtained by the convolution operation being used as the classification weight of a local region detector; and performing a convolution operation on the features of the template frame through a second convolution layer, the second feature obtained by the convolution operation being used as the regression weight of the local region detector;
    inputting the features of the detection frame into the local region detector to obtain classification results and regression results of multiple candidate boxes output by the local region detector;
    obtaining a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector;
    using the obtained detection box of the target object in the detection frame as a predicted detection box, and training the neural network, the first convolution layer, and the second convolution layer based on annotation information of the detection frame and the predicted detection box.
  16. The method according to claim 15, characterized by further comprising:
    extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame;
    inputting the features of the at least one other detection frame into the local region detector in sequence, to successively obtain the multiple candidate boxes in the at least one other detection frame output by the local region detector, as well as the classification results and regression results of each candidate box;
    obtaining the detection box of the target object in the at least one other detection frame in sequence according to the classification results and regression results of the multiple candidate boxes of the at least one other detection frame.
  17. The method according to claim 15 or 16, characterized in that separately extracting the features of the template frame and the detection frame via the neural network comprises:
    separately extracting the features of the template frame and the detection frame via the same neural network; or,
    separately extracting the features of the template frame and the detection frame via different neural networks having the same structure.
  18. The method according to any one of claims 15-17, characterized in that the template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined.
  19. The method according to any one of claims 15-18, characterized in that the detection frame is the current frame on which the target-object detection is to be performed, or a region image of the current frame that may contain the target object.
  20. The method according to claim 19, characterized in that, when the detection frame is a region image of the current frame that may contain the target object, the method further comprises:
    taking the center point of the template frame as the center point, cropping from the current frame a region image whose length and/or width correspondingly exceeds the image length and/or width of the template frame, as the detection frame.
  21. The method according to any one of claims 15-20, characterized in that inputting the features of the detection frame into the local region detector to obtain the classification results and regression results of the multiple candidate boxes output by the local region detector comprises:
    performing a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes.
  22. The method according to claim 21, characterized in that, after the features of the detection frame are extracted, the method further comprises:
    performing a convolution operation on the features of the detection frame through a third convolution layer to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the features of the detection frame;
    wherein performing a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes comprises: performing a convolution operation on the third feature with the classification weight to obtain the classification results of the multiple candidate boxes.
  23. The method according to claim 21, characterized in that, after the features of the detection frame are extracted, the method further comprises:
    performing a convolution operation on the features of the detection frame through a fourth convolution layer to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the features of the detection frame;
    wherein performing a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes comprises: performing a convolution operation on the fourth feature with the regression weight to obtain the regression results of the multiple candidate boxes.
  24. The method according to any one of claims 15-23, characterized in that obtaining the detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector comprises:
    selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to its offsets to obtain the detection box of the target object in the detection frame.
  25. The method according to claim 24, characterized in that selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results comprises:
    selecting one candidate box from the multiple candidate boxes according to weight coefficients of the classification results and the regression results.
  26. The method according to claim 25, characterized in that, after the regression results are obtained, the method further comprises: adjusting the classification results according to the regression results;
    wherein selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the multiple candidate boxes according to the adjusted classification results.
  27. The method according to any one of claims 15-26, characterized in that the annotation information of the detection frame comprises: the position and size of the annotated detection box of the target object in the detection frame;
    wherein using the obtained detection box of the target object in the detection frame as the predicted detection box and training the neural network, the first convolution layer, and the second convolution layer based on the annotation information of the detection frame and the predicted detection box comprises:
    adjusting the weight values of the neural network, the first convolution layer, and the second convolution layer according to the difference between the position and size of the annotated detection box and the position and size of the predicted detection box.
  28. A target detection apparatus, characterized by comprising:
    a neural network configured to separately extract features of a template frame and a detection frame, wherein the template frame is a detection-box image of a target object, and the image size of the template frame is smaller than that of the detection frame;
    a first convolution layer configured to increase the channels of the features of the template frame, the obtained first feature being used as the classification weight of a local region detector;
    a second convolution layer configured to increase the channels of the features of the template frame, the obtained second feature being used as the regression weight of the local region detector;
    the local region detector, configured to output classification results and regression results of multiple candidate boxes according to the features of the detection frame;
    an obtaining unit configured to obtain a detection box of the target object in the detection frame according to the classification results and regression results of the multiple candidate boxes output by the local region detector.
  29. The apparatus according to claim 28, characterized in that the neural network comprises different neural networks having the same structure and configured to extract the features of the template frame and of the detection frame, respectively.
  30. The apparatus according to claim 28 or 29, characterized in that the template frame is a frame in the video sequence whose detection timing precedes the detection frame and in which the detection box of the target object has been determined.
  31. The apparatus according to any one of claims 28-30, characterized in that the detection frame is the current frame on which the target-object detection is to be performed, or a region image of the current frame that may contain the target object.
  32. The apparatus according to claim 31, characterized by further comprising:
    a preprocessing unit configured to take the center point of the template frame as the center point and crop from the current frame a region image whose length and/or width correspondingly exceeds the image length and/or width of the template frame, as the detection frame.
  33. The apparatus according to any one of claims 28-32, characterized in that the local region detector is configured to: perform a convolution operation on the features of the detection frame with the classification weight to obtain the classification results of the multiple candidate boxes; and perform a convolution operation on the features of the detection frame with the regression weight to obtain the regression results of the multiple candidate boxes.
  34. The apparatus according to claim 33, characterized by further comprising:
    a third convolution layer configured to perform a convolution operation on the features of the detection frame to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the features of the detection frame;
    the local region detector being configured to perform a convolution operation on the third feature with the classification weight.
  35. The apparatus according to claim 33, characterized by further comprising:
    a fourth convolution layer configured to perform a convolution operation on the features of the detection frame to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the features of the detection frame;
    the local region detector being configured to perform a convolution operation on the fourth feature with the regression weight.
  36. The apparatus according to any one of claims 28-35, characterized in that the obtaining unit is configured to: select one candidate box from the multiple candidate boxes according to the classification results and the regression results, and regress the selected candidate box according to its offsets to obtain the detection box of the target object in the detection frame.
  37. The apparatus according to claim 36, characterized in that, when selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, the obtaining unit is configured to: select one candidate box from the multiple candidate boxes according to weight coefficients of the classification results and the regression results.
  38. The apparatus according to claim 36, characterized by further comprising:
    an adjusting unit configured to adjust the classification results according to the regression results;
    wherein, when selecting one candidate box from the multiple candidate boxes according to the classification results and the regression results, the obtaining unit is configured to: select one candidate box from the multiple candidate boxes according to the adjusted classification results.
  39. The apparatus according to any one of claims 28-38, characterized by further comprising:
    a training unit configured to use the obtained detection box of the target object in the detection frame as a predicted detection box, and to train the neural network, the first convolution layer, and the second convolution layer based on annotation information of the detection frame and the predicted detection box.
  40. The apparatus according to claim 39, characterized in that the annotation information of the detection frame comprises: the position and size of the annotated detection box of the target object in the detection frame;
    the training unit being configured to adjust the weight values of the neural network, the first convolution layer, and the second convolution layer according to the difference between the position and size of the annotated detection box and the position and size of the predicted detection box.
  41. An electronic device, characterized by comprising the target detection apparatus according to any one of claims 28-40.
  42. An electronic device, characterized by comprising:
    a memory configured to store executable instructions; and
    a processor configured to communicate with the memory to execute the executable instructions so as to complete the operations of the method according to any one of claims 1-27.
  43. A computer storage medium configured to store computer-readable instructions, characterized in that, when the instructions are executed, the operations of the method according to any one of claims 1-27 are implemented.
PCT/CN2018/114884 2017-11-12 2018-11-09 Target detection method and apparatus, training method, electronic device and medium WO2019091464A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2020526040A JP7165731B2 (ja) 2017-11-12 2018-11-09 Target detection method and apparatus, training method, electronic device and medium
SG11202004324WA SG11202004324WA (en) 2017-11-12 2018-11-09 Target detection method and apparatus, training method, electronic device and medium
KR1020207016026A KR20200087784A (ko) 2017-11-12 2018-11-09 목표 검출 방법 및 장치, 트레이닝 방법, 전자 기기 및 매체
US16/868,427 US11455782B2 (en) 2017-11-12 2020-05-06 Target detection method and apparatus, training method, electronic device and medium
PH12020550588A PH12020550588A1 (en) 2017-11-12 2020-05-07 Target detection method and apparatus, training method, electronic device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711110587.1 2017-11-12
CN201711110587.1A CN108230359B (zh) 2017-11-12 2017-11-12 Target detection method and apparatus, training method, electronic device, program and medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/868,427 Continuation US11455782B2 (en) 2017-11-12 2020-05-06 Target detection method and apparatus, training method, electronic device and medium

Publications (1)

Publication Number Publication Date
WO2019091464A1 2019-05-16

Family

ID=62655730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114884 WO2019091464A1 (zh) 2017-11-12 2018-11-09 目标检测方法和装置、训练方法、电子设备和介质

Country Status (7)

Country Link
US (1) US11455782B2 (zh)
JP (1) JP7165731B2 (zh)
KR (1) KR20200087784A (zh)
CN (1) CN108230359B (zh)
PH (1) PH12020550588A1 (zh)
SG (1) SG11202004324WA (zh)
WO (1) WO2019091464A1 (zh)


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230359B (zh) 2017-11-12 2021-01-26 北京市商汤科技开发有限公司 目标检测方法和装置、训练方法、电子设备、程序和介质
US11430312B2 (en) * 2018-07-05 2022-08-30 Movidius Limited Video surveillance with neural networks
CN109584276B (zh) * 2018-12-04 2020-09-25 北京字节跳动网络技术有限公司 关键点检测方法、装置、设备及可读介质
CN109726683B (zh) * 2018-12-29 2021-06-22 北京市商汤科技开发有限公司 目标对象检测方法和装置、电子设备和存储介质
CN111435432B (zh) * 2019-01-15 2023-05-26 北京市商汤科技开发有限公司 网络优化方法及装置、图像处理方法及装置、存储介质
CN110136107B (zh) * 2019-05-07 2023-09-05 上海交通大学 基于dssd和时域约束x光冠脉造影序列自动分析方法
CN110598785B (zh) * 2019-09-11 2021-09-07 腾讯科技(深圳)有限公司 一种训练样本图像的生成方法及装置
US11080833B2 (en) * 2019-11-22 2021-08-03 Adobe Inc. Image manipulation using deep learning techniques in a patch matching operation
CN110942065B (zh) * 2019-11-26 2023-12-12 Oppo广东移动通信有限公司 文本框选方法、装置、终端设备及计算机可读存储介质
KR102311798B1 (ko) * 2019-12-12 2021-10-08 포항공과대학교 산학협력단 다중 객체 추적 방법 및 장치
JP7490359B2 (ja) * 2019-12-24 2024-05-27 キヤノン株式会社 情報処理装置、情報処理方法及びプログラム
CN111383244B (zh) * 2020-02-28 2023-09-01 浙江大华技术股份有限公司 一种目标检测跟踪方法
CN112215899B (zh) * 2020-09-18 2024-01-30 深圳市瑞立视多媒体科技有限公司 帧数据在线处理方法、装置和计算机设备
CN112381136B (zh) * 2020-11-12 2022-08-19 深兰智能科技(上海)有限公司 目标检测方法和装置
CN112464797B (zh) * 2020-11-25 2024-04-02 创新奇智(成都)科技有限公司 一种吸烟行为检测方法、装置、存储介质及电子设备
CN112465691A (zh) * 2020-11-25 2021-03-09 北京旷视科技有限公司 图像处理方法、装置、电子设备和计算机可读介质
CN112580474B (zh) * 2020-12-09 2021-09-24 云从科技集团股份有限公司 基于计算机视觉的目标对象检测方法、系统、设备及介质
CN112906478B (zh) * 2021-01-22 2024-01-09 北京百度网讯科技有限公司 目标对象的识别方法、装置、设备和存储介质
CN113128564B (zh) * 2021-03-23 2022-03-22 武汉泰沃滋信息技术有限公司 一种基于深度学习的复杂背景下典型目标检测方法及系统
CN113221962B (zh) * 2021-04-21 2022-06-21 哈尔滨工程大学 一种解耦分类与回归任务的三维点云单阶段目标检测方法
CN113076923A (zh) * 2021-04-21 2021-07-06 山东大学 基于轻量型网络MobileNet-SSD的口罩佩戴检测方法、设备及存储介质
CN113065618A (zh) * 2021-06-03 2021-07-02 常州微亿智造科技有限公司 工业质检中的检测方法、检测装置


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
US20070230792A1 (en) * 2004-04-08 2007-10-04 Mobileye Technologies Ltd. Pedestrian Detection
CN104424634B (zh) * 2013-08-23 2017-05-03 株式会社理光 对象跟踪方法和装置
CN105900116A (zh) * 2014-02-10 2016-08-24 三菱电机株式会社 分层型神经网络装置、判别器学习方法以及判别方法
CN105740910A (zh) * 2016-02-02 2016-07-06 北京格灵深瞳信息技术有限公司 一种车辆物件检测方法及装置
JP6832504B2 (ja) 2016-08-08 2021-02-24 パナソニックIpマネジメント株式会社 物体追跡方法、物体追跡装置およびプログラム
CN106650630B (zh) * 2016-11-11 2019-08-23 纳恩博(北京)科技有限公司 一种目标跟踪方法及电子设备
CN106709936A (zh) * 2016-12-14 2017-05-24 北京工业大学 一种基于卷积神经网络的单目标跟踪方法
CN107066990B (zh) * 2017-05-04 2019-10-11 厦门美图之家科技有限公司 一种目标跟踪方法及移动设备
CN109726683B (zh) * 2018-12-29 2021-06-22 北京市商汤科技开发有限公司 目标对象检测方法和装置、电子设备和存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355188A (zh) * 2015-07-13 2017-01-25 阿里巴巴集团控股有限公司 图像检测方法及装置
EP3229206A1 (en) * 2016-04-04 2017-10-11 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN105976400A (zh) * 2016-05-10 2016-09-28 北京旷视科技有限公司 基于神经网络模型的目标跟踪方法及装置
CN106326837A (zh) * 2016-08-09 2017-01-11 北京旷视科技有限公司 对象追踪方法和装置
CN108230359A (zh) * 2017-11-12 2018-06-29 北京市商汤科技开发有限公司 目标检测方法和装置、训练方法、电子设备、程序和介质

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399900A (zh) * 2019-06-26 2019-11-01 腾讯科技(深圳)有限公司 对象检测方法、装置、设备及介质
CN110533184A (zh) * 2019-08-31 2019-12-03 南京人工智能高等研究院有限公司 一种网络模型的训练方法及装置
JP2022511221A (ja) * 2019-09-24 2022-01-31 北京市商▲湯▼科技▲開▼▲発▼有限公司 画像処理方法、画像処理装置、プロセッサ、電子機器、記憶媒体及びコンピュータプログラム
JP7108123B2 (ja) 2019-09-24 2022-07-27 北京市商▲湯▼科技▲開▼▲発▼有限公司 画像処理方法、画像処理装置、プロセッサ、電子機器、記憶媒体及びコンピュータプログラム
US11429809B2 (en) 2019-09-24 2022-08-30 Beijing Sensetime Technology Development Co., Ltd Image processing method, image processing device, and storage medium
JP7274048B2 (ja) 2019-11-20 2023-05-15 テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド 動作認識方法、装置、コンピュータプログラム及びコンピュータデバイス
US11928893B2 (en) 2019-11-20 2024-03-12 Tencent Technology (Shenzhen) Company Limited Action recognition method and apparatus, computer storage medium, and computer device
JP2022551396A (ja) * 2019-11-20 2022-12-09 テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド 動作認識方法、装置、コンピュータプログラム及びコンピュータデバイス
CN111898701A (zh) * 2020-08-13 2020-11-06 网易(杭州)网络有限公司 模型训练、帧图像生成、插帧方法、装置、设备及介质
CN111898701B (zh) * 2020-08-13 2023-07-25 网易(杭州)网络有限公司 模型训练、帧图像生成、插帧方法、装置、设备及介质
CN112465868A (zh) * 2020-11-30 2021-03-09 浙江大华汽车技术有限公司 一种目标检测跟踪方法、装置、存储介质及电子装置
CN112465868B (zh) * 2020-11-30 2024-01-12 浙江华锐捷技术有限公司 一种目标检测跟踪方法、装置、存储介质及电子装置
CN112528932A (zh) * 2020-12-22 2021-03-19 北京百度网讯科技有限公司 用于优化位置信息的方法、装置、路侧设备和云控平台
CN112528932B (zh) * 2020-12-22 2023-12-08 阿波罗智联(北京)科技有限公司 用于优化位置信息的方法、装置、路侧设备和云控平台
CN113160247B (zh) * 2021-04-22 2022-07-05 福州大学 基于频率分离的抗噪孪生网络目标跟踪方法
CN113160247A (zh) * 2021-04-22 2021-07-23 福州大学 基于频率分离的抗噪孪生网络目标跟踪方法
CN113327253A (zh) * 2021-05-24 2021-08-31 北京市遥感信息研究所 一种基于星载红外遥感影像的弱小目标检测方法
CN113327253B (zh) * 2021-05-24 2024-05-24 北京市遥感信息研究所 一种基于星载红外遥感影像的弱小目标检测方法

Also Published As

Publication number Publication date
PH12020550588A1 (en) 2021-04-26
SG11202004324WA (en) 2020-06-29
US20200265255A1 (en) 2020-08-20
JP7165731B2 (ja) 2022-11-04
JP2021502645A (ja) 2021-01-28
KR20200087784A (ko) 2020-07-21
CN108230359B (zh) 2021-01-26
US11455782B2 (en) 2022-09-27
CN108230359A (zh) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2019091464A1 (zh) 目标检测方法和装置、训练方法、电子设备和介质
TWI773189B (zh) 基於人工智慧的物體檢測方法、裝置、設備及儲存媒體
WO2018099473A1 (zh) 场景分析方法和系统、电子设备
WO2020134557A1 (zh) 目标对象检测方法和装置、电子设备和存储介质
JP6397144B2 (ja) 画像からの事業発見
WO2018019126A1 (zh) 视频类别识别方法和装置、数据处理装置和电子设备
US10769496B2 (en) Logo detection
WO2019105337A1 (zh) 基于视频的人脸识别方法、装置、设备、介质及程序
CN108154222B (zh) 深度神经网络训练方法和系统、电子设备
WO2018054329A1 (zh) 物体检测方法和装置、电子设备、计算机程序和存储介质
WO2018121737A1 (zh) 关键点预测、网络训练及图像处理方法和装置、电子设备
US20150010203A1 (en) Methods, apparatuses and computer program products for performing accurate pose estimation of objects
US20210124928A1 (en) Object tracking methods and apparatuses, electronic devices and storage media
WO2019020062A1 (zh) 视频物体分割方法和装置、电子设备、存储介质和程序
CN113971751A (zh) 训练特征提取模型、检测相似图像的方法和装置
US9129152B2 (en) Exemplar-based feature weighting
WO2022161302A1 (zh) 动作识别方法、装置、设备、存储介质及计算机程序产品
US20240153240A1 (en) Image processing method, apparatus, computing device, and medium
CN113569740B (zh) 视频识别模型训练方法与装置、视频识别方法与装置
WO2022143366A1 (zh) 图像处理方法、装置、电子设备、介质及计算机程序产品
CN108154153B (zh) 场景分析方法和系统、电子设备
WO2019170024A1 (zh) 目标跟踪方法和装置、电子设备、存储介质
US20200357137A1 (en) Determining a Pose of an Object in the Surroundings of the Object by Means of Multi-Task Learning
US9081800B2 (en) Object detection via visual search
CN108229320B (zh) 选帧方法和装置、电子设备、程序和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18876911

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020526040

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20207016026

Country of ref document: KR

Kind code of ref document: A

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.09.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18876911

Country of ref document: EP

Kind code of ref document: A1