WO2019091464A1 - Object detection method and apparatus, training method, electronic device and medium
- Publication number: WO2019091464A1 (application PCT/CN2018/114884)
- Authority: WIPO (PCT)
- Prior art keywords: frame, detection frame, feature, detection, regression
- Prior art date
Classifications
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
- G06V10/255—Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/82—Image or video recognition or understanding using neural networks
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06V2201/07—Target detection
- the present disclosure relates to computer vision technology, and more particularly to an object detection method and apparatus, training method, electronic device, and medium.
- Single-target tracking is an important issue in the field of artificial intelligence, and can be used in a series of tasks such as automatic driving and multi-target tracking.
- the main task of single-target tracking is to specify a target to be tracked in a certain frame image of a video sequence, and to track the specified target in the subsequent frame image.
- Embodiments of the present disclosure provide a technical solution for performing target tracking.
- a target tracking method including:
- a training method for a target detection network including:
- taking the detection frame of the target object obtained in the detection frame as a prediction detection frame, and training the neural network, the first convolution layer and the second convolution layer based on the annotation information of the detection frame and the prediction detection frame.
- a target detecting apparatus including:
- a neural network configured to respectively extract features of a template frame and a detection frame, wherein the template frame is a detection frame image of the target object, and the image size of the template frame is smaller than that of the detection frame;
- a first convolution layer, configured to increase the number of channels of the features of the template frame to obtain a first feature used as the classification weight of the local area detector;
- a second convolution layer, configured to increase the number of channels of the features of the template frame to obtain a second feature used as the regression weight of the local area detector;
- a local area detector, configured to output classification results and regression results of multiple candidate boxes according to the features of the detection frame;
- an obtaining unit configured to acquire, according to the classification result and the regression result of the multiple candidate frames output by the local area detector, a detection frame of the target object in the detection frame.
- an electronic device including the object detecting device of any one of the embodiments of the present disclosure.
- another electronic device including:
- a memory for storing executable instructions
- a processor for communicating with the memory to execute the executable instructions to perform the operations of the method of any of the embodiments of the present disclosure.
- a computer storage medium for storing computer readable instructions that, when executed, implement the operations of the method of any of the embodiments of the present disclosure.
- a computer program comprising computer readable instructions which, when run in a device, cause a processor in the device to execute instructions implementing the steps of the method of any of the embodiments of the present disclosure.
- the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the features of the detection frame are input into the local area detector to obtain the classification results and regression results of multiple candidate boxes of the detection frame. Using the same neural network, or neural networks with the same structure, makes it easier to extract similar features of the same target object, so that the features of the target object extracted in different frames change little, which helps to improve the accuracy of the target object detection result in the detection frame. Since the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the detection frame of the target object is obtained from the classification results and regression results of the multiple candidate boxes, the position and size change of the target object can be better estimated and the position of the target object in the detection frame found more accurately, thereby improving both the speed and the accuracy of target tracking.
- FIG. 1 is a flow chart of an embodiment of an object detection method of the present disclosure.
- FIG. 2 is a flow chart of another embodiment of the object detection method of the present disclosure.
- FIG. 3 is a flow chart of an embodiment of a training method for a target detection network of the present disclosure.
- FIG. 4 is a flow chart of another embodiment of a training method for a target detection network of the present disclosure.
- FIG. 5 is a schematic structural diagram of an embodiment of an object detecting apparatus according to the present disclosure.
- FIG. 6 is a schematic structural view of another embodiment of an object detecting apparatus according to the present disclosure.
- FIG. 7 is a schematic structural view of still another embodiment of the object detecting device of the present disclosure.
- FIG. 8 is a schematic structural diagram of an application embodiment of an object detecting apparatus according to the present disclosure.
- FIG. 9 is a schematic structural diagram of another application embodiment of the object detecting apparatus of the present disclosure.
- FIG. 10 is a schematic structural diagram of an application embodiment of an electronic device according to the present disclosure.
- a plurality may mean two or more, and “at least one” may mean one, two or more.
- Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations.
- Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, servers, and the like include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments including any of the above.
- Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
- program modules may include routines, programs, objects, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
- the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
- program modules may be located on a local or remote computing system storage medium including storage devices.
- the target detection method of this embodiment includes:
- the template frame is the detection frame image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target object detection is to be performed, or an area image in the current frame that may include the target object.
- optionally, when the detection frame is an area image that may include the target object in the current frame, the area image is larger than the template frame image; for example, it may be centered on the center point of the template frame image and be 2 to 4 times the size of the template frame image.
- optionally, the template frame is a frame in the video sequence whose detection timing is before that of the detection frame and in which the detection frame of the target object has been determined; it may be the start frame in the video sequence from which target tracking is to be performed.
- the position of the starting frame in the sequence of video frames is very flexible, for example it can be the first frame or any intermediate frame in the sequence of video frames.
- the detection frame is a frame on which target tracking needs to be performed; after the detection frame of the target object is determined in the detection frame image, the image corresponding to that detection frame can be used as the template frame image for the next detection frame.
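As an illustration of the search-region cropping described above, the following NumPy sketch crops an area image centered on the previous detection frame and 2 to 4 times its size; the function name, box convention, and `scale` default are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def crop_search_region(frame, prev_box, scale=2.0):
    """Crop the area image used as the detection frame.

    frame:    H x W x C image array for the current frame.
    prev_box: (cx, cy, w, h) of the target's detection frame in the
              previous frame, in pixels.
    scale:    how much larger the region is than the target box;
              the text above suggests 2-4x the template size.
    """
    cx, cy, w, h = prev_box
    rw, rh = int(w * scale), int(h * scale)
    x0 = max(int(cx - rw / 2), 0)        # clip to the frame borders
    y0 = max(int(cy - rh / 2), 0)
    x1 = min(x0 + rw, frame.shape[1])
    y1 = min(y0 + rh, frame.shape[0])
    return frame[y0:y1, x0:x1]

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy 640x480 frame
region = crop_search_region(frame, (320, 240, 100, 80), scale=2.0)
# region is a 200-wide by 160-tall crop centred on the previous box
```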
- optionally, the features of the template frame and the detection frame may be extracted through the same neural network, or respectively extracted by different neural networks having the same structure.
- the operation 102 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
- the feature of the template frame may be convoluted by the first convolutional layer, and the first feature obtained by the convolution operation is used as the classification weight of the local area detector.
- optionally, the classification weight of the local area detector may be obtained by increasing the number of channels of the features of the template frame via the first convolution layer to obtain the first feature, where the number of channels of the first feature is 2k times the number of channels of the features of the template frame, and k is an integer greater than 0.
- the feature of the template frame may be convoluted by the second convolution layer, and the second feature obtained by the convolution operation is used as the regression weight of the local area detector.
- optionally, the regression weight of the local area detector may be obtained by increasing the number of channels of the features of the template frame via the second convolution layer to obtain the second feature, where the number of channels of the second feature is 4k times the number of channels of the features of the template frame, and k is an integer greater than 0.
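A minimal NumPy sketch of the channel-increasing convolutions just described; the feature-map sizes, kernel size, and the values of k and C are illustrative assumptions, while the 2kC and 4kC channel counts follow the text:

```python
import numpy as np

k, C = 5, 64                          # k candidate boxes per position, C channels
tpl_feat = np.random.randn(C, 6, 6)   # template-frame feature map (assumed size)

def conv2d(x, w):
    """Plain 'valid' cross-correlation; loops kept for clarity."""
    cout, cin, kh, kw = w.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((cout, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.tensordot(w, x[:, i:i + kh, j:j + kw], axes=3)
    return out

# First convolution layer: C -> 2kC channels (classification weight).
w_cls = np.random.randn(2 * k * C, C, 3, 3) * 0.01
# Second convolution layer: C -> 4kC channels (regression weight).
w_reg = np.random.randn(4 * k * C, C, 3, 3) * 0.01

cls_weight = conv2d(tpl_feat, w_cls)  # shape (2kC, 4, 4)
reg_weight = conv2d(tpl_feat, w_reg)  # shape (4kC, 4, 4)
```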
- the operation 104 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
- optionally, the classification result includes, for each candidate box, a probability value that the candidate box is the detection frame of the target object, and the regression result includes an offset of each candidate box relative to the detection frame corresponding to the template frame.
- the multiple candidate boxes may include: K candidate boxes at each position in the detection frame.
- K is a preset integer greater than one.
- the aspect ratios of the K candidate boxes differ from one another; for example, the ratios of length to width may include 1:1, 2:1, 1:2, 3:1, 1:3, and so on.
- the classification result is used to indicate the probability that each of the K candidate boxes at each position is the detection frame of the target object.
- optionally, the method further includes: normalizing the classification result so that the probability values of whether each candidate box is the detection frame of the target object sum to 1, which helps to determine whether each candidate box is the detection frame of the target object.
- optionally, the regression result includes the offsets of the K candidate boxes at each position in the detection frame image relative to the detection frame of the target object in the template frame, where an offset may include amounts of change in position and size; the position may be the position of the center point, the positions of the four vertices of the candidate box, and the like. The offset of each candidate box relative to the detection frame of the target object in the template frame may include, for example, offsets of the abscissa and ordinate of the center point position and amounts of change in width and height.
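The candidate-box geometry and the normalization step above can be sketched as follows; the base size and the equal-area construction of the boxes are assumptions for illustration, while the K aspect ratios and the sum-to-1 probabilities follow the text:

```python
import numpy as np

def make_anchors(base=64, ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """K candidate boxes per position, one per length:width ratio.

    All boxes share the area base*base; the ratios follow the 1:1,
    2:1, 1:2, 3:1, 1:3 examples in the text (r = length / width).
    """
    anchors = []
    for r in ratios:
        w = base / np.sqrt(r)
        h = base * np.sqrt(r)
        anchors.append((h, w))        # (length, width)
    return np.array(anchors)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

anchors = make_anchors()              # K = 5 candidate boxes
# Two raw scores per candidate box (target / not target) -> normalised
# probability values that sum to 1, as in the step described above.
probs = softmax(np.random.randn(len(anchors), 2), axis=-1)
```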
- optionally, the operation 106 may include: performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the multiple candidate boxes.
- the operation 106 may be performed by a processor invoking a corresponding instruction stored in a memory or by a local area detector being executed by the processor.
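Operation 106 can be sketched as using the template-derived weights as convolution kernels over the detection-frame features. The feature-map and kernel sizes here are assumptions; only the channel bookkeeping (2k classification channels and 4k regression channels per position) follows the text:

```python
import numpy as np

k, C = 5, 64
det_feat = np.random.randn(C, 20, 20)          # detection-frame features (assumed size)
cls_kernel = np.random.randn(2 * k, C, 4, 4)   # classification weight, one kernel group per box
reg_kernel = np.random.randn(4 * k, C, 4, 4)   # regression weight, one kernel group per box

def conv2d(x, w):
    """'Valid' cross-correlation of (C, H, W) features with (O, C, kh, kw) kernels."""
    o, c, kh, kw = w.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((o, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.tensordot(w, x[:, i:i + kh, j:j + kw], axes=3)
    return out

cls_out = conv2d(det_feat, cls_kernel)  # (2k, 17, 17): classification result per position
reg_out = conv2d(det_feat, reg_kernel)  # (4k, 17, 17): regression result per position
```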
- the operation 108 may be performed by a processor invoking a corresponding instruction stored in a memory or by an acquisition unit executed by the processor.
- the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the features of the detection frame are input into the local area detector. Using the same neural network, or neural networks with the same structure, makes it easier to extract similar features of the same target object, so that the features of the target object extracted in different frames change little, which helps to improve the accuracy of the target object detection result in the detection frame. Since the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the local area detector yields the classification results and regression results of the multiple candidate boxes of the detection frame from which the detection frame of the target object is obtained, the position and size change of the target object can be better estimated and the position of the target object in the detection frame found more accurately, thereby improving both the speed and the accuracy of target tracking.
- based on the template frame, the local area detector of the embodiments of the present disclosure can quickly generate a large number of candidate boxes in the detection frame and obtain the offsets of the K candidate boxes at each position in the detection frame relative to the detection frame of the target object in the template frame, so the position and size of the target object can be better estimated and the position of the target object in the detection frame found more accurately, thereby improving the speed and accuracy of target tracking.
- optionally, the method may further include: extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame, and performing operation 108 according to the classification results and regression results of the candidate boxes.
- optionally, when the detection frame is an area image that may include the target object in the current frame, the method may further include: taking as the detection frame an area image in the current frame that is centered on the center point of the template frame and whose length and/or width is larger than that of the template frame image.
- the target detection method of this embodiment includes:
- the template frame is the detection frame image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target object detection is to be performed, or an area image in the current frame that may include the target object.
- the template frame is a frame in the video sequence whose detection timing is before the detection frame and determined by the detection frame of the target object.
- optionally, the features of the template frame and the detection frame may be extracted through the same neural network, or respectively extracted by different neural networks having the same structure.
- the operation 202 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
- 204, perform a convolution operation on the features of the detection frame via a third convolution layer to obtain a third feature, where the number of channels of the third feature is the same as the number of channels of the features of the detection frame; and perform a convolution operation on the features of the detection frame via a fourth convolution layer to obtain a fourth feature, the number of channels of which is also the same as the number of channels of the features of the detection frame.
- the operation 204 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a third convolutional layer and a fourth convolutional layer, respectively, executed by the processor.
- the feature of the template frame may be convoluted by the first convolutional layer, and the first feature obtained by the convolution operation is used as the classification weight of the local area detector.
- the feature of the template frame may be convoluted by the second convolution layer, and the second feature obtained by the convolution operation is used as the regression weight of the local area detector.
- the operation 206 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
- optionally, the classification result includes, for each candidate box, a probability value that the candidate box is the detection frame of the target object, and the regression result includes an offset of each candidate box relative to the detection frame corresponding to the template frame.
- the operation 208 may be performed by a processor invoking a corresponding instruction stored in a memory, or by a local area detector being executed by the processor.
- the operation 210 may be performed by a processor invoking a corresponding instruction stored in a memory, or by an acquisition unit executed by the processor.
- optionally, the operation 108 or 210 may include: selecting a candidate box from the multiple candidate boxes according to the classification results and the regression results, and performing regression on the selected candidate box according to its offset to obtain the detection frame of the target object in the detection frame.
- optionally, selecting a candidate box from the multiple candidate boxes according to the classification results and the regression results may be implemented as follows: a candidate box is selected from the multiple candidate boxes according to weight coefficients of the classification result and the regression result; for example, a composite score is calculated for each candidate box as the sum of the product of its probability value and the weight coefficient of the classification result and the product of its offset and the weight coefficient of the regression result, and a candidate box is then selected from the multiple candidate boxes according to the composite scores of the multiple candidate boxes.
- optionally, the method further includes: adjusting the probability values of the candidate boxes according to the amounts of change in position and size in the regression result; for example, the probability value of a candidate box with a large amount of change in position (i.e., a large positional movement) or a large amount of change in size (i.e., a large change in shape) is penalized, i.e., lowered. Selecting a candidate box from the multiple candidate boxes may then be implemented as follows: according to the adjusted classification result, the candidate box with the highest adjusted probability value is selected from the multiple candidate boxes.
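A sketch of this selection step with the penalty just described; the exponential penalty form and the `penalty_k` hyper-parameter are assumptions for illustration, not values specified in the text:

```python
import numpy as np

def select_box(probs, offsets, penalty_k=0.2):
    """Pick one candidate box, penalising large position/size changes.

    probs:   (N,) probability that each candidate box is the target.
    offsets: (N, 4) offsets (dx, dy, dw, dh) relative to the template
             frame's detection frame.
    """
    change = np.abs(offsets).sum(axis=1)            # movement + shape change
    adjusted = probs * np.exp(-penalty_k * change)  # big change -> lower probability
    return int(np.argmax(adjusted)), adjusted

probs = np.array([0.9, 0.8, 0.3])
offsets = np.array([[5.0, 5.0, 2.0, 2.0],   # highest raw score, but moves a lot
                    [0.5, 0.5, 0.1, 0.1],   # slightly lower score, stable
                    [0.0, 0.0, 0.0, 0.0]])
best, adjusted = select_box(probs, offsets)
# the stable second box wins despite the first box's higher raw score
```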
- optionally, the above operation of adjusting the probability values of the candidate boxes according to the amounts of change in position and size in the regression result may be performed by a processor invoking a corresponding instruction stored in a memory, or by an adjustment unit executed by the processor.
- the target detection network of the embodiment of the present disclosure includes the neural network, the first convolutional layer, and the second convolutional layer of the embodiments of the present disclosure.
- the training method of this embodiment includes:
- the template frame is the detection frame image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target object detection is to be performed, or an area image in the current frame that may include the target object.
- the template frame is a frame in the video sequence whose detection timing is before the detection frame and determined by the detection frame of the target object.
- optionally, the features of the template frame and the detection frame may be extracted through the same neural network, or respectively extracted by different neural networks having the same structure.
- the operation 302 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
- the operation 304 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
- optionally, the classification result includes, for each candidate box, a probability value that the candidate box is the detection frame of the target object, and the regression result includes an offset of each candidate box relative to the detection frame corresponding to the template frame.
- optionally, the operation 306 may include: performing a convolution operation on the features of the detection frame using the classification weight to obtain the classification results of the multiple candidate boxes; and performing a convolution operation on the features of the detection frame using the regression weight to obtain the regression results of the multiple candidate boxes.
- the operation 306 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by an area detector operated by the processor.
- the operation 308 may be performed by a processor invoking a corresponding instruction stored in a memory or by an acquisition unit executed by the processor.
- the obtained detection frame of the target object in the detection frame is used as a prediction detection frame, and the neural network, the first convolutional layer, and the second convolution layer are trained based on the annotation information of the detection frame and the prediction detection frame.
- the operation 310 may be performed by a processor invoking a corresponding instruction stored in a memory or by a training unit executed by the processor.
- in this embodiment, the features of the template frame and the detection frame are respectively extracted by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, the features of the detection frame are input into the local area detector to obtain the classification results and regression results of the multiple candidate boxes output by the local area detector, the detection frame of the target object in the detection frame is obtained according to those classification results and regression results, and the target detection network is trained based on the annotation information of the detection frame and the prediction detection frame. Using the same neural network, or neural networks with the same structure, makes it easier to extract similar features of the same target object, so that the features of the target object extracted in different frames change little, which helps to improve the accuracy of the target object detection result in the detection frame; since the classification weight and the regression weight of the local area detector are obtained based on the features of the template frame, and the local area detector yields the classification results and regression results of the multiple candidate boxes of the detection frame, the position and size change of the target object can be better estimated and the position of the target object in the detection frame found more accurately, thereby improving both the speed and the accuracy of target tracking.
- the method further includes: extracting, by the neural network, a feature of the at least one other detection frame whose timing is located after the detection frame in the video sequence;
- optionally, the method may further include: taking, in advance, as the detection frame an area image in the current frame that is centered on the center point of the template frame and whose length and/or width is larger than that of the template frame image.
- the target detection network of the embodiment of the present disclosure includes the neural network, the first convolutional layer, the second convolutional layer, the third convolutional layer, and the fourth convolutional layer of the embodiments of the present disclosure.
- the training method of this embodiment includes:
- the template frame is the detection frame image of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame on which target object detection is to be performed, or an area image in the current frame that may include the target object.
- the template frame is a frame in the video sequence whose detection timing is before the detection frame and determined by the detection frame of the target object.
- optionally, the features of the template frame and the detection frame may be extracted through the same neural network, or respectively extracted by different neural networks having the same structure.
- the operation 402 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a neural network operated by the processor.
- the operation 404 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a third convolutional layer and a fourth convolutional layer, respectively, executed by the processor.
- the operation 406 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional layer and a second convolutional layer, respectively, executed by the processor.
- the classification result includes, for each candidate box, the probability value that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
- the operation 408 may be performed by a processor invoking a corresponding instruction stored in a memory or by a local area detector being executed by the processor.
- the operation 410 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first feature extraction unit 701 that is executed by the processor.
- the detection box of the target object obtained in the detection frame is used as the predicted detection box, and the weight values of the neural network, the first convolutional layer, and the second convolutional layer are adjusted according to the differences between the position and size of the annotated detection box of the target object in the detection frame and the position and size of the predicted detection box.
- the operation 412 may be performed by a processor invoking a corresponding instruction stored in a memory or by a training unit executed by the processor.
- the operation 308 or 410 may include: selecting one candidate box from the plurality of candidate boxes according to the classification result and the regression result, and regressing the selected candidate box according to its offset, to obtain the detection box of the target object in the detection frame.
- when selecting one candidate box from the multiple candidate boxes according to the classification result and the regression result, this may be implemented as follows: select a candidate box from the multiple candidate boxes according to weight coefficients of the classification result and the regression result; for example, for each candidate box, compute a comprehensive score from the product of its probability value and the weight coefficient of the classification result together with the product of its offset and the weight coefficient of the regression result, and then, according to the comprehensive scores of the multiple candidate boxes, select a candidate box with a high probability value and a small offset.
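The comprehensive-score selection above can be sketched as follows; the linear combination and the weight coefficient values are one possible reading of the embodiment, not a formula fixed by the disclosure.

```python
def select_candidate(probs, offsets, w_cls=1.0, w_reg=0.5):
    """Pick the candidate box with the best comprehensive score: a high
    classification probability and a small regression offset.
    w_cls / w_reg are the weight coefficients of the classification and
    regression results; their values here are illustrative assumptions."""
    best_idx, best_score = None, float("-inf")
    for i, (p, off) in enumerate(zip(probs, offsets)):
        # offset magnitude summarised as the sum of absolute components
        off_mag = sum(abs(o) for o in off)
        score = w_cls * p - w_reg * off_mag  # penalise large offsets
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```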
- the method further includes: adjusting the probability values of the candidate boxes according to the amounts of change in position and size in the regression result.
- this may be implemented as follows: according to the adjusted classification result, select one candidate box from the multiple candidate boxes; for example, select the candidate box with the highest adjusted probability value.
- the above operation of adjusting the probability values of the candidate boxes according to the amounts of change in position and size in the regression result may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by an adjustment unit executed by the processor.
- the operation 308 or 410 may include: selecting one candidate box from the plurality of candidate boxes according to the classification result and the regression result, and regressing the selected candidate box according to its offset, to obtain the detection box of the target object in the detection frame.
- when selecting one candidate box from the multiple candidate boxes according to the classification result and the regression result, this may be implemented as follows: select a candidate box from the multiple candidate boxes according to weight coefficients of the classification result and the regression result; for example, compute a comprehensive score from the probability value and the offset of each candidate box according to the weight coefficients of the classification result and the regression result, and then select a candidate box from the multiple candidate boxes according to the comprehensive scores.
- the method further includes: adjusting the probability values of the candidate boxes according to the amounts of change in position and size in the regression result. For example, a candidate box with a large change in position (i.e., a large positional movement) or a large change in size (i.e., a large change in shape) is penalized by lowering its probability value.
- this may be implemented as follows: according to the adjusted classification result, select one candidate box from the multiple candidate boxes; for example, select the candidate box with the highest adjusted probability value.
- the local area detector may include a third convolutional layer, a fourth convolutional layer, and two convolution operation units.
- the network formed by combining the local area detector with the first convolutional layer and the second convolutional layer may also be referred to as a region proposal network.
- the target detection method and the training method of the target detection network provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to: a terminal device, a server, and the like.
- any target detection method or training method of the target detection network provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any target detection method or training method of the target detection network mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. This will not be repeated below.
- the foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
- FIG. 5 is a schematic structural diagram of an embodiment of an object detecting apparatus according to the present disclosure.
- the object detecting device of each embodiment of the present disclosure can be used to implement the above-described various object detecting method embodiments of the present disclosure.
- the object detecting apparatus of this embodiment includes a neural network, a first convolutional layer, a second convolutional layer, a local area detector, and an acquiring unit, wherein:
- the neural network is configured to separately extract features of the template frame and the detection frame, wherein the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame.
- the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame; the detection frame is the current frame in which the target object is to be detected, or a region image of the current frame that may contain the target object.
- the template frame is a frame in the video sequence whose detection timing is before the detection frame and in which the detection box of the target object has been determined.
- the neural networks used to extract the features of the template frame and of the detection frame may be the same neural network, or may be different neural networks having the same structure.
- a first convolution layer configured to perform a convolution operation on the feature of the template frame, and use the first feature obtained by the convolution operation as the classification weight of the local area detector.
- a second convolutional layer configured to perform a convolution operation on the feature of the template frame, and use the second feature obtained by the convolution operation as the regression weight of the local area detector.
- a local area detector configured to output classification results and regression results of a plurality of candidate boxes according to the feature of the detection frame; wherein the classification result includes, for each candidate box, the probability value that it is the detection box of the target object, and the regression result includes the offset of each candidate box relative to the detection box corresponding to the template frame.
- an obtaining unit configured to acquire a detection frame of the target object in the detection frame according to the classification result and the regression result of the multiple candidate frames output by the local area detector.
- the features of the template frame and the detection frame are extracted separately by the neural network, the classification weight and the regression weight of the local area detector are acquired based on the feature of the template frame, and the feature of the detection frame is input into the local area detector.
- a neural network that is the same, or has the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames change little, which helps to improve the accuracy of the target object detection result in the detection frame; with the classification weight and the regression weight of the local area detector obtained from the feature of the template frame, the local area detector can output the classification results and regression results of the multiple candidate boxes of the detection frame, from which the detection box of the target object in the detection frame is obtained; this better estimates the position and size changes of the target object and locates the target object in the detection frame more accurately, thereby improving both the speed and the accuracy of target tracking; the tracking is effective and fast.
- the local area detector is configured to: perform a convolution operation on the feature of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and perform a convolution operation on the feature of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
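Convolving the detection-frame feature with a template-derived weight amounts to a cross-correlation of the two feature maps. A minimal single-channel sketch follows; the real detector operates over many channels and anchor boxes, so this is an illustration under simplifying assumptions, not the disclosed implementation.

```python
def correlate2d_valid(feat, kernel):
    """'Valid' 2D cross-correlation of a detection-frame feature map with a
    kernel derived from the template frame, as in the classification and
    regression branches. Single-channel lists-of-lists sketch; a full
    implementation would also sum over feature channels."""
    fh, fw = len(feat), len(feat[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(fh - kh + 1):
        row = []
        for j in range(fw - kw + 1):
            s = sum(feat[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out
```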
- the apparatus may further include: a preprocessing unit configured to take the center point of the template frame as the center point and crop from the current frame, as the detection frame, a region image whose length and/or width is correspondingly larger than the image length and/or width of the template frame.
- FIG. 6 is a schematic structural view of another embodiment of the target detecting device of the present disclosure.
- the apparatus further includes: a third convolutional layer configured to perform a convolution operation on the feature of the detection frame to obtain a third feature, where the number of channels of the third feature is the same as the number of channels of the feature of the detection frame.
- the local area detector is configured to perform a convolution operation on the third feature using the classification weight.
- the apparatus further includes: a fourth convolutional layer configured to perform a convolution operation on the feature of the detection frame to obtain a fourth feature, where the number of channels of the fourth feature is the same as the number of channels of the feature of the detection frame.
- the local area detector is configured to perform a convolution operation on the fourth feature using the regression weights.
- the acquiring unit is configured to: select one candidate box from the plurality of candidate boxes according to the classification result and the regression result, and regress the selected candidate box according to its offset, to obtain the detection box of the target object in the detection frame.
- when selecting one candidate box from the plurality of candidate boxes according to the classification result and the regression result, the acquiring unit is configured to: select one candidate box from the multiple candidate boxes according to weight coefficients of the classification result and the regression result.
- the method further includes: an adjustment unit, configured to adjust the classification result according to the regression result.
- when selecting one candidate box from the plurality of candidate boxes according to the classification result and the regression result, the acquiring unit selects it according to the adjusted classification result.
- FIG. 7 is a schematic structural diagram of still another embodiment of the object detecting apparatus of the present disclosure.
- the object detecting apparatus of this embodiment can be used to implement the training method embodiment of any of the target detecting networks of FIGS. 3 to 4 of the present disclosure.
- the object detecting apparatus of this embodiment further includes: a training unit configured to use the obtained detection box of the target object in the detection frame as the predicted detection box, and to train the neural network, the first convolutional layer, and the second convolutional layer based on the annotation information of the detection frame and the predicted detection box.
- the annotation information of the detection frame includes: the position and size of the annotated detection box of the target object in the detection frame.
- the training unit is configured to adjust the weight values of the neural network, the first convolutional layer, and the second convolutional layer according to the differences between the position and size of the annotated detection box and the position and size of the predicted detection box.
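A minimal sketch of such a position/size difference follows, assuming an L1 form; the disclosure only states that the weights are adjusted according to the difference and does not fix a loss function.

```python
def box_l1_loss(pred, gt):
    """L1 difference between the predicted detection box and the annotated
    detection box, each given as (cx, cy, w, h). The L1 form is an assumed
    example; training would backpropagate a loss of this kind through the
    network and the two convolutional layers."""
    return sum(abs(p - g) for p, g in zip(pred, gt))
```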
- the features of the template frame and the detection frame are extracted separately by the neural network, the classification weight and the regression weight of the local area detector are obtained based on the feature of the template frame, the feature of the detection frame is input into the local area detector, and the target detection network is trained based on the annotation information and the predicted detection box.
- a neural network that is the same, or has the same structure, can better extract similar features of the same target object, so that the features of the target object extracted in different frames change little, which helps to improve the accuracy of the target object detection result in the detection frame; with the classification weight and the regression weight of the local area detector obtained from the feature of the template frame, the local area detector can output the classification results and regression results of the multiple candidate boxes of the detection frame, from which the detection box of the target object in the detection frame is obtained; this better estimates the position and size changes of the target object and locates the target object in the detection frame more accurately, thereby improving both the speed and the accuracy of target tracking; the tracking is effective and fast.
- FIG. 8 is a schematic structural diagram of an application embodiment of the target detecting device of the present disclosure.
- FIG. 9 is a schematic structural diagram of another application embodiment of the target detecting device of the present disclosure.
- L×M×N, for example 256×20×20, where L represents the number of channels, and M and N represent the height (i.e., length) and width, respectively.
- An embodiment of the present disclosure further provides an electronic device comprising the object detecting device of any of the above embodiments of the present disclosure.
- An embodiment of the present disclosure further provides another electronic device, including: a memory for storing executable instructions; and a processor for communicating with the memory to execute the executable instructions, so as to perform the operations of the target detection method or of the training method of the target detection network of any of the above embodiments of the present disclosure.
- FIG. 10 is a schematic structural diagram of an application embodiment of an electronic device according to the present disclosure.
- the electronic device includes one or more processors, a communication portion, and the like, such as one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs); the processor can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage portion into a random access memory (RAM).
- the communication portion may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory and/or the random access memory to execute executable instructions, connect to the communication portion through the bus, and communicate with other target devices through the communication portion, so as to complete the operations corresponding to any method provided by the embodiments of the present application, for example: extracting the features of the template frame and the detection frame respectively via a neural network, wherein the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame; acquiring a classification weight and a regression weight of the local area detector based on the feature of the template frame; inputting the feature of the detection frame into the local area detector to obtain the classification results and regression results of the plurality of candidate boxes output by the local area detector; and acquiring the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local area detector.
- alternatively, for example: extracting the features of the template frame and the detection frame respectively via a neural network, wherein the template frame is an image of the detection box of the target object, and the image size of the template frame is smaller than that of the detection frame; increasing the channels of the feature of the template frame through a first convolutional layer and using the obtained first feature as the classification weight of the local area detector; increasing the channels of the feature of the template frame through a second convolutional layer and using the obtained second feature as the regression weight of the local area detector; inputting the feature of the detection frame into the local area detector to obtain the classification results and regression results of the plurality of candidate boxes output by the local area detector; acquiring the detection box of the target object in the detection frame according to those classification results and regression results; and, using the obtained detection box of the target object in the detection frame as the predicted detection box, training the neural network, the first convolutional layer, and the second convolutional layer based on the annotation information of the detection frame and the predicted detection box.
- the CPU, ROM, and RAM are connected to each other through a bus.
- the ROM is an optional module.
- the RAM stores executable instructions, or writes executable instructions to the ROM at runtime, the executable instructions causing the processor to perform operations corresponding to any of the methods described above.
- An input/output (I/O) interface is also connected to the bus.
- the communication portion can be integrated, or can be set up with multiple sub-modules (e.g., multiple IB network cards) linked on the bus.
- the following components are connected to the I/O interface: an input portion including a keyboard, a mouse, and the like; an output portion including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker; a storage portion including a hard disk and the like; and a communication portion including a network interface card such as a LAN card, a modem, and the like.
- the communication section performs communication processing via a network such as the Internet.
- the drive is also connected to the I/O interface as needed.
- a removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive as needed so that a computer program read therefrom is installed into the storage portion as needed.
- FIG. 10 is only an optional implementation manner.
- the number and type of components in the foregoing FIG. 10 may be selected, deleted, added, or replaced according to actual needs;
- Different functional components can also be implemented separately or in an integrated manner; for example, the GPU and the CPU can be set separately, or the GPU can be integrated on the CPU; the communication portion can be set separately, or can be integrated on the CPU or the GPU; and so on.
- an embodiment of the present disclosure further provides a computer storage medium for storing computer-readable instructions which, when executed, implement the operations of the target detection method or of the training method of the target detection network of the foregoing embodiments of the present disclosure.
- embodiments of the present disclosure also provide a computer program comprising computer-readable instructions; when the computer-readable instructions are run in a device, a processor in the device executes executable instructions for the steps in the target detection method or in the training method of the target detection network of any of the above embodiments.
- Embodiments of the present disclosure may perform single-target tracking. For example, in a multi-target tracking system, target detection need not be performed on every frame; instead, detection can run at a fixed interval, for example every 10 frames, and the 9 intermediate frames can be handled by single-target tracking to determine the target positions in those frames. Since the algorithm of the embodiments of the present disclosure is faster, the multi-target tracking system as a whole can complete tracking faster and achieve better results.
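The fixed-interval scheme described above can be sketched as a simple schedule; the function name and return values are illustrative.

```python
def schedule(num_frames, interval=10):
    """Label each frame with the stage it runs: full multi-target detection
    every `interval` frames (10 in the text's example), single-target
    tracking for the intermediate frames."""
    return ["detect" if i % interval == 0 else "track"
            for i in range(num_frames)]
```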
- the foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
- the methods and apparatus of the present disclosure may be implemented in a number of ways.
- the methods and apparatus of the present disclosure may be implemented in software, hardware, firmware or any combination of software, hardware, firmware.
- the above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present disclosure are not limited to the order described above unless otherwise specifically stated.
- the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine readable instructions for implementing a method in accordance with the present disclosure.
- the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Abstract
Description
Claims (43)
- 1. A target detection method, characterized by comprising: extracting features of a template frame and a detection frame respectively via a neural network, wherein the template frame is an image of the detection box of a target object, and the image size of the template frame is smaller than that of the detection frame; acquiring a classification weight and a regression weight of a local area detector based on the feature of the template frame; inputting the feature of the detection frame into the local area detector to obtain classification results and regression results of a plurality of candidate boxes output by the local area detector; and acquiring the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local area detector.
- 2. The method according to claim 1, characterized by further comprising: extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame; inputting the features of the at least one other detection frame into the local area detector in sequence, to obtain in sequence the plurality of candidate boxes in the at least one other detection frame output by the local area detector, as well as the classification result and regression result of each candidate box; and acquiring in sequence the detection box of the target object in the at least one other detection frame according to the classification results and regression results of the plurality of candidate boxes of the at least one other detection frame.
- 3. The method according to claim 1 or 2, characterized in that extracting the features of the template frame and the detection frame respectively via a neural network comprises: extracting the features of the template frame and the detection frame respectively via the same neural network; or extracting the features of the template frame and the detection frame respectively via different neural networks having the same structure.
- 4. The method according to any one of claims 1-3, characterized in that the template frame is a frame in the video sequence whose detection timing is before the detection frame and in which the detection box of the target object has been determined.
- 5. The method according to any one of claims 1-4, characterized in that the detection frame is a current frame in which the target object is to be detected, or a region image of the current frame that may contain the target object.
- 6. The method according to claim 5, characterized in that, when the detection frame is a region image of the current frame that may contain the target object, the method further comprises: taking the center point of the template frame as the center point, cropping from the current frame a region image whose length and/or width is correspondingly larger than the image length and/or width of the template frame, as the detection frame.
- 7. The method according to any one of claims 1-6, characterized in that acquiring the classification weight of the local area detector based on the feature of the template frame comprises: performing a convolution operation on the feature of the template frame through a first convolutional layer, and using the first feature obtained by the convolution operation as the classification weight of the local area detector.
- 8. The method according to any one of claims 1-7, characterized in that acquiring the regression weight of the local area detector based on the feature of the template frame comprises: performing a convolution operation on the feature of the template frame through a second convolutional layer, and using the second feature obtained by the convolution operation as the regression weight of the local area detector.
- 9. The method according to any one of claims 1-8, characterized in that inputting the feature of the detection frame into the local area detector to obtain the classification results and regression results of the plurality of candidate boxes output by the local area detector comprises: performing a convolution operation on the feature of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and performing a convolution operation on the feature of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
- 10. The method according to claim 9, characterized in that, after extracting the feature of the detection frame, the method further comprises: performing a convolution operation on the feature of the detection frame through a third convolutional layer to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the feature of the detection frame; and in that performing a convolution operation on the feature of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes comprises: performing a convolution operation on the third feature using the classification weight to obtain the classification results of the plurality of candidate boxes.
- 11. The method according to claim 9 or 10, characterized in that, after extracting the feature of the template frame, the method further comprises: performing a convolution operation on the feature of the detection frame through a fourth convolutional layer to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the feature of the detection frame; and in that performing a convolution operation on the feature of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes comprises: performing a convolution operation on the fourth feature using the regression weight to obtain the regression results of the plurality of candidate boxes.
- 12. The method according to any one of claims 1-11, characterized in that acquiring the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local area detector comprises: selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to the offset of the selected candidate box, to obtain the detection box of the target object in the detection frame.
- 13. The method according to claim 12, characterized in that selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the plurality of candidate boxes according to weight coefficients of the classification results and the regression results.
- 14. The method according to claim 12, characterized in that, after obtaining the regression results, the method further comprises: adjusting the classification results according to the regression results; and in that selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the plurality of candidate boxes according to the adjusted classification results.
- 15. A training method for a target detection network, characterized by comprising: extracting features of a template frame and a detection frame respectively via a neural network, wherein the template frame is an image of the detection box of a target object, and the image size of the template frame is smaller than that of the detection frame; performing a convolution operation on the feature of the template frame through a first convolutional layer, and using the first feature obtained by the convolution operation as the classification weight of the local area detector; and performing a convolution operation on the feature of the template frame through a second convolutional layer, and using the second feature obtained by the convolution operation as the regression weight of the local area detector; inputting the feature of the detection frame into the local area detector to obtain classification results and regression results of a plurality of candidate boxes output by the local area detector; acquiring the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local area detector; and using the obtained detection box of the target object in the detection frame as a predicted detection box, and training the neural network, the first convolutional layer, and the second convolutional layer based on annotation information of the detection frame and the predicted detection box.
- 16. The method according to claim 15, characterized by further comprising: extracting, via the neural network, features of at least one other detection frame whose timing in the video sequence is after the detection frame; inputting the features of the at least one other detection frame into the local area detector in sequence, to obtain in sequence the plurality of candidate boxes in the at least one other detection frame output by the local area detector, as well as the classification result and regression result of each candidate box; and acquiring in sequence the detection box of the target object in the at least one other detection frame according to the classification results and regression results of the plurality of candidate boxes of the at least one other detection frame.
- 17. The method according to claim 15 or 16, characterized in that extracting the features of the template frame and the detection frame respectively via a neural network comprises: extracting the features of the template frame and the detection frame respectively via the same neural network; or extracting the features of the template frame and the detection frame respectively via different neural networks having the same structure.
- 18. The method according to any one of claims 15-17, characterized in that the template frame is a frame in the video sequence whose detection timing is before the detection frame and in which the detection box of the target object has been determined.
- 19. The method according to any one of claims 15-18, characterized in that the detection frame is a current frame in which the target object is to be detected, or a region image of the current frame that may contain the target object.
- 20. The method according to claim 19, characterized in that, when the detection frame is a region image of the current frame that may contain the target object, the method further comprises: taking the center point of the template frame as the center point, cropping from the current frame a region image whose length and/or width is correspondingly larger than the image length and/or width of the template frame, as the detection frame.
- 21. The method according to any one of claims 15-20, characterized in that inputting the feature of the detection frame into the local area detector to obtain the classification results and regression results of the plurality of candidate boxes output by the local area detector comprises: performing a convolution operation on the feature of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and performing a convolution operation on the feature of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
- 22. The method according to claim 21, characterized in that, after extracting the feature of the detection frame, the method further comprises: performing a convolution operation on the feature of the detection frame through a third convolutional layer to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the feature of the detection frame; and in that performing a convolution operation on the feature of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes comprises: performing a convolution operation on the third feature using the classification weight to obtain the classification results of the plurality of candidate boxes.
- 23. The method according to claim 21, characterized in that, after extracting the feature of the template frame, the method further comprises: performing a convolution operation on the feature of the detection frame through a fourth convolutional layer to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the feature of the detection frame; and in that performing a convolution operation on the feature of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes comprises: performing a convolution operation on the fourth feature using the regression weight to obtain the regression results of the plurality of candidate boxes.
- 24. The method according to any one of claims 15-23, characterized in that acquiring the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local area detector comprises: selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and regressing the selected candidate box according to the offset of the selected candidate box, to obtain the detection box of the target object in the detection frame.
- 25. The method according to claim 24, characterized in that selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the plurality of candidate boxes according to weight coefficients of the classification results and the regression results.
- 26. The method according to claim 25, characterized in that, after obtaining the regression results, the method further comprises: adjusting the classification results according to the regression results; and in that selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results comprises: selecting one candidate box from the plurality of candidate boxes according to the adjusted classification results.
- 27. The method according to any one of claims 15-26, characterized in that the annotation information of the detection frame comprises: the position and size of the annotated detection box of the target object in the detection frame; and in that using the obtained detection box of the target object in the detection frame as the predicted detection box and training the neural network, the first convolutional layer, and the second convolutional layer based on the annotation information of the detection frame and the predicted detection box comprises: adjusting the weight values of the neural network, the first convolutional layer, and the second convolutional layer according to the differences between the position and size of the annotated detection box and the position and size of the predicted detection box.
- 28. A target detection apparatus, characterized by comprising: a neural network configured to extract features of a template frame and a detection frame respectively, wherein the template frame is an image of the detection box of a target object, and the image size of the template frame is smaller than that of the detection frame; a first convolutional layer configured to increase the channels of the feature of the template frame, the obtained first feature being used as the classification weight of a local area detector; a second convolutional layer configured to increase the channels of the feature of the template frame, the obtained second feature being used as the regression weight of the local area detector; a local area detector configured to output classification results and regression results of a plurality of candidate boxes according to the feature of the detection frame; and an acquiring unit configured to acquire the detection box of the target object in the detection frame according to the classification results and regression results of the plurality of candidate boxes output by the local area detector.
- 29. The apparatus according to claim 28, characterized in that the neural network comprises different neural networks that have the same structure and are respectively used to extract the features of the template frame and of the detection frame.
- 30. The apparatus according to claim 28 or 29, characterized in that the template frame is a frame in the video sequence whose detection timing is before the detection frame and in which the detection box of the target object has been determined.
- 31. The apparatus according to any one of claims 28-30, characterized in that the detection frame is a current frame in which the target object is to be detected, or a region image of the current frame that may contain the target object.
- 32. The apparatus according to claim 31, characterized by further comprising: a preprocessing unit configured to take the center point of the template frame as the center point and crop from the current frame, as the detection frame, a region image whose length and/or width is correspondingly larger than the image length and/or width of the template frame.
- 33. The apparatus according to any one of claims 28-32, characterized in that the local area detector is configured to: perform a convolution operation on the feature of the detection frame using the classification weight to obtain the classification results of the plurality of candidate boxes; and perform a convolution operation on the feature of the detection frame using the regression weight to obtain the regression results of the plurality of candidate boxes.
- 34. The apparatus according to claim 33, characterized by further comprising: a third convolutional layer configured to perform a convolution operation on the feature of the detection frame to obtain a third feature, the number of channels of the third feature being the same as the number of channels of the feature of the detection frame; the local area detector being configured to perform a convolution operation on the third feature using the classification weight.
- 35. The apparatus according to claim 33, characterized by further comprising: a fourth convolutional layer configured to perform a convolution operation on the feature of the detection frame to obtain a fourth feature, the number of channels of the fourth feature being the same as the number of channels of the feature of the detection frame; the local area detector being configured to perform a convolution operation on the fourth feature using the regression weight.
- 36. The apparatus according to any one of claims 28-35, characterized in that the acquiring unit is configured to: select one candidate box from the plurality of candidate boxes according to the classification results and the regression results, and regress the selected candidate box according to the offset of the selected candidate box, to obtain the detection box of the target object in the detection frame.
- 37. The apparatus according to claim 36, characterized in that, when selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, the acquiring unit is configured to: select one candidate box from the plurality of candidate boxes according to weight coefficients of the classification results and the regression results.
- 38. The apparatus according to claim 36, characterized by further comprising: an adjustment unit configured to adjust the classification results according to the regression results; and in that, when selecting one candidate box from the plurality of candidate boxes according to the classification results and the regression results, the acquiring unit is configured to: select one candidate box from the plurality of candidate boxes according to the adjusted classification results.
- 39. The apparatus according to any one of claims 28-38, characterized by further comprising: a training unit configured to use the obtained detection box of the target object in the detection frame as a predicted detection box, and to train the neural network, the first convolutional layer, and the second convolutional layer based on annotation information of the detection frame and the predicted detection box.
- 40. The apparatus according to claim 39, characterized in that the annotation information of the detection frame comprises: the position and size of the annotated detection box of the target object in the detection frame; the training unit being configured to adjust the weight values of the neural network, the first convolutional layer, and the second convolutional layer according to the differences between the position and size of the annotated detection box and the position and size of the predicted detection box.
- 41. An electronic device, characterized by comprising the target detection apparatus according to any one of claims 28-40.
- 42. An electronic device, characterized by comprising: a memory configured to store executable instructions; and a processor configured to communicate with the memory to execute the executable instructions, so as to complete the operations of the method according to any one of claims 1-27.
- 43. A computer storage medium for storing computer-readable instructions, characterized in that, when the instructions are executed, the operations of the method according to any one of claims 1-27 are implemented.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020526040A JP7165731B2 (ja) | 2017-11-12 | 2018-11-09 | 目標検出方法及び装置、トレーニング方法、電子機器並びに媒体 |
SG11202004324WA SG11202004324WA (en) | 2017-11-12 | 2018-11-09 | Target detection method and apparatus, training method, electronic device and medium |
KR1020207016026A KR20200087784A (ko) | 2017-11-12 | 2018-11-09 | 목표 검출 방법 및 장치, 트레이닝 방법, 전자 기기 및 매체 |
US16/868,427 US11455782B2 (en) | 2017-11-12 | 2020-05-06 | Target detection method and apparatus, training method, electronic device and medium |
PH12020550588A PH12020550588A1 (en) | 2017-11-12 | 2020-05-07 | Target detection method and apparatus, training method, electronic device and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711110587.1 | 2017-11-12 | ||
CN201711110587.1A CN108230359B (zh) | 2017-11-12 | 2017-11-12 | Target detection method and apparatus, training method, electronic device, program and medium |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/868,427 Continuation US11455782B2 (en) | 2017-11-12 | 2020-05-06 | Target detection method and apparatus, training method, electronic device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019091464A1 (zh) | 2019-05-16 |
Family
ID=62655730
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/114884 WO2019091464A1 (zh) | 2017-11-12 | 2018-11-09 | Target detection method and apparatus, training method, electronic device and medium |
Country Status (7)
Country | Link |
---|---|
US (1) | US11455782B2 (zh) |
JP (1) | JP7165731B2 (zh) |
KR (1) | KR20200087784A (zh) |
CN (1) | CN108230359B (zh) |
PH (1) | PH12020550588A1 (zh) |
SG (1) | SG11202004324WA (zh) |
WO (1) | WO2019091464A1 (zh) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399900A (zh) * | 2019-06-26 | 2019-11-01 | 腾讯科技(深圳)有限公司 | Object detection method, apparatus, device and medium |
CN110533184A (zh) * | 2019-08-31 | 2019-12-03 | 南京人工智能高等研究院有限公司 | Network model training method and apparatus |
CN111898701A (zh) * | 2020-08-13 | 2020-11-06 | 网易(杭州)网络有限公司 | Model training, frame image generation and frame interpolation methods, apparatuses, devices and media |
CN112465868A (zh) * | 2020-11-30 | 2021-03-09 | 浙江大华汽车技术有限公司 | Target detection and tracking method and apparatus, storage medium and electronic apparatus |
CN112528932A (zh) * | 2020-12-22 | 2021-03-19 | 北京百度网讯科技有限公司 | Method and apparatus for optimizing position information, roadside device and cloud control platform |
CN113160247A (zh) * | 2021-04-22 | 2021-07-23 | 福州大学 | Noise-resistant siamese network target tracking method based on frequency separation |
CN113327253A (zh) * | 2021-05-24 | 2021-08-31 | 北京市遥感信息研究所 | Dim and small target detection method based on spaceborne infrared remote sensing images |
JP2022511221A (ja) | 2019-09-24 | 2022-01-31 | 北京市商湯科技開発有限公司 | Image processing method, image processing apparatus, processor, electronic device, storage medium and computer program |
US11429809B2 (en) | 2019-09-24 | 2022-08-30 | Beijing Sensetime Technology Development Co., Ltd | Image processing method, image processing device, and storage medium |
JP2022551396A (ja) | 2019-11-20 | 2022-12-09 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108230359B (zh) | 2017-11-12 | 2021-01-26 | 北京市商汤科技开发有限公司 | Target detection method and apparatus, training method, electronic device, program and medium |
US11430312B2 (en) * | 2018-07-05 | 2022-08-30 | Movidius Limited | Video surveillance with neural networks |
CN109584276B (zh) * | 2018-12-04 | 2020-09-25 | 北京字节跳动网络技术有限公司 | Keypoint detection method, apparatus, device and readable medium |
CN109726683B (zh) * | 2018-12-29 | 2021-06-22 | 北京市商汤科技开发有限公司 | Target object detection method and apparatus, electronic device and storage medium |
CN111435432B (zh) * | 2019-01-15 | 2023-05-26 | 北京市商汤科技开发有限公司 | Network optimization method and apparatus, image processing method and apparatus, and storage medium |
CN110136107B (zh) * | 2019-05-07 | 2023-09-05 | 上海交通大学 | Automatic analysis method for X-ray coronary angiography sequences based on DSSD and temporal-domain constraints |
CN110598785B (zh) * | 2019-09-11 | 2021-09-07 | 腾讯科技(深圳)有限公司 | Method and apparatus for generating training sample images |
US11080833B2 (en) * | 2019-11-22 | 2021-08-03 | Adobe Inc. | Image manipulation using deep learning techniques in a patch matching operation |
CN110942065B (zh) * | 2019-11-26 | 2023-12-12 | Oppo广东移动通信有限公司 | Text box selection method, apparatus, terminal device and computer-readable storage medium |
KR102311798B1 (ko) * | 2019-12-12 | 2021-10-08 | 포항공과대학교 산학협력단 | Multi-object tracking method and apparatus |
JP7490359B2 (ja) * | 2019-12-24 | 2024-05-27 | キヤノン株式会社 | Information processing apparatus, information processing method and program |
CN111383244B (zh) * | 2020-02-28 | 2023-09-01 | 浙江大华技术股份有限公司 | Target detection and tracking method |
CN112215899B (zh) * | 2020-09-18 | 2024-01-30 | 深圳市瑞立视多媒体科技有限公司 | Online frame data processing method, apparatus and computer device |
CN112381136B (zh) * | 2020-11-12 | 2022-08-19 | 深兰智能科技(上海)有限公司 | Target detection method and apparatus |
CN112464797B (zh) * | 2020-11-25 | 2024-04-02 | 创新奇智(成都)科技有限公司 | Smoking behavior detection method, apparatus, storage medium and electronic device |
CN112465691A (zh) * | 2020-11-25 | 2021-03-09 | 北京旷视科技有限公司 | Image processing method, apparatus, electronic device and computer-readable medium |
CN112580474B (zh) * | 2020-12-09 | 2021-09-24 | 云从科技集团股份有限公司 | Computer-vision-based target object detection method, system, device and medium |
CN112906478B (zh) * | 2021-01-22 | 2024-01-09 | 北京百度网讯科技有限公司 | Target object recognition method, apparatus, device and storage medium |
CN113128564B (zh) * | 2021-03-23 | 2022-03-22 | 武汉泰沃滋信息技术有限公司 | Deep-learning-based method and system for detecting typical targets against complex backgrounds |
CN113221962B (zh) * | 2021-04-21 | 2022-06-21 | 哈尔滨工程大学 | Single-stage 3D point cloud target detection method decoupling the classification and regression tasks |
CN113076923A (zh) * | 2021-04-21 | 2021-07-06 | 山东大学 | Mask-wearing detection method, device and storage medium based on the lightweight MobileNet-SSD network |
CN113065618A (zh) * | 2021-06-03 | 2021-07-02 | 常州微亿智造科技有限公司 | Detection method and detection apparatus for industrial quality inspection |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105976400A (zh) * | 2016-05-10 | 2016-09-28 | 北京旷视科技有限公司 | Target tracking method and apparatus based on a neural network model |
CN106326837A (zh) * | 2016-08-09 | 2017-01-11 | 北京旷视科技有限公司 | Object tracking method and apparatus |
CN106355188A (zh) * | 2015-07-13 | 2017-01-25 | 阿里巴巴集团控股有限公司 | Image detection method and apparatus |
EP3229206A1 (en) * | 2016-04-04 | 2017-10-11 | Xerox Corporation | Deep data association for online multi-class multi-object tracking |
CN108230359A (zh) * | 2017-11-12 | 2018-06-29 | 北京市商汤科技开发有限公司 | Target detection method and apparatus, training method, electronic device, program and medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6192360B1 (en) * | 1998-06-23 | 2001-02-20 | Microsoft Corporation | Methods and apparatus for classifying text and for building a text classifier |
US20070230792A1 (en) * | 2004-04-08 | 2007-10-04 | Mobileye Technologies Ltd. | Pedestrian Detection |
CN104424634B (zh) * | 2013-08-23 | 2017-05-03 | 株式会社理光 | Object tracking method and apparatus |
CN105900116A (zh) * | 2014-02-10 | 2016-08-24 | 三菱电机株式会社 | Hierarchical neural network apparatus, discriminator learning method and discrimination method |
CN105740910A (zh) * | 2016-02-02 | 2016-07-06 | 北京格灵深瞳信息技术有限公司 | Vehicle object detection method and apparatus |
JP6832504B2 (ja) | 2016-08-08 | 2021-02-24 | パナソニックIPマネジメント株式会社 | Object tracking method, object tracking apparatus and program |
CN106650630B (zh) * | 2016-11-11 | 2019-08-23 | 纳恩博(北京)科技有限公司 | Target tracking method and electronic device |
CN106709936A (zh) * | 2016-12-14 | 2017-05-24 | 北京工业大学 | Single-target tracking method based on a convolutional neural network |
CN107066990B (zh) * | 2017-05-04 | 2019-10-11 | 厦门美图之家科技有限公司 | Target tracking method and mobile device |
CN109726683B (zh) * | 2018-12-29 | 2021-06-22 | 北京市商汤科技开发有限公司 | Target object detection method and apparatus, electronic device and storage medium |
2017
- 2017-11-12 CN CN201711110587.1A patent/CN108230359B/zh active Active
2018
- 2018-11-09 JP JP2020526040A patent/JP7165731B2/ja active Active
- 2018-11-09 WO PCT/CN2018/114884 patent/WO2019091464A1/zh active Application Filing
- 2018-11-09 KR KR1020207016026A patent/KR20200087784A/ko not_active Application Discontinuation
- 2018-11-09 SG SG11202004324WA patent/SG11202004324WA/en unknown
2020
- 2020-05-06 US US16/868,427 patent/US11455782B2/en active Active
- 2020-05-07 PH PH12020550588A patent/PH12020550588A1/en unknown
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106355188A (zh) * | 2015-07-13 | 2017-01-25 | 阿里巴巴集团控股有限公司 | Image detection method and apparatus |
EP3229206A1 (en) * | 2016-04-04 | 2017-10-11 | Xerox Corporation | Deep data association for online multi-class multi-object tracking |
CN105976400A (zh) * | 2016-05-10 | 2016-09-28 | 北京旷视科技有限公司 | Target tracking method and apparatus based on a neural network model |
CN106326837A (zh) * | 2016-08-09 | 2017-01-11 | 北京旷视科技有限公司 | Object tracking method and apparatus |
CN108230359A (zh) * | 2017-11-12 | 2018-06-29 | 北京市商汤科技开发有限公司 | Target detection method and apparatus, training method, electronic device, program and medium |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399900A (zh) * | 2019-06-26 | 2019-11-01 | 腾讯科技(深圳)有限公司 | Object detection method, apparatus, device and medium |
CN110533184A (zh) * | 2019-08-31 | 2019-12-03 | 南京人工智能高等研究院有限公司 | Network model training method and apparatus |
JP2022511221A (ja) | 2019-09-24 | 2022-01-31 | 北京市商湯科技開発有限公司 | Image processing method, image processing apparatus, processor, electronic device, storage medium and computer program |
JP7108123B2 (ja) | 2019-09-24 | 2022-07-27 | 北京市商湯科技開発有限公司 | Image processing method, image processing apparatus, processor, electronic device, storage medium and computer program |
US11429809B2 (en) | 2019-09-24 | 2022-08-30 | Beijing Sensetime Technology Development Co., Ltd | Image processing method, image processing device, and storage medium |
JP7274048B2 (ja) | 2019-11-20 | 2023-05-15 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device |
US11928893B2 (en) | 2019-11-20 | 2024-03-12 | Tencent Technology (Shenzhen) Company Limited | Action recognition method and apparatus, computer storage medium, and computer device |
JP2022551396A (ja) | 2019-11-20 | 2022-12-09 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Action recognition method, apparatus, computer program and computer device |
CN111898701A (zh) * | 2020-08-13 | 2020-11-06 | 网易(杭州)网络有限公司 | Model training, frame image generation and frame interpolation methods, apparatuses, devices and media |
CN111898701B (zh) | 2020-08-13 | 2023-07-25 | 网易(杭州)网络有限公司 | Model training, frame image generation and frame interpolation methods, apparatuses, devices and media |
CN112465868A (zh) * | 2020-11-30 | 2021-03-09 | 浙江大华汽车技术有限公司 | Target detection and tracking method and apparatus, storage medium and electronic apparatus |
CN112465868B (zh) | 2020-11-30 | 2024-01-12 | 浙江华锐捷技术有限公司 | Target detection and tracking method and apparatus, storage medium and electronic apparatus |
CN112528932A (zh) * | 2020-12-22 | 2021-03-19 | 北京百度网讯科技有限公司 | Method and apparatus for optimizing position information, roadside device and cloud control platform |
CN112528932B (zh) | 2020-12-22 | 2023-12-08 | 阿波罗智联(北京)科技有限公司 | Method and apparatus for optimizing position information, roadside device and cloud control platform |
CN113160247B (zh) | 2021-04-22 | 2022-07-05 | 福州大学 | Noise-resistant siamese network target tracking method based on frequency separation |
CN113160247A (zh) * | 2021-04-22 | 2021-07-23 | 福州大学 | Noise-resistant siamese network target tracking method based on frequency separation |
CN113327253A (zh) * | 2021-05-24 | 2021-08-31 | 北京市遥感信息研究所 | Dim and small target detection method based on spaceborne infrared remote sensing images |
CN113327253B (zh) | 2021-05-24 | 2024-05-24 | 北京市遥感信息研究所 | Dim and small target detection method based on spaceborne infrared remote sensing images |
Also Published As
Publication number | Publication date |
---|---|
PH12020550588A1 (en) | 2021-04-26 |
SG11202004324WA (en) | 2020-06-29 |
US20200265255A1 (en) | 2020-08-20 |
JP7165731B2 (ja) | 2022-11-04 |
JP2021502645A (ja) | 2021-01-28 |
KR20200087784A (ko) | 2020-07-21 |
CN108230359B (zh) | 2021-01-26 |
US11455782B2 (en) | 2022-09-27 |
CN108230359A (zh) | 2018-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019091464A1 (zh) | Target detection method and apparatus, training method, electronic device and medium | |
TWI773189B (zh) | Artificial-intelligence-based object detection method, apparatus, device and storage medium | |
WO2018099473A1 (zh) | Scene analysis method and system, and electronic device | |
WO2020134557A1 (zh) | Target object detection method and apparatus, electronic device and storage medium | |
JP6397144B2 (ja) | Business discovery from images | |
WO2018019126A1 (zh) | Video category recognition method and apparatus, data processing apparatus and electronic device | |
US10769496B2 (en) | Logo detection | |
WO2019105337A1 (zh) | Video-based face recognition method, apparatus, device, medium and program | |
CN108154222B (zh) | Deep neural network training method and system, and electronic device | |
WO2018054329A1 (zh) | Object detection method and apparatus, electronic device, computer program and storage medium | |
WO2018121737A1 (zh) | Keypoint prediction, network training and image processing methods and apparatuses, and electronic device | |
US20150010203A1 (en) | Methods, apparatuses and computer program products for performing accurate pose estimation of objects | |
US20210124928A1 (en) | Object tracking methods and apparatuses, electronic devices and storage media | |
WO2019020062A1 (zh) | Video object segmentation method and apparatus, electronic device, storage medium and program | |
CN113971751A (zh) | Method and apparatus for training a feature extraction model and detecting similar images | |
US9129152B2 (en) | Exemplar-based feature weighting | |
WO2022161302A1 (zh) | Action recognition method, apparatus, device, storage medium and computer program product | |
US20240153240A1 (en) | Image processing method, apparatus, computing device, and medium | |
CN113569740B (zh) | Video recognition model training method and apparatus, and video recognition method and apparatus | |
WO2022143366A1 (zh) | Image processing method, apparatus, electronic device, medium and computer program product | |
CN108154153B (zh) | Scene analysis method and system, and electronic device | |
WO2019170024A1 (zh) | Target tracking method and apparatus, electronic device and storage medium | |
US20200357137A1 (en) | Determining a Pose of an Object in the Surroundings of the Object by Means of Multi-Task Learning | |
US9081800B2 (en) | Object detection via visual search | |
CN108229320B (zh) | Frame selection method and apparatus, electronic device, program and medium |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18876911; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2020526040; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 20207016026; Country of ref document: KR; Kind code of ref document: A |
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.09.2020) |
122 | Ep: pct application non-entry in european phase | Ref document number: 18876911; Country of ref document: EP; Kind code of ref document: A1 |