CN108388879B - Target detection method, device and storage medium

Info

Publication number
CN108388879B
Authority
CN
China
Prior art keywords
target
detected
frame image
category
current frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810214503.7A
Other languages
Chinese (zh)
Other versions
CN108388879A (en)
Inventor
李朝辉
吴颖谦
蒋宗杰
张燕昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zebred Network Technology Co Ltd
Original Assignee
Zebred Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zebred Network Technology Co Ltd filed Critical Zebred Network Technology Co Ltd
Priority to CN201810214503.7A
Publication of CN108388879A
Application granted
Publication of CN108388879B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method, device and storage medium. The method comprises the following steps: performing initial detection to obtain a target to be detected in a current frame image of video data; matching the target to be detected with at least one target in the frame image preceding the current frame image; and, if a target matching the target to be detected exists in the preceding frame image, determining the category and position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image, wherein m is a positive integer. The target detection method, device and storage medium provided by the invention can both reduce the difficulty of detection and improve the accuracy of detection.

Description

Target detection method, device and storage medium
Technical Field
The present invention relates to image detection technologies, and in particular, to a method and an apparatus for detecting an object, and a storage medium.
Background
Driver-assistance systems for automobiles impose very strict accuracy requirements on the detection of objects such as vehicles and pedestrians. Current detection techniques are relatively accurate for rigid targets such as vehicles, traffic signs and lane lines, but their accuracy for non-rigid targets such as pedestrians or bicycles is lower.
At present, pedestrian detection is mainly performed on a single frame image of a video stream, using either a traditional feature extraction and classification method or a deep learning method such as a convolutional neural network. The traditional approach designs pedestrian features in advance and classifies them with a machine learning algorithm; for example, the histogram of oriented gradients (HOG) of the image is used as the feature and a support vector machine (SVM) performs binary classification, where the HOG feature is computed from the image gradients accumulated by direction and magnitude. Deep learning based methods learn features automatically through a convolutional neural network; currently popular methods mainly include Faster R-CNN, which extracts candidate boxes and performs a second-stage classification, the SSD (Single Shot MultiBox Detector) and YOLO algorithms based on multi-scale feature layers, and improved algorithms based on the Feature Pyramid Network (FPN), which builds on the image pyramid idea.
Because targets such as pedestrians can undergo various deformations, detecting them with the above approaches requires, in order to improve the detection accuracy, enlarging the data set to contain enough samples and increasing the model capacity to cover the possible deformations; this increases the detection difficulty, and the detection accuracy remains limited.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a device for detecting a target and a storage medium, which can not only reduce the detection difficulty, but also improve the detection accuracy.
In a first aspect, an embodiment of the present invention provides a method for detecting a target, including:
initially detecting to obtain a target to be detected in a current frame image in video data;
matching the target to be detected with at least one target in the previous frame image of the current frame image;
if the target matched with the target to be detected exists in the previous frame image, determining the category and the position information of the target to be detected according to the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images of the current frame image, wherein m is a positive integer.
Optionally, the matching the target to be detected with at least one target in a previous frame image of the current frame image includes:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
Optionally, the matching the candidate frame and the at least one target in the previous frame of image includes:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
Optionally, the calculating an intersection-to-parallel ratio IOU between each tracking frame and the candidate frame includes:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
Optionally, the determining the category and the position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame images of the current frame image respectively includes:
inputting the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network LRCN to obtain the position information of the target to be detected and the probability value of the target to be detected in each category;
selecting the category with the maximum probability value as an intermediate category;
and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Optionally, the determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image includes:
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
Optionally, before inputting the feature layer of the object to be detected in the current frame image and the feature layer in the previous m frame images into a long-term cyclic convolution network LRCN, the method further includes:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
the inputting the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame image into a long-term cyclic convolution network LRCN includes:
and inputting the feature layer with the preset size into the LRCN.
In a second aspect, an embodiment of the present invention provides an apparatus for detecting an object, including:
the detection module is used for initially detecting a target to be detected in a current frame image in the obtained video data;
the matching module is used for matching the target to be detected with at least one target in the previous frame image of the current frame image;
and the determining module is used for determining the category and the position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame images of the current frame image when the matching module matches that the target matched with the target to be detected exists in the previous frame image, wherein m is a positive integer.
Optionally, the matching module is specifically configured to:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
Optionally, the matching module is specifically configured to:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
Optionally, the matching module is specifically configured to:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
Optionally, the determining module is specifically configured to:
inputting the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network LRCN to obtain the position information of the target to be detected and the probability value of the target to be detected in each category;
selecting the category with the maximum probability value as an intermediate category;
and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Optionally, the determining module is specifically configured to:
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
Optionally, the determining module is specifically configured to:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
and inputting the feature layer with the preset size into the LRCN.
In a third aspect, an embodiment of the present invention provides a terminal device, including:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method in the first aspect.
According to the target detection method, device and storage medium provided by the invention, a target to be detected in the current frame image of video data is obtained through initial detection, the target to be detected is matched with at least one target in the frame image preceding the current frame image, and, if a target matching the target to be detected exists in the preceding frame image, the category and position information of the target to be detected are determined according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image. When determining the category and position information of the target to be detected in the current frame image, the terminal device first matches the target against the targets in the preceding frame image; after a successful match, the category and position information are determined jointly from the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images of the current frame image. This avoids detecting the target from only a single frame image, as in the prior art, and allows pose changes of the target to be detected across multiple frame images, so the detection difficulty can be reduced and the detection accuracy improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a first embodiment of a target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of candidate frame extraction;
FIG. 3 is a schematic flow chart of the LRCN algorithm;
FIG. 4 is a pedestrian time series flow diagram;
fig. 5 is a schematic structural diagram of a first embodiment of a target detection apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The target detection method provided by the embodiments of the invention can be applied to scenes in which target objects are detected in images, and in particular to non-rigid target detection scenes in which the posture of the target changes or various deformations occur. At present, detection of non-rigid targets such as pedestrians is mainly performed on a single frame image of a video stream, using either a traditional feature extraction and classification method or a deep learning method based on a convolutional neural network. However, since targets such as pedestrians may undergo various deformations, detecting them with the above methods requires enlarging the data set to include enough samples and increasing the model capacity to cover the possible deformations, which not only increases the detection difficulty but also leaves the detection accuracy limited.
In view of the above problems, an embodiment of the present invention provides a target detection method in which a target to be detected in the current frame image of video data is obtained through initial detection and matched with at least one target in the frame image preceding the current frame image; if a target matching the target to be detected exists in the preceding frame image, the category and position information of the target to be detected are determined according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image. When determining the category and position information of the target to be detected in the current frame image, the terminal device first matches the target against the targets in the preceding frame image; after a successful match, the category and position information are determined jointly from the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images of the current frame image. This avoids detecting the target from only a single frame image, as in the prior art, and allows pose changes of the target to be detected across multiple frame images, so the detection difficulty can be reduced and the detection accuracy improved.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 is a schematic flowchart of a first embodiment of a target detection method according to an embodiment of the present invention. The embodiment of the invention provides a target detection method, which can be executed by any device for executing the target detection method, and the device can be realized by software and/or hardware. In this embodiment, the apparatus may be integrated in a terminal device. As shown in fig. 1, the method for detecting a target provided in the embodiment of the present invention includes the following steps:
step 101, initially detecting to obtain a target to be detected in a current frame image in video data.
In this embodiment, a camera may collect video data in real time and send the collected video data to the terminal device. After receiving the video data, the terminal device obtains the current frame image from it and performs initial detection on the current frame image using a candidate frame extraction network (Region Proposal Network, RPN) to obtain the targets to be detected in the current frame image. The number of targets to be detected may be one or more. In this embodiment, the targets to be detected may include non-rigid objects such as pedestrians or bicycles.
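For illustration only, the initial-detection step can be approximated with an off-the-shelf detector; the patent itself uses an RPN whose feature layers are also stored for later use, so the following Python sketch is only a hypothetical stand-in, and the model choice, score threshold and COCO label ids are assumptions.

```python
import torch
import torchvision

# Hypothetical stand-in for the RPN-based initial detection: an off-the-shelf
# Faster R-CNN produces candidate boxes for the current frame image.
# The "weights" argument name depends on the torchvision version (assumption).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_candidates(frame_tensor, score_thresh=0.5, keep_labels=(1, 2)):
    """frame_tensor: float tensor (3, H, W) in [0, 1]; keep_labels 1=person, 2=bicycle (COCO).

    Returns candidate boxes as (x1, y1, x2, y2) rows for the targets to be detected."""
    with torch.no_grad():
        pred = model([frame_tensor])[0]
    keep = [i for i, (s, l) in enumerate(zip(pred["scores"].tolist(), pred["labels"].tolist()))
            if s > score_thresh and l in keep_labels]
    return pred["boxes"][keep]
```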
The terminal device may be, for example, a mobile phone, a tablet, a wearable device, or an in-vehicle device.
And 102, matching the target to be detected with at least one target in the previous frame image of the current frame image.
In this embodiment, each frame of image includes at least one target, and after acquiring the target to be detected in the current frame of image, the terminal device matches the target to be detected with at least one target in the previous frame of image of the current frame of image. It should be noted that, if there are a plurality of targets to be detected, each target to be detected may be respectively matched with at least one target in the previous frame image of the current frame image.
In a possible implementation manner, matching the target to be detected with at least one target in the previous frame image of the current frame image includes acquiring a candidate frame of the target to be detected in the current frame image, and matching the candidate frame with at least one target in the previous frame image.
Specifically, fig. 2 is a schematic diagram of candidate frame extraction. As shown in fig. 2, after the current frame image in the video data is acquired, a candidate frame extraction network (RPN) is used to extract candidate frames 1 from the current frame image, and the feature layer of the target to be detected computed by the RPN also needs to be stored. Each extracted candidate frame 1 contains one target to be detected.
After the candidate frame 1 is extracted, it is matched with the targets in the frame image preceding the current frame image. In the embodiment of the present invention, a tracking algorithm may be used for matching. In a specific implementation, the at least one target may be tracked into the current frame image to obtain the tracking frame of each target in the current frame image, the Intersection over Union (IOU) between each tracking frame and the candidate frame is calculated, and the target corresponding to a tracking frame whose IOU is greater than a preset threshold is determined to be successfully matched with the candidate frame.
Specifically, all the targets in the previous frame image may be tracked into the current frame by using a Kernelized Correlation Filter (KCF) algorithm, so as to obtain the tracking frames, in the current frame, of all the targets in the previous frame image. After the tracking frame of each target in the current frame image is computed, the IOU between each tracking frame and the candidate frame of the target to be detected is calculated.
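As a rough illustration, tracking the previous-frame targets into the current frame with KCF might look as follows using OpenCV's contrib tracking module; the factory name cv2.TrackerKCF_create and the (x, y, w, h) box format vary with the OpenCV version and are assumptions here.

```python
import cv2

def track_targets(prev_frame, prev_boxes, cur_frame):
    """Track each previous-frame box (x, y, w, h) into cur_frame with KCF; return tracking boxes."""
    tracking_boxes = []
    for box in prev_boxes:
        tracker = cv2.TrackerKCF_create()        # requires opencv-contrib-python (assumption)
        tracker.init(prev_frame, tuple(box))     # initialize on the previous frame image
        ok, tracked = tracker.update(cur_frame)  # locate the same target in the current frame image
        if ok:
            tracking_boxes.append(tracked)
    return tracking_boxes
```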
In one possible implementation, the IOU may be calculated according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), where TkBBox is the tracking frame and CandBox is the candidate frame; that is, the intersection between the tracking frame and the candidate frame is calculated first, then their union is calculated, and the ratio of the two gives the intersection-over-union IOU between the tracking frame and the candidate frame.
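For illustration, the IOU between a tracking frame and a candidate frame, both given as (x, y, w, h) axis-aligned boxes (the box format is an assumption), might be computed as follows.

```python
def iou(tk_bbox, cand_box):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1 = max(tk_bbox[0], cand_box[0])
    y1 = max(tk_bbox[1], cand_box[1])
    x2 = min(tk_bbox[0] + tk_bbox[2], cand_box[0] + cand_box[2])
    y2 = min(tk_bbox[1] + tk_bbox[3], cand_box[1] + cand_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)                        # TkBBox ∩ CandBox
    union = tk_bbox[2] * tk_bbox[3] + cand_box[2] * cand_box[3] - inter  # TkBBox ∪ CandBox
    return inter / union if union > 0 else 0.0
```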
After the IOU is calculated, it is judged whether the calculated IOU value is greater than a preset threshold. If it is, the target corresponding to that tracking frame is successfully matched with the candidate frame; otherwise, the match is unsuccessful. The preset threshold may be chosen according to actual conditions or experience, and its specific value is not limited here.
It should be noted that, if a candidate frame in the current frame is not successfully matched with any target in the previous frame image, it indicates that the target to be detected corresponding to the candidate frame may be a target that newly appears in the current frame, and at this time, the target to be detected may be marked as an initial frame. If a certain target in the previous frame image is not successfully matched with the candidate frame corresponding to any target in the current frame, it indicates that the target in the previous frame image has disappeared in the current frame, and at this time, the target will be discarded.
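Putting these steps together, the matching bookkeeping described above (match by IOU threshold, mark unmatched candidate frames as newly appearing targets, discard previous-frame targets with no match) might be sketched as follows, reusing the iou helper above; the greedy one-to-one assignment and the threshold value are assumptions.

```python
def match_candidates(tracking_boxes, candidate_boxes, iou_threshold=0.5):
    """Match current-frame candidate boxes to previous-frame tracking boxes by IOU."""
    matches = {}          # candidate index -> index of the matched previous-frame target
    used_tracks = set()
    for ci, cand in enumerate(candidate_boxes):
        best_ti, best_iou = None, iou_threshold
        for ti, tk in enumerate(tracking_boxes):
            if ti in used_tracks:
                continue
            score = iou(tk, cand)
            if score > best_iou:                 # only matches whose IOU exceeds the threshold
                best_ti, best_iou = ti, score
        if best_ti is not None:
            matches[ci] = best_ti
            used_tracks.add(best_ti)
    new_targets = [ci for ci in range(len(candidate_boxes)) if ci not in matches]      # marked as initial frame
    disappeared = [ti for ti in range(len(tracking_boxes)) if ti not in used_tracks]   # discarded
    return matches, new_targets, disappeared
```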
And 103, if the target matched with the target to be detected exists in the previous frame image, determining the category and the position information of the target to be detected according to the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images of the current frame image, wherein m is a positive integer.
In this embodiment, the terminal device may compute the feature layer of each target to be detected through the candidate frame extraction network (Region Proposal Network, RPN). If the terminal device finds that a target matching the target to be detected exists in the previous frame image, it obtains the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images, and determines the category and position information of the target to be detected according to the obtained feature layers.
In a possible implementation, determining the category and position information of the target to be detected according to its feature layer in the current frame image and its feature layers in the m frame images preceding the current frame image includes: inputting the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images into a Long-term Recurrent Convolutional Network (LRCN) to obtain the position information of the target to be detected and its probability value in each category; selecting the category with the maximum probability value as an intermediate category; and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Specifically, the terminal device may compute the feature layer of each target to be detected in the current frame image through the candidate frame extraction network (RPN); similarly, when each of the previous frame images was detected, the feature layer of the target to be detected in that frame image was also computed and stored.
When the terminal device determines that a target matching the target to be detected exists in the previous frame image, it indicates that the target to be detected appears in both the previous frame image and the current frame image. At this time, the stored convolutional feature layers of the target to be detected in the current frame image and in the previous m frame images are obtained and fed as input into a time-series network, for example an LRCN. The LRCN network is composed of a plurality of long short-term memory (LSTM) layers; each layer receives the feature input of the target in the corresponding frame, outputs the position information and category information of the target to be detected for that frame, and passes its state to the next layer.
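For illustration only (this is not the patented network), a per-target temporal head in this spirit might be sketched in PyTorch as follows, with the pooled per-frame features of one target fed through LSTM layers whose final output yields the box coordinates and per-category probabilities; the feature dimension, hidden size, number of layers and category set are assumptions.

```python
import torch
import torch.nn as nn

class TemporalDetectionHead(nn.Module):
    def __init__(self, feat_dim=7 * 7 * 512, hidden=256, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.bbox_head = nn.Linear(hidden, 4)           # position information (x, y, w, h)
        self.cls_head = nn.Linear(hidden, num_classes)  # e.g. background / pedestrian / bicycle / car

    def forward(self, feats):
        # feats: (batch, m + 1, feat_dim), ordered oldest frame first, current frame last
        out, _ = self.lstm(feats)
        last = out[:, -1]                               # hidden state at the current frame
        boxes = self.bbox_head(last)
        probs = torch.softmax(self.cls_head(last), dim=-1)
        return boxes, probs                             # probs.argmax(-1) gives the intermediate category
```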
Fig. 3 is a schematic flow diagram of the LRCN algorithm. As shown in fig. 3, after the tracking frames, in the current frame, of all the targets in the previous frame image are obtained by tracking with the KCF algorithm, the tracking frames are matched with the candidate frame of the target to be detected. If the matching succeeds, the CNN (Convolutional Neural Network) feature layer of the target to be detected in the current frame image and its CNN feature layers in the previous m frame images of the current frame image are obtained and passed as inputs to the LSTM network, thereby obtaining the position information of the target to be detected and its probability values in each category.
In this embodiment, m may be set according to an actual situation or experience, for example, may be set to 10, 15, and the like.
In addition, the number and types of the categories may be preset; for example, the categories may include background, pedestrian, bicycle, car, and so on. After the terminal device inputs the feature layers into the LRCN, the coordinate position of the target to be detected in the current frame image and its probability value in each category are obtained.
For example, if the current frame image is the 30th frame image, the feature layer of the target to be detected in the 30th frame image and its feature layers in the 20th to 29th frame images are input into the LRCN. This yields the coordinate position of the target to be detected in the current frame image as well as its probability values in the various categories, for example a probability of 0.1 of being background, 0.7 of being a pedestrian, 0.1 of being a bicycle, 0.1 of being a car, and so on.
After the probability values of the target to be detected in each category are determined, the category with the maximum probability value is selected as the intermediate category; in the above example, pedestrian is selected as the intermediate category.
Further, the probability value corresponding to the determined intermediate category is compared with the probability value of the category of the target to be detected in the previous frame image; if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, the intermediate category is determined as the category of the target to be detected in the current frame image; and if it is smaller, the category of the target to be detected in the previous frame image is determined as the category of the target to be detected in the current frame image.
Specifically, for each frame image the category of the target to be detected is determined in the above manner. Therefore, after the intermediate category is determined, the terminal device compares its probability value with the probability value of the category of the target to be detected in the previous frame image, and when the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, the intermediate category is determined as the category of the target to be detected in the current frame image. For example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is also pedestrian with probability value 0.6, the intermediate category pedestrian is determined as the category of the target to be detected in the current frame image. As another example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is bicycle with probability value 0.6, the intermediate category pedestrian is likewise determined as the category of the target to be detected in the current frame image.
In addition, if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, the category of the target to be detected in the previous frame image is determined as the category of the target to be detected in the current frame image. For example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is also pedestrian with probability value 0.8, pedestrian, the category of the target to be detected in the previous frame image, is determined as the category of the target to be detected in the current frame image. As another example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is bicycle with probability value 0.8, bicycle, the category of the target to be detected in the previous frame image, is determined as the category of the target to be detected in the current frame image.
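The cross-frame rule described above can be summarised by a small helper; the function signature is illustrative, and how a probability value is carried forward to the next frame's comparison is not specified here.

```python
def smooth_category(intermediate_cat, intermediate_prob, prev_cat, prev_prob):
    """Return the category assigned to the target to be detected in the current frame image."""
    if intermediate_prob >= prev_prob:
        return intermediate_cat   # keep the category with the maximum LRCN probability
    return prev_cat               # fall back to the previous frame image's category
```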
Further, before inputting the feature layer of the object to be detected in the current frame image and the feature layer in the previous m frame images into the LRCN, the method further includes: and respectively carrying out scaling treatment on the characteristic layer of the target to be detected in the current frame image and the characteristic layer in the previous m frame images to obtain the characteristic layer with the preset size, so that the characteristic layer with the preset size is only required to be input into the LRCN.
Specifically, fig. 4 is a schematic diagram of the pedestrian time-sequence flow. As shown in fig. 4, the size of the target to be detected differs between frames, so before the features are input into the LRCN network, this embodiment adopts the region of interest (ROI) scaling operation from Faster R-CNN to scale the convolutional layer to a fixed size first. A specific implementation is as follows: assuming that the region of interest ROI has size H × W and the scaled feature size is h × w, the ROI is divided into an h × w grid, each grid cell having size H/h × W/w; max pooling is performed on each grid cell, finally producing a feature layer of size h × w.
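A minimal NumPy sketch of this fixed-size ROI max pooling is given below; the channel-first layout and the floor/ceil cell boundaries are assumptions for illustration.

```python
import numpy as np

def roi_max_pool(roi_feat, out_h, out_w):
    """roi_feat: array of shape (channels, H, W); returns an array of shape (channels, out_h, out_w)."""
    c, H, W = roi_feat.shape
    out = np.zeros((c, out_h, out_w), dtype=roi_feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # grid cell of roughly H/out_h x W/out_w pixels
            y0, y1 = int(np.floor(i * H / out_h)), int(np.ceil((i + 1) * H / out_h))
            x0, x1 = int(np.floor(j * W / out_w)), int(np.ceil((j + 1) * W / out_w))
            out[:, i, j] = roi_feat[:, y0:y1, x0:x1].max(axis=(1, 2))  # max pooling over the cell
    return out
```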
In addition, because one frame of image contains a plurality of targets, the convolution feature can be directly calculated for the whole image when the feature layer is calculated, and then the corresponding feature layer is taken out according to the coordinate and the size of the candidate frame of the target to be detected for ROI scaling processing.
Further, for training and detection, the target to be detected may also be used as a unit, specifically, for each target in each frame, the convolution feature corresponding to each frame is first calculated, then ROI scaling is performed to transform to a fixed size, and the result is transmitted to the LRCN network.
The target detection method provided by the embodiment of the invention obtains a target to be detected in the current frame image of video data through initial detection, matches the target to be detected with at least one target in the frame image preceding the current frame image, and, if a target matching the target to be detected exists in the preceding frame image, determines the category and position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image. When determining the category and position information of the target to be detected in the current frame image, the terminal device first matches the target against the targets in the preceding frame image; after a successful match, the category and position information are determined jointly from the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images of the current frame image. This avoids detecting the target from only a single frame image, as in the prior art, and allows pose changes of the target to be detected across multiple frame images, so the detection difficulty can be reduced and the detection accuracy improved.
Fig. 5 is a schematic structural diagram of a first embodiment of a target detection apparatus according to an embodiment of the present invention. The target detection device may be an independent terminal device, or may be a device integrated in a terminal device, and the device may be implemented by software, hardware, or a combination of software and hardware. As shown in fig. 5, the apparatus includes:
the detection module 11 is configured to initially detect a target to be detected in a current frame image in the obtained video data;
the matching module 12 is configured to match the target to be detected with at least one target in a previous frame image of the current frame image;
the determining module 13 is configured to determine the category and the position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and the feature layer in the m frame image before the current frame image when the matching module matches that the target matched with the target to be detected exists in the previous frame image, where m is a positive integer.
The target detection device provided by the embodiment of the invention can execute the method embodiment, and the implementation principle and the technical effect are similar, so that the details are not repeated.
Optionally, the matching module 12 is specifically configured to:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
Optionally, the matching module 12 is specifically configured to:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
Optionally, the matching module 12 is specifically configured to:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
Optionally, the determining module 13 is specifically configured to:
inputting the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network LRCN to obtain the position information of the target to be detected and the probability value of the target to be detected in each category;
selecting the category with the maximum probability value as an intermediate category;
and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Optionally, the determining module 13 is specifically configured to:
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
Optionally, the determining module 13 is specifically configured to:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
and inputting the feature layer with the preset size into the LRCN.
The target detection device provided by the embodiment of the invention can execute the method embodiment, and the implementation principle and the technical effect are similar, so that the details are not repeated.
Fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device may include a transmitter 60, a processor 61, a memory 62, a receiver 64, and at least one communication bus 63. The communication bus 63 is used to realize communication connection between the elements. The memory 62 may comprise a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, in which various computer programs may be stored for performing various processing functions and implementing the method steps of any of the preceding embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program enables a server to execute the method for detecting an object provided in any of the foregoing embodiments.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of detecting an object, comprising:
initially detecting to obtain a target to be detected in a current frame image in video data;
matching the target to be detected with at least one target in the previous frame image of the current frame image;
if the target matched with the target to be detected exists in the previous frame image, inputting a feature layer of the target to be detected in the current frame image and a feature layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network (LRCN), obtaining position information of the target to be detected and probability values of the target to be detected in all categories, and selecting the category with the maximum probability value as an intermediate category, wherein m is a positive integer;
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
2. The method according to claim 1, wherein the matching the target to be detected with at least one target in a previous frame image of the current frame image comprises:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
3. The method of claim 2, wherein matching the candidate frame with the at least one object in the previous frame of image comprises:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
4. The method of claim 3, wherein calculating the intersection-to-parallel ratio IOU between each of the tracking boxes and the candidate box comprises:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
5. The method according to claim 1, wherein before inputting the feature layer of the object to be detected in the current frame image and the feature layer in the previous m frame images into a long-term cyclic convolution network LRCN, the method further comprises:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
the inputting the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame image into a long-term cyclic convolution network LRCN includes:
and inputting the feature layer with the preset size into the LRCN.
6. An apparatus for detecting an object, comprising:
the detection module is used for initially detecting a target to be detected in a current frame image in the obtained video data;
the matching module is used for matching the target to be detected with at least one target in the previous frame image of the current frame image;
a determining module, configured to, when the matching module matches that there is a target matching the target to be detected in the previous frame of image, input a feature layer of the target to be detected in the current frame of image and a feature layer of the previous m frames of image into a long-term cyclic convolution network LRCN, obtain position information of the target to be detected and probability values of the target to be detected in each category, and select a category with the highest probability value as an intermediate category, where m is a positive integer;
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
7. A terminal device, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer-readable storage medium, characterized in that it stores a computer program that causes a terminal device to execute the method of any one of claims 1-5.
CN201810214503.7A 2018-03-15 2018-03-15 Target detection method, device and storage medium Active CN108388879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810214503.7A CN108388879B (en) 2018-03-15 2018-03-15 Target detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810214503.7A CN108388879B (en) 2018-03-15 2018-03-15 Target detection method, device and storage medium

Publications (2)

Publication Number Publication Date
CN108388879A CN108388879A (en) 2018-08-10
CN108388879B true CN108388879B (en) 2022-04-15

Family

ID=63067779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810214503.7A Active CN108388879B (en) 2018-03-15 2018-03-15 Target detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN108388879B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308469B (en) * 2018-09-21 2019-12-10 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN109658438A (en) * 2018-12-05 2019-04-19 王家万 Tracking, device and the storage medium of target are detected in video
CN109784173A (en) * 2018-12-14 2019-05-21 合肥阿巴赛信息科技有限公司 A kind of shop guest's on-line tracking of single camera
CN111325075B (en) * 2018-12-17 2023-11-07 北京华航无线电测量研究所 Video sequence target detection method
CN109903312B (en) * 2019-01-25 2021-04-30 北京工业大学 Football player running distance statistical method based on video multi-target tracking
CN111489284B (en) * 2019-01-29 2024-02-06 北京搜狗科技发展有限公司 Image processing method and device for image processing
CN109993091B (en) * 2019-03-25 2020-12-15 浙江大学 Monitoring video target detection method based on background elimination
CN110210304B (en) * 2019-04-29 2021-06-11 北京百度网讯科技有限公司 Method and system for target detection and tracking
CN110378381B (en) * 2019-06-17 2024-01-19 华为技术有限公司 Object detection method, device and computer storage medium
CN110246160B (en) * 2019-06-20 2022-12-06 腾讯科技(深圳)有限公司 Video target detection method, device, equipment and medium
CN112347817B (en) * 2019-08-08 2022-05-17 魔门塔(苏州)科技有限公司 Video target detection and tracking method and device
CN110619279B (en) * 2019-08-22 2023-03-17 天津大学 Road traffic sign instance segmentation method based on tracking
CN110517293A (en) 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN114641799A (en) * 2019-11-20 2022-06-17 Oppo广东移动通信有限公司 Object detection device, method and system
CN111126399B (en) * 2019-12-28 2022-07-26 苏州科达科技股份有限公司 Image detection method, device and equipment and readable storage medium
CN113065650B (en) * 2021-04-02 2023-11-17 中山大学 Multichannel neural network instance separation method based on long-term memory learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810696A (en) * 2012-11-15 2014-05-21 浙江大华技术股份有限公司 Method for detecting image of target object and device thereof
CN102982559A (en) * 2012-11-28 2013-03-20 大唐移动通信设备有限公司 Vehicle tracking method and system
EP2840528A2 (en) * 2013-08-20 2015-02-25 Ricoh Company, Ltd. Method and apparatus for tracking object
CN103940824A (en) * 2014-04-29 2014-07-23 长春工程学院 Air electric transmission line insulator detecting method
CN106296723A (en) * 2015-05-28 2017-01-04 展讯通信(天津)有限公司 Target location method for tracing and device
CN106127776A (en) * 2016-06-28 2016-11-16 北京工业大学 Based on multiple features space-time context robot target identification and motion decision method
CN106570490A (en) * 2016-11-15 2017-04-19 华南理工大学 Pedestrian real-time tracking method based on fast clustering
CN106707296A (en) * 2017-01-09 2017-05-24 华中科技大学 Dual-aperture photoelectric imaging system-based unmanned aerial vehicle detection and recognition method
CN106919918A (en) * 2017-02-27 2017-07-04 腾讯科技(上海)有限公司 A kind of face tracking method and device
CN106951841A (en) * 2017-03-09 2017-07-14 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of multi-object tracking method based on color and apart from cluster
CN107016357A (en) * 2017-03-23 2017-08-04 北京工业大学 A kind of video pedestrian detection method based on time-domain convolutional neural networks
CN107273800A (en) * 2017-05-17 2017-10-20 大连理工大学 A kind of action identification method of the convolution recurrent neural network based on attention mechanism
CN107563313A (en) * 2017-08-18 2018-01-09 北京航空航天大学 Multiple target pedestrian detection and tracking based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Visual Localisation and Individual Identification of Holstein Friesian Cattle via Deep Learning; William Andrew et al.; 2017 IEEE International Conference on Computer Vision Workshops; 2018-01-23; sections 3-6, Fig. 1, Fig. 5 *
Aerial Target Detection Based on Improved Faster R-CNN; Feng Xiaoyu et al.; https://t.cnki.net/kcms/detail/31.1252.O4.20180227.1700.008.html; 2018-02-27; pp. 1-9 *

Also Published As

Publication number Publication date
CN108388879A (en) 2018-08-10

Similar Documents

Publication Publication Date Title
CN108388879B (en) Target detection method, device and storage medium
CN108960211B (en) Multi-target human body posture detection method and system
US9965719B2 (en) Subcategory-aware convolutional neural networks for object detection
EP3338248B1 (en) Systems and methods for object tracking
EP3379460B1 (en) Quality measurement weighting of image objects
CN110020592B (en) Object detection model training method, device, computer equipment and storage medium
EP3295424B1 (en) Systems and methods for reducing a plurality of bounding regions
US8792722B2 (en) Hand gesture detection
US8750573B2 (en) Hand gesture detection
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
US8897575B2 (en) Multi-scale, perspective context, and cascade features for object detection
US20180114071A1 (en) Method for analysing media content
CN109389086B (en) Method and system for detecting unmanned aerial vehicle image target
CN108805016B (en) Head and shoulder area detection method and device
CN113284168A (en) Target tracking method and device, electronic equipment and storage medium
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
US11321945B2 (en) Video blocking region selection method and apparatus, electronic device, and system
EP2864933A1 (en) Method, apparatus and computer program product for human-face features extraction
CN107851192B (en) Apparatus and method for detecting face part and face
CN109858552B (en) Target detection method and device for fine-grained classification
CN110610123A (en) Multi-target vehicle detection method and device, electronic equipment and storage medium
CN109726621B (en) Pedestrian detection method, device and equipment
US20230069608A1 (en) Object Tracking Apparatus and Method
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN114898306A (en) Method and device for detecting target orientation and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant