CN108388879B - Target detection method, device and storage medium

Info

Publication number
CN108388879B
Authority
CN
China
Prior art keywords
target
detected
frame image
category
current frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810214503.7A
Other languages
Chinese (zh)
Other versions
CN108388879A (en)
Inventor
李朝辉
吴颖谦
蒋宗杰
张燕昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zebred Network Technology Co Ltd
Original Assignee
Zebred Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zebred Network Technology Co Ltd filed Critical Zebred Network Technology Co Ltd
Priority to CN201810214503.7A
Publication of CN108388879A
Application granted
Publication of CN108388879B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method, device and storage medium. The method comprises the following steps: performing initial detection to obtain a target to be detected in a current frame image of video data; matching the target to be detected with at least one target in the frame image preceding the current frame image; and, if a target matching the target to be detected exists in the preceding frame image, determining the category and position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image, wherein m is a positive integer. The target detection method, device and storage medium provided by the invention can both reduce the difficulty of detection and improve the accuracy of detection.

Description

Target detection method, device and storage medium
Technical Field
The present invention relates to image detection technologies, and in particular, to a method and an apparatus for detecting an object, and a storage medium.
Background
Driver-assistance systems for automobiles impose very strict accuracy requirements on the detection of objects such as vehicles and pedestrians. Current detection techniques are relatively accurate for rigid targets such as vehicles, traffic signs and lane lines, but their accuracy for non-rigid targets such as pedestrians or bicycles is lower.
At present, pedestrian detection is mainly performed on a single frame image of a video stream, using either a traditional feature extraction and classification method or a deep learning method such as a convolutional neural network. The traditional approach designs pedestrian features in advance and classifies them with a machine learning algorithm; for example, the histogram of oriented gradients (HOG) of the image is used as the feature and a support vector machine (SVM) performs binary classification, where the HOG feature is computed from the image gradients accumulated by direction and magnitude. Deep learning based methods learn features automatically through a convolutional neural network; currently popular methods mainly include Faster R-CNN, which extracts candidate boxes and performs a second-stage classification, the SSD (Single Shot MultiBox Detector) and YOLO algorithms based on multi-scale feature layers, and improved algorithms based on the Feature Pyramid Network (FPN), which builds on the image pyramid idea.
Because targets such as pedestrians can undergo various deformations, detecting them with the above approaches requires, in order to improve the detection accuracy, enlarging the data set to contain enough samples and increasing the model capacity to cover the possible deformations; this increases the detection difficulty, and the detection accuracy remains limited.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a device for detecting a target and a storage medium, which can not only reduce the detection difficulty, but also improve the detection accuracy.
In a first aspect, an embodiment of the present invention provides a method for detecting a target, including:
initially detecting to obtain a target to be detected in a current frame image in video data;
matching the target to be detected with at least one target in the previous frame image of the current frame image;
if the target matched with the target to be detected exists in the previous frame image, determining the category and the position information of the target to be detected according to the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images of the current frame image, wherein m is a positive integer.
Optionally, the matching the target to be detected with at least one target in a previous frame image of the current frame image includes:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
Optionally, the matching the candidate frame and the at least one target in the previous frame of image includes:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
Optionally, the calculating an intersection-to-parallel ratio IOU between each tracking frame and the candidate frame includes:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
Optionally, the determining the category and the position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame images of the current frame image respectively includes:
inputting the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network LRCN to obtain the position information of the target to be detected and the probability value of the target to be detected in each category;
selecting the category with the maximum probability value as an intermediate category;
and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Optionally, the determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image includes:
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
Optionally, before inputting the feature layer of the object to be detected in the current frame image and the feature layer in the previous m frame images into a long-term cyclic convolution network LRCN, the method further includes:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
the inputting the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame image into a long-term cyclic convolution network LRCN includes:
and inputting the feature layer with the preset size into the LRCN.
In a second aspect, an embodiment of the present invention provides an apparatus for detecting an object, including:
the detection module is used for initially detecting a target to be detected in a current frame image in the obtained video data;
the matching module is used for matching the target to be detected with at least one target in the previous frame image of the current frame image;
and the determining module is used for determining the category and the position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame images of the current frame image when the matching module matches that the target matched with the target to be detected exists in the previous frame image, wherein m is a positive integer.
Optionally, the matching module is specifically configured to:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
Optionally, the matching module is specifically configured to:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
Optionally, the matching module is specifically configured to:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
Optionally, the determining module is specifically configured to:
inputting the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network LRCN to obtain the position information of the target to be detected and the probability value of the target to be detected in each category;
selecting the category with the maximum probability value as an intermediate category;
and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Optionally, the determining module is specifically configured to:
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
Optionally, the determining module is specifically configured to:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
and inputting the feature layer with the preset size into the LRCN.
In a third aspect, an embodiment of the present invention provides a terminal device, including:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method in the first aspect.
According to the target detection method, device and storage medium provided by the invention, a target to be detected in the current frame image of video data is obtained through initial detection, the target to be detected is matched with at least one target in the frame image preceding the current frame image, and, if a target matching the target to be detected exists in the preceding frame image, the category and position information of the target to be detected are determined according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image. When determining the category and position information of the target to be detected in the current frame image, the terminal device first matches the target against the targets in the preceding frame image; after a successful match, the category and position information are determined jointly from the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images of the current frame image. This avoids detecting the target from only a single frame image, as in the prior art, and allows pose changes of the target to be detected across multiple frame images, so the detection difficulty can be reduced and the detection accuracy improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a first embodiment of a target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of candidate frame extraction;
FIG. 3 is a schematic flow chart of the LRCN algorithm;
FIG. 4 is a pedestrian time series flow diagram;
fig. 5 is a schematic structural diagram of a first embodiment of a target detection apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The target detection method provided by the embodiments of the invention can be applied to scenes in which target objects are detected in images, and in particular to non-rigid target detection scenes in which the posture of the target changes or various deformations occur. At present, detection of non-rigid targets such as pedestrians is mainly performed on a single frame image of a video stream, using either a traditional feature extraction and classification method or a deep learning method based on a convolutional neural network. However, since targets such as pedestrians may undergo various deformations, detecting them with the above methods requires enlarging the data set to include enough samples and increasing the model capacity to cover the possible deformations, which not only increases the detection difficulty but also leaves the detection accuracy limited.
In view of the above problems, an embodiment of the present invention provides a target detection method in which a target to be detected in the current frame image of video data is obtained through initial detection and matched with at least one target in the frame image preceding the current frame image; if a target matching the target to be detected exists in the preceding frame image, the category and position information of the target to be detected are determined according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image. When determining the category and position information of the target to be detected in the current frame image, the terminal device first matches the target against the targets in the preceding frame image; after a successful match, the category and position information are determined jointly from the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images of the current frame image. This avoids detecting the target from only a single frame image, as in the prior art, and allows pose changes of the target to be detected across multiple frame images, so the detection difficulty can be reduced and the detection accuracy improved.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 is a schematic flowchart of a first embodiment of a target detection method according to an embodiment of the present invention. The embodiment of the invention provides a target detection method, which can be executed by any device for executing the target detection method, and the device can be realized by software and/or hardware. In this embodiment, the apparatus may be integrated in a terminal device. As shown in fig. 1, the method for detecting a target provided in the embodiment of the present invention includes the following steps:
step 101, initially detecting to obtain a target to be detected in a current frame image in video data.
In this embodiment, a camera may collect video data in real time and send the collected video data to the terminal device. After receiving the video data, the terminal device obtains the current frame image from it and performs initial detection on the current frame image using a candidate frame extraction network (Region Proposal Network, RPN) to obtain the targets to be detected in the current frame image. The number of targets to be detected may be one or more. In this embodiment, the targets to be detected may include non-rigid objects such as pedestrians or bicycles.
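For illustration only, the initial-detection step can be approximated with an off-the-shelf detector; the patent itself uses an RPN whose feature layers are also stored for later use, so the following Python sketch is only a hypothetical stand-in, and the model choice, score threshold and COCO label ids are assumptions.

```python
import torch
import torchvision

# Hypothetical stand-in for the RPN-based initial detection: an off-the-shelf
# Faster R-CNN produces candidate boxes for the current frame image.
# The "weights" argument name depends on the torchvision version (assumption).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_candidates(frame_tensor, score_thresh=0.5, keep_labels=(1, 2)):
    """frame_tensor: float tensor (3, H, W) in [0, 1]; keep_labels 1=person, 2=bicycle (COCO).

    Returns candidate boxes as (x1, y1, x2, y2) rows for the targets to be detected."""
    with torch.no_grad():
        pred = model([frame_tensor])[0]
    keep = [i for i, (s, l) in enumerate(zip(pred["scores"].tolist(), pred["labels"].tolist()))
            if s > score_thresh and l in keep_labels]
    return pred["boxes"][keep]
```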
The terminal device may be, for example, a mobile phone, a tablet, a wearable device, or an in-vehicle device.
And 102, matching the target to be detected with at least one target in the previous frame image of the current frame image.
In this embodiment, each frame of image includes at least one target, and after acquiring the target to be detected in the current frame of image, the terminal device matches the target to be detected with at least one target in the previous frame of image of the current frame of image. It should be noted that, if there are a plurality of targets to be detected, each target to be detected may be respectively matched with at least one target in the previous frame image of the current frame image.
In a possible implementation manner, matching the target to be detected with at least one target in the previous frame image of the current frame image includes acquiring a candidate frame of the target to be detected in the current frame image, and matching the candidate frame with at least one target in the previous frame image.
Specifically, fig. 2 is a schematic diagram of candidate frame extraction. As shown in fig. 2, after the current frame image in the video data is acquired, a candidate frame extraction network (RPN) is used to extract candidate frames 1 from the current frame image, and the feature layer of the target to be detected computed by the RPN also needs to be stored. Each extracted candidate frame 1 contains one target to be detected.
After the candidate frame 1 is extracted, it is matched with the targets in the frame image preceding the current frame image. In the embodiment of the present invention, a tracking algorithm may be used for matching. In a specific implementation, the at least one target may be tracked into the current frame image to obtain the tracking frame of each target in the current frame image, the Intersection over Union (IOU) between each tracking frame and the candidate frame is calculated, and the target corresponding to a tracking frame whose IOU is greater than a preset threshold is determined to be successfully matched with the candidate frame.
Specifically, all the targets in the previous frame image may be tracked into the current frame by using a Kernelized Correlation Filter (KCF) algorithm, so as to obtain the tracking frames, in the current frame, of all the targets in the previous frame image. After the tracking frame of each target in the current frame image is computed, the IOU between each tracking frame and the candidate frame of the target to be detected is calculated.
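As a rough illustration, tracking the previous-frame targets into the current frame with KCF might look as follows using OpenCV's contrib tracking module; the factory name cv2.TrackerKCF_create and the (x, y, w, h) box format vary with the OpenCV version and are assumptions here.

```python
import cv2

def track_targets(prev_frame, prev_boxes, cur_frame):
    """Track each previous-frame box (x, y, w, h) into cur_frame with KCF; return tracking boxes."""
    tracking_boxes = []
    for box in prev_boxes:
        tracker = cv2.TrackerKCF_create()        # requires opencv-contrib-python (assumption)
        tracker.init(prev_frame, tuple(box))     # initialize on the previous frame image
        ok, tracked = tracker.update(cur_frame)  # locate the same target in the current frame image
        if ok:
            tracking_boxes.append(tracked)
    return tracking_boxes
```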
In one possible implementation, the IOU may be calculated according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), where TkBBox is the tracking frame and CandBox is the candidate frame; that is, the intersection between the tracking frame and the candidate frame is calculated first, then their union is calculated, and the ratio of the two gives the intersection-over-union IOU between the tracking frame and the candidate frame.
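For illustration, the IOU between a tracking frame and a candidate frame, both given as (x, y, w, h) axis-aligned boxes (the box format is an assumption), might be computed as follows.

```python
def iou(tk_bbox, cand_box):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1 = max(tk_bbox[0], cand_box[0])
    y1 = max(tk_bbox[1], cand_box[1])
    x2 = min(tk_bbox[0] + tk_bbox[2], cand_box[0] + cand_box[2])
    y2 = min(tk_bbox[1] + tk_bbox[3], cand_box[1] + cand_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)                        # TkBBox ∩ CandBox
    union = tk_bbox[2] * tk_bbox[3] + cand_box[2] * cand_box[3] - inter  # TkBBox ∪ CandBox
    return inter / union if union > 0 else 0.0
```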
After the IOU is calculated, it is judged whether the calculated IOU value is greater than a preset threshold. If it is, the target corresponding to that tracking frame is successfully matched with the candidate frame; otherwise, the match is unsuccessful. The preset threshold may be chosen according to actual conditions or experience, and its specific value is not limited here.
It should be noted that, if a candidate frame in the current frame is not successfully matched with any target in the previous frame image, it indicates that the target to be detected corresponding to the candidate frame may be a target that newly appears in the current frame, and at this time, the target to be detected may be marked as an initial frame. If a certain target in the previous frame image is not successfully matched with the candidate frame corresponding to any target in the current frame, it indicates that the target in the previous frame image has disappeared in the current frame, and at this time, the target will be discarded.
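Putting these steps together, the matching bookkeeping described above (match by IOU threshold, mark unmatched candidate frames as newly appearing targets, discard previous-frame targets with no match) might be sketched as follows, reusing the iou helper above; the greedy one-to-one assignment and the threshold value are assumptions.

```python
def match_candidates(tracking_boxes, candidate_boxes, iou_threshold=0.5):
    """Match current-frame candidate boxes to previous-frame tracking boxes by IOU."""
    matches = {}          # candidate index -> index of the matched previous-frame target
    used_tracks = set()
    for ci, cand in enumerate(candidate_boxes):
        best_ti, best_iou = None, iou_threshold
        for ti, tk in enumerate(tracking_boxes):
            if ti in used_tracks:
                continue
            score = iou(tk, cand)
            if score > best_iou:                 # only matches whose IOU exceeds the threshold
                best_ti, best_iou = ti, score
        if best_ti is not None:
            matches[ci] = best_ti
            used_tracks.add(best_ti)
    new_targets = [ci for ci in range(len(candidate_boxes)) if ci not in matches]      # marked as initial frame
    disappeared = [ti for ti in range(len(tracking_boxes)) if ti not in used_tracks]   # discarded
    return matches, new_targets, disappeared
```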
And 103, if the target matched with the target to be detected exists in the previous frame image, determining the category and the position information of the target to be detected according to the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images of the current frame image, wherein m is a positive integer.
In this embodiment, the terminal device may compute the feature layer of each target to be detected through the candidate frame extraction network (Region Proposal Network, RPN). If the terminal device finds that a target matching the target to be detected exists in the previous frame image, it obtains the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images, and determines the category and position information of the target to be detected according to the obtained feature layers.
In a possible implementation, determining the category and position information of the target to be detected according to its feature layer in the current frame image and its feature layers in the m frame images preceding the current frame image includes: inputting the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images into a Long-term Recurrent Convolutional Network (LRCN) to obtain the position information of the target to be detected and its probability value in each category; selecting the category with the maximum probability value as an intermediate category; and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Specifically, the terminal device may compute the feature layer of each target to be detected in the current frame image through the candidate frame extraction network (RPN); similarly, when each of the previous frame images was detected, the feature layer of the target to be detected in that frame image was also computed and stored.
When the terminal device determines that a target matching the target to be detected exists in the previous frame image, it indicates that the target to be detected appears in both the previous frame image and the current frame image. At this time, the stored convolutional feature layers of the target to be detected in the current frame image and in the previous m frame images are obtained and fed as input into a time-series network, for example an LRCN. The LRCN network is composed of a plurality of long short-term memory (LSTM) layers; each layer receives the feature input of the target in the corresponding frame, outputs the position information and category information of the target to be detected for that frame, and passes its state to the next layer.
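For illustration only (this is not the patented network), a per-target temporal head in this spirit might be sketched in PyTorch as follows, with the pooled per-frame features of one target fed through LSTM layers whose final output yields the box coordinates and per-category probabilities; the feature dimension, hidden size, number of layers and category set are assumptions.

```python
import torch
import torch.nn as nn

class TemporalDetectionHead(nn.Module):
    def __init__(self, feat_dim=7 * 7 * 512, hidden=256, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.bbox_head = nn.Linear(hidden, 4)           # position information (x, y, w, h)
        self.cls_head = nn.Linear(hidden, num_classes)  # e.g. background / pedestrian / bicycle / car

    def forward(self, feats):
        # feats: (batch, m + 1, feat_dim), ordered oldest frame first, current frame last
        out, _ = self.lstm(feats)
        last = out[:, -1]                               # hidden state at the current frame
        boxes = self.bbox_head(last)
        probs = torch.softmax(self.cls_head(last), dim=-1)
        return boxes, probs                             # probs.argmax(-1) gives the intermediate category
```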
Fig. 3 is a schematic flow diagram of the LRCN algorithm. As shown in fig. 3, after the tracking frames, in the current frame, of all the targets in the previous frame image are obtained by tracking with the KCF algorithm, the tracking frames are matched with the candidate frame of the target to be detected. If the matching succeeds, the CNN (Convolutional Neural Network) feature layer of the target to be detected in the current frame image and its CNN feature layers in the previous m frame images of the current frame image are obtained and passed as inputs to the LSTM network, thereby obtaining the position information of the target to be detected and its probability values in each category.
In this embodiment, m may be set according to an actual situation or experience, for example, may be set to 10, 15, and the like.
In addition, the number and types of the categories may be preset; for example, the categories may include background, pedestrian, bicycle, car, and so on. After the terminal device inputs the feature layers into the LRCN, the coordinate position of the target to be detected in the current frame image and its probability value in each category are obtained.
For example, if the current frame image is the 30th frame image, the feature layer of the target to be detected in the 30th frame image and its feature layers in the 20th to 29th frame images are input into the LRCN. This yields the coordinate position of the target to be detected in the current frame image as well as its probability values in the various categories, for example a probability of 0.1 of being background, 0.7 of being a pedestrian, 0.1 of being a bicycle, 0.1 of being a car, and so on.
After the probability values of the target to be detected in each category are determined, the category with the maximum probability value is selected as the intermediate category; in the above example, pedestrian is selected as the intermediate category.
Further, the probability value corresponding to the determined intermediate category is compared with the probability value of the category of the target to be detected in the previous frame image; if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, the intermediate category is determined as the category of the target to be detected in the current frame image; and if it is smaller, the category of the target to be detected in the previous frame image is determined as the category of the target to be detected in the current frame image.
Specifically, for each frame image the category of the target to be detected is determined in the above manner. Therefore, after the intermediate category is determined, the terminal device compares its probability value with the probability value of the category of the target to be detected in the previous frame image, and when the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, the intermediate category is determined as the category of the target to be detected in the current frame image. For example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is also pedestrian with probability value 0.6, the intermediate category pedestrian is determined as the category of the target to be detected in the current frame image. As another example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is bicycle with probability value 0.6, the intermediate category pedestrian is likewise determined as the category of the target to be detected in the current frame image.
In addition, if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, the category of the target to be detected in the previous frame image is determined as the category of the target to be detected in the current frame image. For example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is also pedestrian with probability value 0.8, pedestrian, the category of the target to be detected in the previous frame image, is determined as the category of the target to be detected in the current frame image. As another example, if the intermediate category is pedestrian with probability value 0.7, and the category of the target to be detected in the previous frame image is bicycle with probability value 0.8, bicycle, the category of the target to be detected in the previous frame image, is determined as the category of the target to be detected in the current frame image.
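The cross-frame rule described above can be summarised by a small helper; the function signature is illustrative, and how a probability value is carried forward to the next frame's comparison is not specified here.

```python
def smooth_category(intermediate_cat, intermediate_prob, prev_cat, prev_prob):
    """Return the category assigned to the target to be detected in the current frame image."""
    if intermediate_prob >= prev_prob:
        return intermediate_cat   # keep the category with the maximum LRCN probability
    return prev_cat               # fall back to the previous frame image's category
```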
Further, before inputting the feature layer of the object to be detected in the current frame image and the feature layer in the previous m frame images into the LRCN, the method further includes: and respectively carrying out scaling treatment on the characteristic layer of the target to be detected in the current frame image and the characteristic layer in the previous m frame images to obtain the characteristic layer with the preset size, so that the characteristic layer with the preset size is only required to be input into the LRCN.
Specifically, fig. 4 is a schematic diagram of the pedestrian time-sequence flow. As shown in fig. 4, the size of the target to be detected differs between frames, so before the features are input into the LRCN network, this embodiment adopts the region of interest (ROI) scaling operation from Faster R-CNN to scale the convolutional layer to a fixed size first. A specific implementation is as follows: assuming that the region of interest ROI has size H × W and the scaled feature size is h × w, the ROI is divided into an h × w grid, each grid cell having size H/h × W/w; max pooling is performed on each grid cell, finally producing a feature layer of size h × w.
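A minimal NumPy sketch of this fixed-size ROI max pooling is given below; the channel-first layout and the floor/ceil cell boundaries are assumptions for illustration.

```python
import numpy as np

def roi_max_pool(roi_feat, out_h, out_w):
    """roi_feat: array of shape (channels, H, W); returns an array of shape (channels, out_h, out_w)."""
    c, H, W = roi_feat.shape
    out = np.zeros((c, out_h, out_w), dtype=roi_feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # grid cell of roughly H/out_h x W/out_w pixels
            y0, y1 = int(np.floor(i * H / out_h)), int(np.ceil((i + 1) * H / out_h))
            x0, x1 = int(np.floor(j * W / out_w)), int(np.ceil((j + 1) * W / out_w))
            out[:, i, j] = roi_feat[:, y0:y1, x0:x1].max(axis=(1, 2))  # max pooling over the cell
    return out
```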
In addition, because one frame of image contains a plurality of targets, the convolution feature can be directly calculated for the whole image when the feature layer is calculated, and then the corresponding feature layer is taken out according to the coordinate and the size of the candidate frame of the target to be detected for ROI scaling processing.
Further, for training and detection, the target to be detected may also be used as a unit, specifically, for each target in each frame, the convolution feature corresponding to each frame is first calculated, then ROI scaling is performed to transform to a fixed size, and the result is transmitted to the LRCN network.
The target detection method provided by the embodiment of the invention obtains a target to be detected in the current frame image of video data through initial detection, matches the target to be detected with at least one target in the frame image preceding the current frame image, and, if a target matching the target to be detected exists in the preceding frame image, determines the category and position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and its feature layers in the m frame images preceding the current frame image. When determining the category and position information of the target to be detected in the current frame image, the terminal device first matches the target against the targets in the preceding frame image; after a successful match, the category and position information are determined jointly from the feature layer of the target to be detected in the current frame image and its feature layers in the previous m frame images of the current frame image. This avoids detecting the target from only a single frame image, as in the prior art, and allows pose changes of the target to be detected across multiple frame images, so the detection difficulty can be reduced and the detection accuracy improved.
Fig. 5 is a schematic structural diagram of a first embodiment of a target detection apparatus according to an embodiment of the present invention. The target detection device may be an independent terminal device, or may be a device integrated in a terminal device, and the device may be implemented by software, hardware, or a combination of software and hardware. As shown in fig. 5, the apparatus includes:
the detection module 11 is configured to initially detect a target to be detected in a current frame image in the obtained video data;
the matching module 12 is configured to match the target to be detected with at least one target in a previous frame image of the current frame image;
the determining module 13 is configured to determine the category and the position information of the target to be detected according to the feature layer of the target to be detected in the current frame image and the feature layer in the m frame image before the current frame image when the matching module matches that the target matched with the target to be detected exists in the previous frame image, where m is a positive integer.
The target detection device provided by the embodiment of the invention can execute the method embodiment, and the implementation principle and the technical effect are similar, so that the details are not repeated.
Optionally, the matching module 12 is specifically configured to:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
Optionally, the matching module 12 is specifically configured to:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
Optionally, the matching module 12 is specifically configured to:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
Optionally, the determining module 13 is specifically configured to:
inputting the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network LRCN to obtain the position information of the target to be detected and the probability value of the target to be detected in each category;
selecting the category with the maximum probability value as an intermediate category;
and determining the category of the target to be detected in the current frame image according to the probability value of the intermediate category and the probability value of the category of the target to be detected in the previous frame image.
Optionally, the determining module 13 is specifically configured to:
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
Optionally, the determining module 13 is specifically configured to:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
and inputting the feature layer with the preset size into the LRCN.
The target detection device provided by the embodiment of the invention can execute the method embodiment, and the implementation principle and the technical effect are similar, so that the details are not repeated.
Fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device may include a transmitter 60, a processor 61, a memory 62, a receiver 64, and at least one communication bus 63. The communication bus 63 is used to realize communication connection between the elements. The memory 62 may comprise a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, in which various computer programs may be stored for performing various processing functions and implementing the method steps of any of the preceding embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program enables a server to execute the method for detecting an object provided in any of the foregoing embodiments.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of detecting an object, comprising:
initially detecting to obtain a target to be detected in a current frame image in video data;
matching the target to be detected with at least one target in the previous frame image of the current frame image;
if the target matched with the target to be detected exists in the previous frame image, inputting a feature layer of the target to be detected in the current frame image and a feature layer of the target to be detected in the previous m frame images into a long-term cyclic convolution network (LRCN), obtaining position information of the target to be detected and probability values of the target to be detected in all categories, and selecting the category with the maximum probability value as an intermediate category, wherein m is a positive integer;
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
2. The method according to claim 1, wherein the matching the target to be detected with at least one target in a previous frame image of the current frame image comprises:
acquiring a candidate frame of the target to be detected in the current frame image;
and matching the candidate frame with at least one target in the previous frame image.
3. The method of claim 2, wherein matching the candidate frame with the at least one object in the previous frame of image comprises:
tracking the at least one target in the current frame image to obtain a tracking frame of each target in the current frame image;
calculating an intersection ratio IOU between each tracking frame and the candidate frame;
and determining that the target corresponding to the tracking frame of which the IOU is greater than a preset threshold value is successfully matched with the candidate frame.
4. The method of claim 3, wherein calculating the intersection-to-parallel ratio IOU between each of the tracking boxes and the candidate box comprises:
calculating the IOU according to the formula IOU = (TkBBox ∩ CandBox)/(TkBBox ∪ CandBox), wherein the TkBBox is the tracking box and the CandBox is the candidate box.
5. The method according to claim 1, wherein before inputting the feature layer of the object to be detected in the current frame image and the feature layer in the previous m frame images into a long-term cyclic convolution network LRCN, the method further comprises:
respectively carrying out scaling processing on the characteristic layer of the target to be detected in the current frame image and the characteristic layer of the target to be detected in the previous m frame images to obtain a characteristic layer with a preset size;
the inputting the feature layer of the target to be detected in the current frame image and the feature layer of the target to be detected in the previous m frame image into a long-term cyclic convolution network LRCN includes:
and inputting the feature layer with the preset size into the LRCN.
6. An apparatus for detecting an object, comprising:
the detection module is used for initially detecting a target to be detected in a current frame image in the obtained video data;
the matching module is used for matching the target to be detected with at least one target in the previous frame image of the current frame image;
a determining module, configured to, when the matching module matches that there is a target matching the target to be detected in the previous frame of image, input a feature layer of the target to be detected in the current frame of image and a feature layer of the previous m frames of image into a long-term cyclic convolution network LRCN, obtain position information of the target to be detected and probability values of the target to be detected in each category, and select a category with the highest probability value as an intermediate category, where m is a positive integer;
comparing the probability value corresponding to the middle category with the probability value of the category of the target to be detected in the previous frame of image;
if the probability value corresponding to the intermediate category is greater than or equal to the probability value of the category of the target to be detected in the previous frame image, determining the intermediate category as the category of the target to be detected in the current frame image;
and if the probability value corresponding to the intermediate category is smaller than the probability value of the category of the target to be detected in the previous frame image, determining the category of the target to be detected in the previous frame image as the category of the target to be detected in the current frame image.
7. A terminal device, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer-readable storage medium, characterized in that it stores a computer program that causes a terminal device to execute the method of any one of claims 1-5.
CN201810214503.7A 2018-03-15 2018-03-15 Target detection method, device and storage medium Active CN108388879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810214503.7A CN108388879B (en) 2018-03-15 2018-03-15 Target detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810214503.7A CN108388879B (en) 2018-03-15 2018-03-15 Target detection method, device and storage medium

Publications (2)

Publication Number Publication Date
CN108388879A CN108388879A (en) 2018-08-10
CN108388879B true CN108388879B (en) 2022-04-15

Family

ID=63067779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810214503.7A Active CN108388879B (en) 2018-03-15 2018-03-15 Target detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN108388879B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308469B (en) * 2018-09-21 2019-12-10 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN109658438A (en) * 2018-12-05 2019-04-19 王家万 Tracking, device and the storage medium of target are detected in video
CN109784173A (en) * 2018-12-14 2019-05-21 合肥阿巴赛信息科技有限公司 A kind of shop guest's on-line tracking of single camera
CN111325075B (en) * 2018-12-17 2023-11-07 北京华航无线电测量研究所 Video sequence target detection method
CN109903312B (en) * 2019-01-25 2021-04-30 北京工业大学 Football player running distance statistical method based on video multi-target tracking
CN111489284B (en) * 2019-01-29 2024-02-06 北京搜狗科技发展有限公司 Image processing method and device for image processing
CN109993091B (en) * 2019-03-25 2020-12-15 浙江大学 Monitoring video target detection method based on background elimination
CN110210304B (en) * 2019-04-29 2021-06-11 北京百度网讯科技有限公司 Method and system for target detection and tracking
CN110378381B (en) * 2019-06-17 2024-01-19 华为技术有限公司 Object detection method, device and computer storage medium
CN110246160B (en) * 2019-06-20 2022-12-06 腾讯科技(深圳)有限公司 Video target detection method, device, equipment and medium
CN112347817B (en) * 2019-08-08 2022-05-17 魔门塔(苏州)科技有限公司 Video target detection and tracking method and device
CN110619279B (en) * 2019-08-22 2023-03-17 天津大学 Road traffic sign instance segmentation method based on tracking
CN110517293A (en) 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN114641799A (en) * 2019-11-20 2022-06-17 Oppo广东移动通信有限公司 Object detection device, method and system
CN111126399B (en) * 2019-12-28 2022-07-26 苏州科达科技股份有限公司 Image detection method, device and equipment and readable storage medium
CN113065650B (en) * 2021-04-02 2023-11-17 中山大学 Multichannel neural network instance separation method based on long-term memory learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810696A (en) * 2012-11-15 2014-05-21 浙江大华技术股份有限公司 Method for detecting image of target object and device thereof
CN102982559A (en) * 2012-11-28 2013-03-20 大唐移动通信设备有限公司 Vehicle tracking method and system
EP2840528A2 (en) * 2013-08-20 2015-02-25 Ricoh Company, Ltd. Method and apparatus for tracking object
CN103940824A (en) * 2014-04-29 2014-07-23 长春工程学院 Air electric transmission line insulator detecting method
CN106296723A (en) * 2015-05-28 2017-01-04 展讯通信(天津)有限公司 Target location method for tracing and device
CN106127776A (en) * 2016-06-28 2016-11-16 北京工业大学 Based on multiple features space-time context robot target identification and motion decision method
CN106570490A (en) * 2016-11-15 2017-04-19 华南理工大学 Pedestrian real-time tracking method based on fast clustering
CN106707296A (en) * 2017-01-09 2017-05-24 华中科技大学 Dual-aperture photoelectric imaging system-based unmanned aerial vehicle detection and recognition method
CN106919918A (en) * 2017-02-27 2017-07-04 腾讯科技(上海)有限公司 A kind of face tracking method and device
CN106951841A (en) * 2017-03-09 2017-07-14 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of multi-object tracking method based on color and apart from cluster
CN107016357A (en) * 2017-03-23 2017-08-04 北京工业大学 A kind of video pedestrian detection method based on time-domain convolutional neural networks
CN107273800A (en) * 2017-05-17 2017-10-20 大连理工大学 A kind of action identification method of the convolution recurrent neural network based on attention mechanism
CN107563313A (en) * 2017-08-18 2018-01-09 北京航空航天大学 Multiple target pedestrian detection and tracking based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Visual Localisation and Individual Identification of Holstein Friesian Cattle via Deep Learning; William Andrew et al.; 2017 IEEE International Conference on Computer Vision Workshops; 2018-01-23; sections 3-6, Fig. 1, Fig. 5 *
Aerial Target Detection Based on Improved Faster R-CNN; Feng Xiaoyu et al.; https://t.cnki.net/kcms/detail/31.1252.O4.20180227.1700.008.html; 2018-02-27; pp. 1-9 *

Also Published As

Publication number Publication date
CN108388879A (en) 2018-08-10

Similar Documents

Publication Publication Date Title
CN108388879B (en) Target detection method, device and storage medium
CN108960211B (en) Multi-target human body posture detection method and system
US9965719B2 (en) Subcategory-aware convolutional neural networks for object detection
EP3338248B1 (en) Systems and methods for object tracking
EP3379460B1 (en) Quality measurement weighting of image objects
CN110020592B (en) Object detection model training method, device, computer equipment and storage medium
EP3295424B1 (en) Systems and methods for reducing a plurality of bounding regions
US8792722B2 (en) Hand gesture detection
US8750573B2 (en) Hand gesture detection
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
US8897575B2 (en) Multi-scale, perspective context, and cascade features for object detection
US20180114071A1 (en) Method for analysing media content
CN109389086B (en) Method and system for detecting unmanned aerial vehicle image target
CN108805016B (en) Head and shoulder area detection method and device
CN113284168A (en) Target tracking method and device, electronic equipment and storage medium
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
US11321945B2 (en) Video blocking region selection method and apparatus, electronic device, and system
EP2864933A1 (en) Method, apparatus and computer program product for human-face features extraction
CN107851192B (en) Apparatus and method for detecting face part and face
CN109858552B (en) Target detection method and device for fine-grained classification
CN110610123A (en) Multi-target vehicle detection method and device, electronic equipment and storage medium
CN109726621B (en) Pedestrian detection method, device and equipment
US20230069608A1 (en) Object Tracking Apparatus and Method
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN114898306A (en) Method and device for detecting target orientation and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant