CN108256506A - Method and apparatus for object detection in video, and computer storage medium - Google Patents

Method and apparatus for object detection in video, and computer storage medium

Info

Publication number
CN108256506A
Authority
CN
China
Prior art keywords
frame
detection result
detection box
key frame
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810151829.XA
Other languages
Chinese (zh)
Other versions
CN108256506B (en)
Inventor
陈恺
汤晓鸥
王佳琦
杨硕
张行程
熊元骏
吕健勤
林达华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201810151829.XA
Publication of CN108256506A
Application granted
Publication of CN108256506B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a method for object detection in video. The method includes: determining several key frames based on a target video, and performing object detection on each key frame to obtain a detection result for each key frame; determining, according to the detection results of the key frames, a detection result for the intermediate frame between every two adjacent key frames; correcting the detection result of each intermediate frame to obtain a corrected detection result for each intermediate frame; and determining the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames. The invention further discloses an apparatus for object detection in video and a computer storage medium.

Description

Method and apparatus for object detection in video, and computer storage medium
Technical field
The present invention relates to object detection techniques in the field of computer vision, and in particular to a method and apparatus for object detection in video, and a computer storage medium.
Background art
Object detection in video is a major problem in the field of computer vision and a fundamental technique for intelligent video analysis. It has important applications in many areas, such as security surveillance, autonomous driving, and advanced video retrieval.
Object detection in video builds on still-image object detection, but the introduction of temporal information makes the problem more complex to model. Existing video object detection methods do not yet meet practical requirements in the trade-off between detection speed and accuracy: applying an object detector to every frame of a video consumes a large amount of time and is inefficient, whereas detecting only sparsely causes a substantial drop in detection performance.
Summary of the invention
In view of this, the present invention aims to provide a method and apparatus for object detection in video, and a computer storage medium, capable of achieving real-time object detection in video while maintaining relatively high accuracy.
To achieve the above objectives, the technical solution of the present invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for object detection in video, the method including:
determining several key frames based on a target video, and performing object detection on each key frame to obtain a detection result for each key frame;
determining, according to the detection results of the key frames, a detection result for the intermediate frame between every two adjacent key frames;
correcting the detection result of each intermediate frame to obtain a corrected detection result for each intermediate frame;
determining the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames.
In the above solution, optionally, after the detection result of the target video is determined, the method further includes:
concatenating detection boxes of the same class in every two adjacent frames according to their degree of spatial overlap to obtain object chains, each object chain spanning multiple frames and consisting of detection boxes of the same class;
reclassifying the detection boxes on each object chain, and obtaining a class confidence for each detection box.
In the above solution, optionally, determining several key frames based on the target video and performing object detection on each key frame to obtain the detection result of each key frame includes:
selecting multiple initial key frames at a preset time interval, and performing object detection on each initial key frame to obtain the spatial positions and class confidences of the detection boxes in each initial key frame;
matching the detection boxes of every two adjacent initial key frames based on spatial position and class confidence, and, in response to the matching degree of spatial position and class confidence being below a preset threshold, selecting a secondary key frame from among the frames between the two adjacent initial key frames and performing object detection on each secondary key frame to obtain the spatial positions and class confidences of the detection boxes in each secondary key frame;
wherein the key frames determined for the target video include only the initial key frames, or include both the initial key frames and the secondary key frames.
In the above solution, optionally, determining, according to the detection results of the key frames, the detection result of the intermediate frame between every two adjacent key frames includes:
for every two adjacent key frames, taking the left frame, the intermediate frame, and the frames between the left frame and the intermediate frame, computing a first motion history image (MHI, Motion History Image), extracting features from the first motion history image with a first neural network, predicting a first offset of each detection box from the left frame to the intermediate frame, and adding the first offset to the detection box of the left frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the left frame;
for every two adjacent key frames, taking the right frame, the intermediate frame, and the frames between the right frame and the intermediate frame, computing a second motion history image, extracting features from the second motion history image with the first neural network, predicting a second offset of each detection box from the right frame to the intermediate frame, and adding the second offset to the detection box of the right frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the right frame;
merging the result propagated from the left frame to the intermediate frame and the result propagated from the right frame to the intermediate frame as the detection result of the intermediate frame.
In the above solution, optionally, correcting the detection result of each intermediate frame to obtain the corrected detection result of each intermediate frame includes:
scaling the image and the detection result of the intermediate frame to a target scale, the target scale being larger than the current scale;
extracting features from the image with a second neural network, predicting an offset of the input box relative to its corresponding position in the image, and adding the offset to the input box as the spatial position obtained after correction at the target scale;
wherein the input box is a detection box of the intermediate frame.
In the above solution, optionally, determining the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames includes:
determining, based on the detection results of the key frames and the corrected detection results of the intermediate frames, the detection results of the other frames in the target video besides the key frames and the intermediate frames using a linear interpolation algorithm.
In the above solution, optionally, reclassifying the detection boxes on each object chain and obtaining the class confidence of each detection box includes:
selecting several detection boxes on each object chain at equal intervals, cropping the images corresponding to the selected detection boxes and scaling these images to the same size, and extracting features from each image of the same size with a third neural network and classifying them, to obtain the class confidence of each detection box on each object chain.
In a second aspect, an embodiment of the present invention provides an apparatus for object detection in video, the apparatus including:
a first determining module, configured to determine several key frames based on a target video;
a key frame detection module, configured to perform object detection on each key frame to obtain a detection result for each key frame;
a second determining module, configured to determine, according to the detection results of the key frames, a detection result for the intermediate frame between every two adjacent key frames;
a correction module, configured to correct the detection result of each intermediate frame to obtain a corrected detection result for each intermediate frame;
a third determining module, configured to determine the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames.
In the above solution, optionally, the apparatus further includes:
a reclassification module, configured to, after the third determining module determines the detection result of the target video, concatenate detection boxes of the same class in every two adjacent frames according to their degree of spatial overlap to obtain object chains, each object chain spanning multiple frames and consisting of detection boxes of the same class; and to reclassify the detection boxes on each object chain to obtain a class confidence for each detection box.
In the above solution, optionally, the first determining module is further configured to:
select multiple initial key frames at a preset time interval, and perform object detection on each initial key frame to obtain the spatial positions and class confidences of the detection boxes in each initial key frame;
match the detection boxes of every two adjacent initial key frames based on spatial position and class confidence;
in response to the matching degree of spatial position and class confidence being below a preset threshold, select a secondary key frame from among the frames between the two adjacent initial key frames, and perform object detection on each secondary key frame to obtain the spatial positions and class confidences of the detection boxes in each secondary key frame;
wherein the key frames determined for the target video include only the initial key frames, or include both the initial key frames and the secondary key frames.
In the above solution, optionally, the second determining module is configured to:
for every two adjacent key frames, take the left frame, the intermediate frame, and the frames between the left frame and the intermediate frame, compute a first motion history image (MHI), extract features from the first motion history image with the first neural network, predict a first offset of each detection box from the left frame to the intermediate frame, and add the first offset to the detection box of the left frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the left frame;
for every two adjacent key frames, take the right frame, the intermediate frame, and the frames between the right frame and the intermediate frame, compute a second motion history image, extract features from the second motion history image with the first neural network, predict a second offset of each detection box from the right frame to the intermediate frame, and add the second offset to the detection box of the right frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the right frame;
merge the result propagated from the left frame to the intermediate frame and the result propagated from the right frame to the intermediate frame as the detection result of the intermediate frame.
In the above solution, optionally, the correction module is further configured to:
scale the image and the detection result of the intermediate frame to a target scale, the target scale being larger than the current scale;
extract features from the image with the second neural network, predict an offset of the input box relative to its corresponding position in the image, and add the offset to the input box as the spatial position obtained after correction at the target scale;
wherein the input box is a detection box of the intermediate frame.
In a third aspect, an embodiment of the present invention provides a computer storage medium having a computer program stored therein, the computer program being used to perform the above-described method for object detection in video.
With the method and apparatus for object detection in video and the computer storage medium proposed by the embodiments of the present invention, several key frames are determined based on a target video, and object detection is performed on each key frame to obtain a detection result for each key frame; detection results for the intermediate frames between adjacent key frames are determined according to the detection results of the key frames; the detection result of each intermediate frame is corrected to obtain a corrected detection result; and the detection result of the target video is determined based on the detection results of the key frames and the corrected detection results of the intermediate frames. In this way, detection with a detector is performed only on the key frames, and the detector is not applied to the intermediate frames or to the other frames besides the key frames and intermediate frames; this both preserves the accuracy of the key-frame detection results and saves computation cost and time. The detection results of the intermediate frames are predicted from the detection results of the key frames, and correcting these results improves the accuracy of the predicted intermediate-frame detections. The detection results of the remaining frames in the target video are estimated from the detection results of the key frames and intermediate frames, which further saves computation cost and time. The technical solution of the present invention achieves a good balance between computation cost and detection performance, and enables real-time object detection in video while maintaining relatively high accuracy.
Description of the drawings
Fig. 1 is a schematic flowchart of a method for object detection in video provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the time-scale grid analysis provided by an embodiment of the present invention;
Fig. 3 is an example diagram of the detection framework provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for object detection in video provided by an embodiment of the present invention.
Detailed description
The technical solution of the present invention is further elaborated below with reference to the drawings and specific embodiments.
An embodiment of the present invention provides a method for object detection in video. As shown in Fig. 1, the method mainly includes the following steps.
Step 101: determine several key frames based on a target video, and perform object detection on each key frame to obtain a detection result for each key frame.
Here, the target video may be a real-time video or a historical video.
Here, the target video is captured by an image acquisition device such as a camera or video camera.
The key frames determined for the target video include only initial key frames, or include both initial key frames and secondary key frames.
As an optional embodiment, determining several key frames based on the target video includes:
selecting multiple initial key frames at a preset time interval, and performing object detection on each initial key frame to obtain the spatial positions and class confidences of the detection boxes in each initial key frame;
in response to the matching degree of spatial position and class confidence being below a preset threshold, selecting a secondary key frame from among the frames between the two adjacent initial key frames, and performing object detection on each secondary key frame to obtain the spatial positions and class confidences of the detection boxes in each secondary key frame;
wherein the key frames determined for the target video include only the initial key frames, or include both the initial key frames and the secondary key frames.
That is, if the matching degree between the detection boxes of two adjacent initial key frames is below a preset threshold, a secondary key frame is selected from among the frames between the two adjacent initial key frames; if the matching degree between the detection boxes of two adjacent initial key frames is greater than or equal to the preset threshold, no secondary key frame is selected from among the frames between them.
Here, the value of the preset time interval may be set or adjusted according to the required detection accuracy and/or detection speed.
Here, the detection result obtained by performing object detection on each key frame includes the spatial positions and class confidences of the detection boxes.
In general, object detection is performed on the key frames using an image-based object detector.
Here, whether a further matching-degree check is performed between a secondary key frame and its adjacent initial key frames after the secondary key frame is determined may also be set or adjusted according to the required detection accuracy and/or detection speed.
For example, suppose a target video has 121 frames in total and a series of initial key frames is selected every 24 frames; the selected initial key frames are then the 1st, 25th, 49th, 73rd, 97th, and 121st frames. Object detection is performed on this series of initial key frames with the image-based object detector, and the matching degrees are then computed between the detection results of the 1st and 25th frames, the 25th and 49th frames, the 49th and 73rd frames, the 73rd and 97th frames, and the 97th and 121st frames. Suppose only the matching degree between the detection results of the 1st and 25th frames and the matching degree between the detection results of the 73rd and 97th frames are below a certain threshold; then the 13th frame is determined as a secondary key frame between the 1st and 25th frames, the 85th frame is determined as a secondary key frame between the 73rd and 97th frames, and object detection is performed on this series of secondary key frames with the image-based object detector.
In practice, if the average of the frame indices of two adjacent initial key frames is not an integer, the frame corresponding to the integer value adjacent to and smaller than the average is determined as the secondary key frame.
For example, if two initial key frames are the 1st frame and the 24th frame, the average of 1 and 24 is 12.5; since 12 < 12.5 and 13 > 12.5, the 12th frame is determined as the secondary key frame between the 1st and 24th frames.
In a specific embodiment, determining several key frames based on the target video and performing object detection on each key frame to obtain the detection result of each key frame includes:
selecting multiple initial key frames at a preset time interval, and performing object detection on the initial key frames with an image-based object detector to obtain the spatial positions and class confidences of the detection boxes corresponding to each initial key frame;
matching the detection boxes of every two adjacent initial key frames based on spatial position and class confidence, and, in response to the matching degree of spatial position and class confidence being below a preset threshold, selecting a secondary key frame from among the frames between the two adjacent initial key frames and performing object detection on each secondary key frame to obtain the spatial positions and class confidences of the detection boxes in each secondary key frame.
That is, the initial key frames and the secondary key frames together serve as the key frames of the entire video, and the detection results on the key frames are obtained with the object detector.
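For illustration only, the following is a minimal sketch of this key-frame selection step in Python. The detector interface detect(frame) (returning (box, class, confidence) tuples), the IoU-based matching formula, and the threshold value are assumptions made for the sketch; the patent does not prescribe a particular matching function.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match_degree(dets_a, dets_b):
    """Crude matching degree between two key-frame results: mean best IoU
    over boxes of the same class (an assumption; the patent leaves this open)."""
    if not dets_a or not dets_b:
        return 0.0
    best_ious = []
    for box_a, cls_a, _ in dets_a:
        best = max((iou(box_a, box_b) for box_b, cls_b, _ in dets_b
                    if cls_b == cls_a), default=0.0)
        best_ious.append(best)
    return sum(best_ious) / len(best_ious)

def select_key_frames(frames, detect, interval=24, threshold=0.5):
    """Return {frame_index: detections} for initial and secondary key frames."""
    initial = list(range(0, len(frames), interval))
    results = {i: detect(frames[i]) for i in initial}
    for a, b in zip(initial, initial[1:]):
        if match_degree(results[a], results[b]) < threshold:
            mid = (a + b) // 2                  # rounds down, as in the example above
            results[mid] = detect(frames[mid])  # secondary key frame
    return results
```

A lower threshold inserts fewer secondary key frames and runs faster, while a higher threshold trades speed for accuracy, mirroring the speed/accuracy balance discussed above.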
Step 102: determine, according to the detection results of the key frames, the detection result of the intermediate frame between every two adjacent key frames.
As an optional embodiment, determining, according to the detection results of the key frames, the detection result of the intermediate frame between every two adjacent key frames includes:
for every two adjacent key frames, taking the left frame, the intermediate frame, and the frames between the left frame and the intermediate frame, computing a first motion history image (MHI, Motion History Image), extracting features from the first motion history image with a first neural network, predicting a first offset of each detection box from the left frame to the intermediate frame, and adding the first offset to the detection box of the left frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the left frame;
for every two adjacent key frames, taking the right frame, the intermediate frame, and the frames between the right frame and the intermediate frame, computing a second motion history image, extracting features from the second motion history image with the first neural network, predicting a second offset of each detection box from the right frame to the intermediate frame, and adding the second offset to the detection box of the right frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the right frame;
merging the result propagated from the left frame to the intermediate frame and the result propagated from the right frame to the intermediate frame as the detection result of the intermediate frame.
The first neural network is a neural network trained specifically on a first training set. Given the left and right key frames, their detection results, and the image of the intermediate frame as input, the first neural network can output the detection result of the intermediate frame. Here, the left frame and the right frame are taken from the two adjacent key frames, and the intermediate frame is a frame between the two adjacent key frames.
In this way, when the detection result of an intermediate frame is obtained, no image-based detector needs to be applied to the intermediate frame; the detection result of the intermediate frame can be predicted using only the detection results of the key frames obtained in step 101.
Continuing the example of the 121-frame target video, the determined key frames include the initial key frames (the 1st, 25th, 49th, 73rd, 97th, and 121st frames) and the secondary key frames (the 13th and 85th frames). The adjacent key frame pairs are then (1, 13), (13, 25), (25, 49), (49, 73), (73, 85), (85, 97), and (97, 121), and the intermediate frames between adjacent key frames are the 7th, 19th, 37th, 61st, 79th, 91st, and 109th frames. Taking the 1st frame and the 7th frame as an example of a pair of adjacent frames at the next propagation level, the intermediate frame between them is the 4th frame; the result propagated from the left frame (the 1st frame) to the 4th frame is denoted as the first detection result of the 4th frame, and the result propagated from the right frame (the 7th frame) to the 4th frame is denoted as the second detection result of the 4th frame. The predicted detection result of the 4th frame therefore includes the first detection result and the second detection result.
The first detection result includes the spatial positions and class confidences of detection boxes, and the second detection result likewise includes the spatial positions and class confidences of detection boxes.
From the left frame (the 1st frame), the intermediate frame (the 4th frame), and the frames between them (the 2nd and 3rd frames), an MHI is computed; features are extracted from the MHI with the first neural network, the offset of each detection box from the 1st frame to the 4th frame is predicted, and the offset is added to the detection box of the 1st frame as the spatial position of the detection box propagated to the 4th frame. The class confidence of the detection box of the 4th frame is the same as that of the detection box of the 1st frame.
Similarly, from the right frame (the 7th frame), the intermediate frame (the 4th frame), and the frames between them (the 5th and 6th frames), an MHI is computed; features are extracted from the MHI with the first neural network, the offset of each detection box from the 7th frame to the 4th frame is predicted, and the offset is added to the detection box of the 7th frame as the spatial position of the detection box propagated to the 4th frame. The class confidence of the detection box of the 4th frame is the same as that of the detection box of the 7th frame.
Taking as an example the 13th frame, the intermediate frame between the adjacent key frames at the 1st and 25th frames: the detection result of the 13th frame obtained by propagation from the 1st frame consists of detection boxes A with class confidences A', and the detection result of the 13th frame obtained by propagation from the 25th frame consists of detection boxes B with class confidences B'. If detection boxes A comprise three boxes a1, a2, and a3, class confidences A' comprise a1', a2', and a3', corresponding to a1, a2, and a3 respectively; if detection boxes B comprise two boxes b1 and b2, class confidences B' comprise b1' and b2', corresponding to b1 and b2 respectively. The detection result of the 13th frame then comprises the five boxes a1, a2, a3, b1, and b2, with corresponding class confidences a1', a2', a3', b1', and b2'.
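The bidirectional propagation of step 102 can be sketched as follows; the frame-difference formulation of the motion history image and the propagation_net(mhi, box) interface standing in for the first neural network are illustrative assumptions, not the trained network described in the patent.

```python
import numpy as np

def motion_history_image(gray_frames, decay=32):
    """Simple MHI over a short clip of grayscale frames (assumed formulation):
    recently changed pixels are bright, older motion fades by `decay`."""
    mhi = np.zeros_like(gray_frames[0], dtype=np.float32)
    for prev, cur in zip(gray_frames, gray_frames[1:]):
        moving = np.abs(cur.astype(np.float32) - prev.astype(np.float32)) > 15
        mhi = np.where(moving, 255.0, np.maximum(mhi - decay, 0.0))
    return mhi

def propagate(key_dets, key_idx, mid_idx, gray_frames, propagation_net):
    """Shift each key-frame box by the offset predicted from the MHI
    spanning the key frame and the intermediate frame."""
    lo, hi = sorted((key_idx, mid_idx))
    mhi = motion_history_image(gray_frames[lo:hi + 1])
    propagated = []
    for box, cls, conf in key_dets:
        dx1, dy1, dx2, dy2 = propagation_net(mhi, box)   # assumed interface
        moved = (box[0] + dx1, box[1] + dy1, box[2] + dx2, box[3] + dy2)
        propagated.append((moved, cls, conf))            # confidence is copied over
    return propagated

def intermediate_result(left, right, mid, results, gray_frames, propagation_net):
    """Union of boxes propagated from the left and from the right key frame."""
    return (propagate(results[left], left, mid, gray_frames, propagation_net)
            + propagate(results[right], right, mid, gray_frames, propagation_net))
```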
Step 103: correct the detection result of each intermediate frame to obtain a corrected detection result for each intermediate frame.
In this way, by correcting the detection results of the intermediate frames, the detection results of intermediate frames that are predicted rather than obtained by a detector become more accurate, while computation cost is still saved.
As an optional embodiment, correcting the detection result of each intermediate frame to obtain the corrected detection result of each intermediate frame includes:
scaling the image and the detection result of the intermediate frame to a target scale, the target scale being larger than the current scale;
extracting features from the image with a second neural network, predicting an offset of the input box relative to its corresponding position in the image, and adding the offset to the input box as the spatial position obtained after correction at the target scale;
wherein the input box is a detection box of the intermediate frame.
Specifically, if the detection result of an intermediate frame is obtained with the first neural network at a first scale, the correction of that detection result with the second neural network is performed at a second scale, where the first scale is smaller than the second scale. Here, scale can be understood as the resolution of the image. That is, along the scale dimension, the spatial positions of the boxes are corrected step by step from low resolution to high resolution.
Continuing the example where the adjacent key frames are the 1st and 7th frames, for the 4th frame between them: when the detection result of the 4th frame is corrected with the second neural network, the input of the second neural network is the image of the 4th frame and the detection result of the 4th frame, and the output of the second neural network is the corrected detection result of the 4th frame.
Here, the detection result of the 4th frame includes the spatial positions and class confidences of detection boxes; when the detection result of the 4th frame is corrected by the second neural network, only the spatial positions of the detection boxes need to be corrected.
The second neural network is a neural network trained specifically on a second training set. Given the detection result of an intermediate frame and the image of the intermediate frame as input, the second neural network can output the corrected detection result of the intermediate frame.
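A sketch of this scale-dimension correction might look like the following; the upscaling factor, the refine_net(image, box) interface standing in for the second neural network, the use of OpenCV for resizing, and the mapping back to the original coordinate scale are all assumptions made for illustration.

```python
import cv2  # assumed available for resizing

def refine_boxes(image, dets, refine_net, scale=2.0):
    """Re-localize propagated boxes at a larger target scale.
    Only spatial positions change; classes and confidences are kept."""
    big = cv2.resize(image, None, fx=scale, fy=scale)
    refined = []
    for box, cls, conf in dets:
        scaled = tuple(c * scale for c in box)            # box at the target scale
        dx1, dy1, dx2, dy2 = refine_net(big, scaled)      # assumed interface
        corrected = (scaled[0] + dx1, scaled[1] + dy1,
                     scaled[2] + dx2, scaled[3] + dy2)
        # map back to the original scale so all frames share one coordinate system
        refined.append((tuple(c / scale for c in corrected), cls, conf))
    return refined
```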
After steps 102 and 103 are carried out for several levels, a detection result can be obtained for every frame.
Here, how many levels are performed may be set or adjusted according to the required detection accuracy and/or detection speed.
Since no detector is used when determining the detection results of the intermediate frames, this is faster than applying a detector directly and saves time.
Suppose no secondary key frame is inserted between the adjacent key frames at the 1st and 25th frames. Then the 13th frame, between the 1st and 25th frames, is taken as the first-level intermediate frame, and the 7th frame, between the 1st and 13th frames, and the 19th frame, between the 13th and 25th frames, are taken as second-level intermediate frames. Considering the balance between time and accuracy, after the detection results of the 13th, 7th, and 19th frames are obtained, the above method is not applied again to solve for the detection results of the frames among the 1st to 25th frames other than the 1st, 25th, 13th, 7th, and 19th frames; instead, a linear interpolation algorithm is used to obtain the results of those frames quickly.
Step 104: determine the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames.
As an optional embodiment, determining the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames includes:
determining, based on the detection results of the key frames and the corrected detection results of the intermediate frames, the detection results of the other frames in the target video besides the key frames and the intermediate frames using a linear interpolation algorithm.
In this way, while the accuracy of the detection results of the important frames (the key frames and the intermediate frames) is ensured, the detection results of the other frames in the target video besides the key frames and intermediate frames can also be estimated; this balances time and accuracy and enables real-time object detection in video while maintaining high accuracy.
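The linear interpolation used for the remaining frames can be sketched as below; it assumes the detection boxes of the two anchor frames have already been put into one-to-one correspondence (for example by the same class-and-overlap matching used elsewhere in the method), which is a simplifying assumption of the sketch.

```python
def interpolate_boxes(dets_a, dets_b, idx_a, idx_b, idx_t):
    """Linearly interpolate matched boxes (and confidences) between two
    frames whose detection results are already known."""
    w = (idx_t - idx_a) / float(idx_b - idx_a)
    out = []
    for (box_a, cls, conf_a), (box_b, _, conf_b) in zip(dets_a, dets_b):
        box = tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b))
        out.append((box, cls, (1 - w) * conf_a + w * conf_b))
    return out

def fill_gaps(results):
    """Fill in every frame between two frames with known results.
    `results` maps frame index -> detections for key and intermediate frames."""
    known = sorted(results)
    for a, b in zip(known, known[1:]):
        for t in range(a + 1, b):
            results[t] = interpolate_boxes(results[a], results[b], a, b, t)
    return results
```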
Further, after step 104, the method may also include:
Step 105 (not shown in Fig. 1): concatenate detection boxes of the same class in every two adjacent frames according to their degree of spatial overlap to obtain object chains, each object chain spanning multiple frames and consisting of detection boxes of the same class; reclassify the detection boxes on each object chain, and obtain a class confidence for each detection box.
That is, detection boxes of the same class in two adjacent frames are linked together according to their degree of spatial overlap, eventually forming chains that span multiple frames and consist of detection boxes of the same class; the detection boxes on each chain are then reclassified, and the class confidences of the detection boxes on each chain are determined.
Here, class refers to the category of the object, such as human, animal, or vehicle. The classes may be set according to a common standard or according to user requirements.
Here, linking by degree of spatial overlap means connecting the two closest detection boxes in two adjacent frames.
For example, suppose the first frame has four boxes, denoted box 1, box 2, box 3, and box 4, and the second frame has four boxes, denoted box 1', box 2', box 3', and box 4'. If box 1 and box 1' are closest, they are linked to form a first chain; if box 2 and box 2' are closest, they are linked to form a second chain; if box 3 and box 3' are closest, they are linked to form a third chain; if box 4 and box 4' are closest, they are linked to form a fourth chain. In practice, the object classes corresponding to the first, second, third, and fourth chains may be the same or may differ. For example, the objects on all four chains may be people; or the object on the first chain may be a person, the object on the second chain a dog, the object on the third chain a tree, and the object on the fourth chain a vehicle.
As an optional embodiment, reclassifying the detection boxes on each object chain and obtaining the class confidence of each detection box includes:
selecting several detection boxes on each object chain at equal intervals, cropping the images corresponding to the selected detection boxes and scaling these images to the same size, and extracting features from each image of the same size with a third neural network and classifying them, to obtain the class confidence of each detection box on each object chain.
Here, the interval may be set or adjusted according to the length of the chain. This interval may be the same as or different from the time interval used when selecting the initial key frames.
Here, selecting several boxes can be understood as selecting one box at every fixed interval.
Suppose the length of a chain is 30 frames; then one box is selected every 6 frames, and the selected boxes may be, among the 30 frames, the boxes on this chain corresponding to the 1st, 7th, 13th, 19th, and 25th frames.
In fact, on a single chain each frame corresponds to exactly one detection box; a single frame image never contributes two or more boxes to the same chain.
In practice, if the ground-truth image contains one person, one dog, and one vehicle but the detections are two people, two dogs, and one vehicle, and the two boxes detected at the position of the true person overlap, the box with the higher class confidence is retained.
The reclassification here reconfirms the object in each box and the class confidence of the box, but does not re-determine the spatial position of the box.
For each class: if, in the 1st to 25th frames, every frame contains a cat, a dog, a vehicle, and a person, then one chain for the cat, one chain for the dog, one chain for the person, and one chain for the vehicle may be obtained.
For example, suppose that according to steps 101 to 104 the object in a box in the 1st to 25th frames is confirmed to be a cat with confidence 0.5; after reclassification in step 105, the result may be that the object in that box in the 1st to 25th frames is a cat with confidence 0.8, or that the object in that box is a dog with confidence 0.7.
The third neural network is a neural network trained specifically on a third training set. Given a chain consisting of detection boxes of the same class as input, the third neural network can output the class confidences of the detection boxes on the chain.
In this way, by reclassifying the class confidences in the detection result of each frame, the accuracy of the class confidences can be improved, thereby further improving the accuracy of the detection result of each frame.
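Step 105 can be sketched as follows; the greedy best-IoU linking rule, the sampling stride, the crop size, and the classifier(crops) interface standing in for the third neural network are all illustrative assumptions.

```python
import cv2

def link_chains(per_frame_dets, iou_fn, min_iou=0.3):
    """Greedily extend chains frame by frame with the best same-class overlap.
    (Objects appearing after the first frame would need new chains; omitted here.)"""
    chains = [[(0, d)] for d in per_frame_dets[0]]
    for t in range(1, len(per_frame_dets)):
        for chain in chains:
            _, (box, cls, _) = chain[-1]
            candidates = [d for d in per_frame_dets[t] if d[1] == cls]
            if not candidates:
                continue
            best = max(candidates, key=lambda d: iou_fn(box, d[0]))
            if iou_fn(box, best[0]) >= min_iou:
                chain.append((t, best))
    return chains

def rescore_chain(chain, frames, classifier, stride=6, crop_size=(224, 224)):
    """Reclassify evenly sampled crops from one chain and apply the new
    class confidence to every box on the chain (a simplification)."""
    crops = []
    for t, (box, _, _) in chain[::stride]:
        x1, y1, x2, y2 = map(int, box)
        crops.append(cv2.resize(frames[t][y1:y2, x1:x2], crop_size))
    cls, conf = classifier(crops)   # assumed: returns a class and its confidence
    return [(t, (box, cls, conf)) for t, (box, _, _) in chain]
```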
For a target video, if a node is understood as a frame, the video has multiple input nodes and multiple output nodes. Taking a target video with 600 frames as an example, if 50 key frames are selected in total, the video has 50 input nodes and 600 output nodes.
Compared with existing methods, which usually optimize in only one dimension (time or scale) without joint modeling, the present application proposes a new video object detection framework that performs joint modeling and analysis in the two dimensions of time and scale; specifically, a grid-like progressive analysis is carried out in the two dimensions of time and scale.
Fig. 2 is a schematic diagram of the time-scale grid analysis. As shown in Fig. 2, the video object detection framework proposed by the present invention models object detection as a directed acyclic graph in a two-dimensional time-scale space: the horizontal axis is the time dimension, with time increasing from left to right, and the vertical axis is the scale dimension, with image resolution increasing from top to bottom. Each node in Fig. 2 is the detection result at a certain time point and a certain scale, and each directed edge is an operation; the object detection process starts from the sparse nodes at the top and, via a series of paths, reaches the dense nodes at the bottom. In the time dimension, motion history images (MHI) are used as input to propagate detection results to other frames; in the scale dimension, the spatial positions of the boxes are corrected level by level from low resolution to high resolution. Through this grid-like propagation and correction path, the detection result of every frame is finally obtained at high resolution.
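Putting the pieces together, the time-scale grid of Fig. 2 might be traversed as in the sketch below, which reuses the helper functions from the earlier sketches (key-frame selection, propagation, refinement, and interpolation); the recursion depth and the exact schedule are assumptions layered on those sketches rather than the patent's prescribed procedure.

```python
def detect_video(frames, gray_frames, detect, propagation_net, refine_net,
                 interval=24, levels=2):
    """End-to-end sketch combining the earlier helpers: key-frame detection,
    recursive propagate-and-refine, then linear interpolation."""
    results = select_key_frames(frames, detect, interval)

    def expand(left, right, depth):
        if depth == 0 or right - left < 2:
            return
        mid = (left + right) // 2
        dets = intermediate_result(left, right, mid, results,
                                   gray_frames, propagation_net)    # module T
        results[mid] = refine_boxes(frames[mid], dets, refine_net)  # module S
        expand(left, mid, depth - 1)
        expand(mid, right, depth - 1)

    keys = sorted(results)
    for a, b in zip(keys, keys[1:]):
        expand(a, b, levels)
    return fill_gaps(results)
```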
Fig. 3 is an example diagram of the detection framework, in which T denotes the temporal propagation module and S denotes the spatial position correction module. In the figure, for the frame image corresponding to time t (assumed to be a key frame, corresponding to the 1st frame), four boxes are obtained with the detector, and the object in each box is a person; for the frame image corresponding to time t+4x (assumed to be a key frame, corresponding to the 25th frame), four boxes are obtained with the detector, the objects in three boxes being people and the object in one box being a vehicle; for the frame image corresponding to time t+2x (assumed to be an intermediate frame, corresponding to the 13th frame), no detector is used; instead, through time-scale propagation and correction, eight boxes are obtained, including four boxes propagated from the 1st frame to the 13th frame and four boxes propagated from the 25th frame to the 13th frame.
The method for object detection in video proposed by the embodiments of the present invention introduces a new video object detection framework that performs grid-like progressive analysis in the two dimensions of time and scale. Under the proposed framework, an efficient temporal propagation module is designed: in the time dimension, motion history images (MHI) are used as input to propagate detection results to other frames; in the scale dimension, the spatial positions of the boxes are corrected step by step from low resolution to high resolution. Through this grid-like propagation and correction path, the detection result of every frame is finally obtained at high resolution. The technical solution of the present invention achieves a good balance between computation cost and detection performance and enables real-time object detection in video while maintaining relatively high accuracy.
With the technical solution of the present invention, surveillance video can be analyzed in real time to detect objects of interest, and the video stream of an onboard camera can also be analyzed in real time to detect objects such as pedestrians and vehicles on the road ahead for vision-based driver assistance.
Corresponding to the above method for object detection in video, this embodiment provides an apparatus for object detection in video. As shown in Fig. 4, the apparatus includes:
a first determining module 10, configured to determine several key frames based on a target video;
a key frame detection module 20, configured to perform object detection on each key frame to obtain a detection result for each key frame;
a second determining module 30, configured to determine, according to the detection results of the key frames, a detection result for the intermediate frame between every two adjacent key frames;
a correction module 40, configured to correct the detection result of each intermediate frame to obtain a corrected detection result for each intermediate frame;
a third determining module 50, configured to determine the detection result of the target video based on the detection results of the key frames and the corrected detection results of the intermediate frames.
Further, the apparatus also includes:
a reclassification module 60, configured to, after the third determining module 50 determines the detection result of the target video, concatenate detection boxes of the same class in every two adjacent frames according to their degree of spatial overlap to obtain object chains, each object chain spanning multiple frames and consisting of detection boxes of the same class; and to reclassify the detection boxes on each object chain to obtain a class confidence for each detection box.
In one embodiment, the first determining module 10 is further configured to:
select multiple initial key frames at a preset time interval, and perform object detection on each initial key frame to obtain the spatial positions and class confidences of the detection boxes in each initial key frame;
match the detection boxes of every two adjacent initial key frames based on spatial position and class confidence;
in response to the matching degree of spatial position and class confidence being below a preset threshold, select a secondary key frame from among the frames between the two adjacent initial key frames, and perform object detection on each secondary key frame to obtain the spatial positions and class confidences of the detection boxes in each secondary key frame;
wherein the key frames determined for the target video include only the initial key frames, or include both the initial key frames and the secondary key frames.
In one embodiment, the second determining module 30 is configured to:
for every two adjacent key frames, take the left frame, the intermediate frame, and the frames between the left frame and the intermediate frame, compute a first motion history image (MHI), extract features from the first motion history image with the first neural network, predict a first offset of each detection box from the left frame to the intermediate frame, and add the first offset to the detection box of the left frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the left frame;
for every two adjacent key frames, take the right frame, the intermediate frame, and the frames between the right frame and the intermediate frame, compute a second motion history image, extract features from the second motion history image with the first neural network, predict a second offset of each detection box from the right frame to the intermediate frame, and add the second offset to the detection box of the right frame as the spatial position of the detection box propagated to the intermediate frame, the class confidence of the detection box of the intermediate frame being the same as that of the detection box of the right frame;
merge the result propagated from the left frame to the intermediate frame and the result propagated from the right frame to the intermediate frame as the detection result of the intermediate frame.
In one embodiment, the correction module 40 is further configured to:
scale the image and the detection result of the intermediate frame to a target scale, the target scale being larger than the current scale;
extract features from the image with the second neural network, predict an offset of the input box relative to its corresponding position in the image, and add the offset to the input box as the spatial position obtained after correction at the target scale;
wherein the input box is a detection box of the intermediate frame.
In one embodiment, the third determining module 50 is further configured to:
determine, based on the detection results of the key frames and the corrected detection results of the intermediate frames, the detection results of the other frames in the target video besides the key frames and the intermediate frames using a linear interpolation algorithm.
In one embodiment, the reclassification module 60 is further configured to:
select several detection boxes on each object chain at equal intervals, crop the images corresponding to the selected detection boxes and scale these images to the same size, and extract features from each image of the same size with the third neural network and classify them, to obtain the class confidence of each detection box on each object chain.
Those skilled in the art will appreciate that the functions implemented by the processing modules of the apparatus for object detection in video shown in Fig. 4 can be understood with reference to the foregoing description of the method for object detection in video. Those skilled in the art will appreciate that the functions of the processing units of the apparatus shown in Fig. 4 may be implemented by a program running on a processor, or by specific logic circuits.
In practice, the above key frame detection module 20 may be implemented by an image-based object detector. The specific structures of the first determining module 10, the second determining module 30, the correction module 40, the third determining module 50, and the reclassification module 60 may each correspond to a processor. The specific structure of the processor may be a central processing unit (CPU), a micro controller unit (MCU), a digital signal processor (DSP), a programmable logic controller (PLC), or another electronic component or combination of electronic components with processing capability. The processor includes executable code stored in a storage medium; the processor may be connected to the storage medium through a communication interface such as a bus, and, when performing the functions corresponding to the specific units, reads and runs the executable code from the storage medium. The portion of the storage medium used to store the executable code is preferably a non-transitory storage medium.
The first determining module 10, the second determining module 30, the correction module 40, the third determining module 50, and the reclassification module 60 may be integrated into the same processor or may correspond to different processors respectively; when they are integrated into the same processor, the processor implements the functions corresponding to these modules by time division.
The apparatus for object detection in video proposed by the present invention performs grid-like progressive analysis in the two dimensions of time and scale: in the time dimension, motion history images (MHI) are used as input to propagate detection results to other frames; in the scale dimension, the spatial positions of the boxes are corrected step by step from low resolution to high resolution. Through this grid-like propagation and correction path, the detection result of every frame is finally obtained at high resolution. In this way, a good balance between computation cost and detection performance can be achieved, and real-time object detection in video can be realized while maintaining relatively high accuracy.
An embodiment of the present invention also describes a computer storage medium in which computer-executable instructions are stored, the computer-executable instructions being used to perform the method for object detection in video described in the foregoing embodiments. That is, after the computer-executable instructions are executed by a processor, the method for object detection in video provided by any of the foregoing technical solutions can be realized.
Those skilled in the art will appreciate that the functions of the programs in the computer storage medium of this embodiment can be understood with reference to the foregoing description of the method for object detection in video.
It should be noted that the technical solution of the present invention is highly general. In addition to the object detection task described above, tasks such as object tracking in video and object instance segmentation can be accomplished by replacing particular modules, such as the second determining module and the correction module.
Taking object instance segmentation as an example: under the framework of the method provided by the present invention, starting from the segmentation results of sparse key frames, masks are propagated along the time axis and corrected step by step in spatial position, with the object detector replaced by a segmenter.
Taking object tracking as an example: under the framework of the method provided by the present invention, starting from the detection results of sparse key frames, tracking is propagated along the time axis and corrected step by step in spatial position, with the object detector replaced by a tracker.
The method and apparatus for object detection in video and the computer storage medium described in the above embodiments can be applied to scenarios such as intelligent video analysis and autonomous driving of unmanned vehicles.
An application scenario in the field of autonomous driving is given below. In practice, an intelligent vehicle uses the above method and apparatus for object detection in video and the computer storage medium, with a detection process that goes from sparse to dense in time and from low resolution to high resolution in scale, to determine the detection result of every frame in the target video, realizing real-time object detection in video while maintaining high accuracy; the video stream of the onboard camera is analyzed in real time to detect objects such as pedestrians and vehicles on the road ahead for vision-based driver assistance.
An application scenario in intelligent video analysis is given below. In practice, a robot uses the above method and apparatus for object detection in video and the computer storage medium, with a detection process that goes from sparse to dense in time and from low resolution to high resolution in scale, to analyze surveillance video in real time, quickly and accurately determining the detection result of every frame in the target video and detecting objects of interest.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division of logical functions, and there may be other division methods in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present invention is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the method described in each embodiment of the present invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with this technical field can readily think of changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An object detection method in video, characterized in that the method comprises:
determining several key frames based on a target video, and performing object detection on each of the key frames to obtain a detection result of each of the key frames;
determining, according to the detection result of each of the key frames, a detection result of an intermediate frame between every two adjacent key frames;
correcting the detection result of each intermediate frame to obtain a corrected detection result of each intermediate frame;
determining a detection result of the target video based on the detection result of each of the key frames and the corrected detection result of each intermediate frame.
2. The method according to claim 1, characterized in that, after the determining the detection result of the target video, the method further comprises:
concatenating detection boxes of the same class in every two adjacent frames according to their degree of spatial overlap to obtain an object chain, wherein the object chain is composed of detection boxes of the same class across multiple frames;
reclassifying the detection boxes on each object chain respectively, and obtaining a classification confidence of each of the detection boxes.
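A minimal sketch of such overlap-based concatenation into object chains, assuming detection boxes in the [x1, y1, x2, y2] format and an IoU threshold chosen for the example; the greedy matching strategy is an illustrative simplification.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_object_chains(per_frame_dets, iou_thr=0.5):
    """per_frame_dets: list over frames of [(box, label, confidence), ...].
    Greedily extends a chain with a same-class, highly overlapping box from the next frame."""
    chains = [[(0, d)] for d in per_frame_dets[0]]
    for t, dets in enumerate(per_frame_dets[1:], start=1):
        unmatched = list(dets)
        for chain in chains:
            last_t, (last_box, last_label, _) = chain[-1]
            if last_t != t - 1:
                continue                                # this chain has already ended
            best = max((d for d in unmatched if d[1] == last_label),
                       key=lambda d: iou(last_box, d[0]), default=None)
            if best is not None and iou(last_box, best[0]) >= iou_thr:
                chain.append((t, best))
                unmatched.remove(best)
        chains += [[(t, d)] for d in unmatched]          # leftover boxes start new chains
    return chains
```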
3. The method according to claim 1 or 2, characterized in that the determining several key frames based on the target video and performing object detection on each of the key frames to obtain the detection result of each of the key frames comprises:
selecting multiple initial key frames at a preset time interval, and performing object detection on each initial key frame to obtain spatial positions and classification confidences of detection boxes in each initial key frame;
matching the detection boxes in every two adjacent initial key frames based on spatial position and classification confidence;
in response to a matching degree of spatial position and classification confidence being less than a preset threshold, selecting a secondary key frame from the frames between the two adjacent initial key frames, and performing object detection on each secondary key frame to obtain spatial positions and classification confidences of detection boxes in each secondary key frame;
wherein the key frames determined for the target video include only the initial key frames, or include both the initial key frames and the secondary key frames.
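A minimal sketch of this adaptive key-frame selection, assuming a hypothetical match_score function that combines spatial overlap and confidence similarity; the interval, the threshold, and the choice of the midpoint as the secondary key frame are illustrative.

```python
def select_key_frames(frames, detect, match_score, interval=20, thr=0.6):
    """Evenly spaced initial key frames; when the detections of two adjacent
    initial key frames match poorly in spatial position and confidence,
    a secondary key frame between them is also detected in full.

    detect(frame) and match_score(dets_a, dets_b) are assumed callables."""
    keys = list(range(0, len(frames), interval))
    dets = {k: detect(frames[k]) for k in keys}          # full detection on initial key frames
    secondary = []
    for a, b in zip(keys, keys[1:]):
        if match_score(dets[a], dets[b]) < thr:          # matching degree below the preset threshold
            mid = (a + b) // 2                           # pick a frame in between as a secondary key frame
            dets[mid] = detect(frames[mid])
            secondary.append(mid)
    return sorted(keys + secondary), dets
```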
4. The method according to claim 1 or 2, characterized in that the determining, according to the detection result of each of the key frames, the detection result of the intermediate frame between every two adjacent key frames comprises:
for every two adjacent key frames, taking the left frame, the intermediate frame, and the frames between the left frame and the intermediate frame, computing a first motion history image, extracting features from the first motion history image using a first neural network, predicting a first offset of the detection boxes from the left frame to the intermediate frame, and adding the first offset to the detection boxes of the left frame to obtain spatial positions of the detection boxes propagated to the intermediate frame, wherein classification confidences of these detection boxes of the intermediate frame are the same as those of the detection boxes of the left frame;
for every two adjacent key frames, taking the right frame, the intermediate frame, and the frames between the right frame and the intermediate frame, computing a second motion history image, extracting features from the second motion history image using the first neural network, predicting a second offset of the detection boxes from the right frame to the intermediate frame, and adding the second offset to the detection boxes of the right frame to obtain spatial positions of the detection boxes propagated to the intermediate frame, wherein classification confidences of these detection boxes of the intermediate frame are the same as those of the detection boxes of the right frame;
merging the result propagated from the left frame to the intermediate frame and the result propagated from the right frame to the intermediate frame as the detection result of the intermediate frame.
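A minimal sketch of the bidirectional propagation, with a simple frame-differencing approximation of the motion history image and a placeholder predict_offsets callable standing in for the first neural network; the merge by simple concatenation is an illustrative simplification.

```python
import numpy as np

def motion_history_image(clip, diff_thr=15, decay=1.0 / 255):
    """Approximate motion history image of a short clip: pixels that moved
    recently are bright, older motion gradually fades."""
    mhi = np.zeros(clip[0].shape[:2], dtype=np.float32)
    for prev, cur in zip(clip, clip[1:]):
        moving = np.abs(cur.mean(-1) - prev.mean(-1)) > diff_thr   # frame-difference motion mask
        mhi = np.where(moving, 1.0, np.clip(mhi - decay, 0.0, 1.0))
    return mhi

def propagate_bidirectional(frames, left, mid, right, dets_left, dets_right, predict_offsets):
    """dets_left / dets_right: [(box, confidence), ...] for the two key frames.
    predict_offsets(mhi, boxes) stands in for the first neural network and
    returns one [dx1, dy1, dx2, dy2] offset per box."""
    mhi_l = motion_history_image(frames[left:mid + 1])             # left key frame -> intermediate frame
    mhi_r = motion_history_image(frames[mid:right + 1][::-1])      # right key frame -> intermediate frame
    from_left = [(np.add(b, o), c) for (b, c), o in
                 zip(dets_left, predict_offsets(mhi_l, [b for b, _ in dets_left]))]
    from_right = [(np.add(b, o), c) for (b, c), o in
                  zip(dets_right, predict_offsets(mhi_r, [b for b, _ in dets_right]))]
    return from_left + from_right       # merged (simple union here) as the intermediate-frame result
```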
5. The method according to claim 1 or 2, characterized in that the correcting the detection result of each intermediate frame to obtain the corrected detection result of each intermediate frame comprises:
performing a scale transformation operation on the image and the detection result of the intermediate frame according to a target scale, the target scale being larger than the current scale;
extracting features from the image using a second neural network, predicting an offset from an input box to the corresponding object position in the image, and adding the offset to the input box to obtain the spatial position after correction at the target scale;
wherein the input box is a detection box of the intermediate frame.
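A minimal sketch of the scale-wise correction, assuming hypothetical resize and predict_refinement callables, the latter standing in for the second neural network; the box format and the scale factors are illustrative.

```python
def refine_at_target_scale(image, boxes, current_scale, target_scale, resize, predict_refinement):
    """Rescale the intermediate-frame image and boxes to the larger target
    scale, then let a regression step nudge each box toward the object.

    resize(image, factor) and predict_refinement(image, box) are assumed
    callables; the latter returns a [dx1, dy1, dx2, dy2] offset toward the
    true object position."""
    factor = target_scale / current_scale          # the target scale is larger than the current scale
    resized = resize(image, factor)
    refined = []
    for box in boxes:
        scaled = [c * factor for c in box]         # move the input box to the target scale
        offset = predict_refinement(resized, scaled)
        refined.append([c + d for c, d in zip(scaled, offset)])
    return refined
```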
6. The method according to claim 1 or 2, characterized in that the determining the detection result of the target video based on the detection result of each of the key frames and the corrected detection result of each intermediate frame comprises:
determining, based on the detection result of each of the key frames and the corrected detection result of each intermediate frame, detection results of frames in the target video other than the key frames and the intermediate frames by using a linear interpolation algorithm.
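A minimal sketch of the linear interpolation step, assuming a box has already been matched between the two frames that bracket the frame to be filled in and is given in the [x1, y1, x2, y2] format.

```python
def interpolate_box(box_a, box_b, t_a, t_b, t):
    """Linearly interpolate a matched [x1, y1, x2, y2] box between frames t_a and t_b."""
    w = (t - t_a) / float(t_b - t_a)
    return [(1.0 - w) * ca + w * cb for ca, cb in zip(box_a, box_b)]

# Example: a box detected at frame 4 and frame 8 gives the frame-6 position halfway between them.
mid_box = interpolate_box([10, 10, 50, 50], [20, 10, 60, 50], 4, 8, 6)   # -> [15.0, 10.0, 55.0, 50.0]
```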
7. The method according to claim 2, characterized in that the reclassifying the detection boxes on each object chain respectively and obtaining the classification confidence of each of the detection boxes comprises:
selecting several detection boxes on each object chain at equal intervals, cropping the images corresponding to the several detection boxes and scaling the images to the same size, and extracting features from each image of the same size using a third neural network and classifying them, to obtain the classification confidence of each detection box on each object chain.
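A minimal sketch of the chain-level reclassification, with a placeholder classify callable standing in for the third neural network and a crude nearest-neighbour resize; the number of sampled boxes and the output size are illustrative.

```python
import numpy as np

def reclassify_chain(frames, chain, classify, samples=8, out_size=224):
    """chain: [(frame_index, [x1, y1, x2, y2]), ...] for one object chain.
    classify(batch) stands in for the third neural network and is assumed to
    return an (n, num_classes) array of classification confidences."""
    step = max(1, len(chain) // samples)
    picked = chain[::step][:samples]                    # detection boxes sampled at equal intervals
    crops = []
    for t, (x1, y1, x2, y2) in picked:
        patch = frames[t][int(y1):int(y2), int(x1):int(x2)]
        ys = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
        xs = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
        crops.append(patch[ys][:, xs])                  # nearest-neighbour resize to a common size
    scores = classify(np.stack(crops))
    return scores.mean(axis=0)                          # shared confidence assigned to every box on the chain
```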
8. An object detection device in video, characterized in that the device comprises:
a first determining module, configured to determine several key frames based on a target video;
a key frame detection module, configured to perform object detection on each of the key frames to obtain a detection result of each of the key frames;
a second determining module, configured to determine, according to the detection result of each of the key frames, a detection result of an intermediate frame between every two adjacent key frames;
a correcting module, configured to correct the detection result of each intermediate frame to obtain a corrected detection result of each intermediate frame;
a third determining module, configured to determine a detection result of the target video based on the detection result of each of the key frames and the corrected detection result of each intermediate frame.
9. The device according to claim 8, characterized in that the device further comprises:
a reclassification module, configured to, after the third determining module determines the detection result of the target video, concatenate detection boxes of the same class in every two adjacent frames according to their degree of spatial overlap to obtain an object chain, wherein the object chain is composed of detection boxes of the same class across multiple frames, and to reclassify the detection boxes on each object chain respectively and obtain a classification confidence of each of the detection boxes.
10. A computer storage medium, in which computer-executable instructions are stored, the computer-executable instructions being used to execute the object detection method in video according to any one of claims 1 to 7.
CN201810151829.XA 2018-02-14 2018-02-14 Method and device for detecting object in video and computer storage medium Active CN108256506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810151829.XA CN108256506B (en) 2018-02-14 2018-02-14 Method and device for detecting object in video and computer storage medium


Publications (2)

Publication Number Publication Date
CN108256506A true CN108256506A (en) 2018-07-06
CN108256506B CN108256506B (en) 2020-11-24

Family

ID=62744333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810151829.XA Active CN108256506B (en) 2018-02-14 2018-02-14 Method and device for detecting object in video and computer storage medium

Country Status (1)

Country Link
CN (1) CN108256506B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499085A (en) * 2008-12-16 2009-08-05 北京大学 Method and apparatus for fast extracting key frame
US20120154684A1 (en) * 2010-12-17 2012-06-21 Jiebo Luo Method for producing a blended video sequence
CN103413322A (en) * 2013-07-16 2013-11-27 南京师范大学 Keyframe extraction method of sequence video
CN103400386A (en) * 2013-07-30 2013-11-20 清华大学深圳研究生院 Interactive image processing method used for video
CN105931269A (en) * 2016-04-22 2016-09-07 海信集团有限公司 Tracking method for target in video and tracking device thereof
CN106447608A (en) * 2016-08-25 2017-02-22 中国科学院长春光学精密机械与物理研究所 Video image splicing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐浩然 (Xu Haoran): "基于对象的时空域压缩视频摘要技术" [Object-based spatio-temporal compressed-domain video summarization], 《数字技术与应用》 (Digital Technology and Application) *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063593A (en) * 2018-07-13 2018-12-21 北京智芯原动科技有限公司 A kind of face tracking method and device
CN109118519A (en) * 2018-07-26 2019-01-01 北京纵目安驰智能科技有限公司 Target Re-ID method, system, terminal and the storage medium of Case-based Reasoning segmentation
US11113546B2 (en) 2018-09-04 2021-09-07 Baidu Online Network Technology (Beijing) Co., Ltd. Lane line processing method and device
US11307302B2 (en) 2018-09-07 2022-04-19 Baidu Online Network Technology (Beijing) Co., Ltd Method and device for estimating an absolute velocity of an obstacle, and non-volatile computer-readable storage medium
US11205289B2 (en) 2018-09-07 2021-12-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method, device and terminal for data augmentation
US10984588B2 (en) 2018-09-07 2021-04-20 Baidu Online Network Technology (Beijing) Co., Ltd Obstacle distribution simulation method and device based on multiple models, and storage medium
CN110375659A (en) * 2018-09-11 2019-10-25 百度在线网络技术(北京)有限公司 Detect method, apparatus, equipment and the storage medium of obstacle height
CN109059780A (en) * 2018-09-11 2018-12-21 百度在线网络技术(北京)有限公司 Detect method, apparatus, equipment and the storage medium of obstacle height
CN109059780B (en) * 2018-09-11 2019-10-15 百度在线网络技术(北京)有限公司 Detect method, apparatus, equipment and the storage medium of obstacle height
US11519715B2 (en) 2018-09-11 2022-12-06 Baidu Online Network Technology (Beijing) Co., Ltd. Method, device, apparatus and storage medium for detecting a height of an obstacle
CN110375659B (en) * 2018-09-11 2021-07-27 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for detecting height of obstacle
US11047673B2 (en) 2018-09-11 2021-06-29 Baidu Online Network Technology (Beijing) Co., Ltd Method, device, apparatus and storage medium for detecting a height of an obstacle
CN109308463A (en) * 2018-09-12 2019-02-05 北京奇艺世纪科技有限公司 A kind of video object recognition methods, device and equipment
US11126875B2 (en) 2018-09-13 2021-09-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device of multi-focal sensing of an obstacle and non-volatile computer-readable storage medium
CN109344789A (en) * 2018-10-16 2019-02-15 北京旷视科技有限公司 Face tracking method and device
CN109711296B (en) * 2018-12-14 2022-01-25 百度在线网络技术(北京)有限公司 Object classification method in automatic driving, device thereof and readable storage medium
CN109711296A (en) * 2018-12-14 2019-05-03 百度在线网络技术(北京)有限公司 Object classification method and its device, computer program product, readable storage medium storing program for executing
US11780463B2 (en) 2019-02-19 2023-10-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and server for real-time learning of travelling strategy of driverless vehicle
US11718318B2 (en) 2019-02-22 2023-08-08 Apollo Intelligent Driving (Beijing) Technology Co., Ltd. Method and apparatus for planning speed of autonomous vehicle, and storage medium
CN110070050A (en) * 2019-04-24 2019-07-30 厦门美图之家科技有限公司 Object detection method and system
CN110070050B (en) * 2019-04-24 2021-08-20 厦门美图之家科技有限公司 Target detection method and system
CN110189378A (en) * 2019-05-23 2019-08-30 北京奇艺世纪科技有限公司 A kind of method for processing video frequency, device and electronic equipment
CN110427816A (en) * 2019-06-25 2019-11-08 平安科技(深圳)有限公司 Object detecting method, device, computer equipment and storage medium
CN110427816B (en) * 2019-06-25 2023-09-08 平安科技(深圳)有限公司 Object detection method, device, computer equipment and storage medium
WO2020258499A1 (en) * 2019-06-25 2020-12-30 平安科技(深圳)有限公司 Object detection method and apparatus, and computer device and storage medium
CN110705412A (en) * 2019-09-24 2020-01-17 北京工商大学 Video target detection method based on motion history image
CN111178245A (en) * 2019-12-27 2020-05-19 深圳佑驾创新科技有限公司 Lane line detection method, lane line detection device, computer device, and storage medium
CN111178245B (en) * 2019-12-27 2023-12-22 佑驾创新(北京)技术有限公司 Lane line detection method, lane line detection device, computer equipment and storage medium
CN111860373B (en) * 2020-07-24 2022-05-20 浙江商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN111860373A (en) * 2020-07-24 2020-10-30 浙江商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN112061139A (en) * 2020-09-03 2020-12-11 三一专用汽车有限责任公司 Automatic driving control method, automatic driving device and computer storage medium
CN112528932A (en) * 2020-12-22 2021-03-19 北京百度网讯科技有限公司 Method and device for optimizing position information, road side equipment and cloud control platform
CN112528932B (en) * 2020-12-22 2023-12-08 阿波罗智联(北京)科技有限公司 Method and device for optimizing position information, road side equipment and cloud control platform
WO2023138444A1 (en) * 2022-01-22 2023-07-27 北京眼神智能科技有限公司 Pedestrian action continuous detection and recognition method and apparatus, storage medium, and computer device

Also Published As

Publication number Publication date
CN108256506B (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN108256506A (en) Object detecting method and device, computer storage media in a kind of video
CN110364008B (en) Road condition determining method and device, computer equipment and storage medium
Moers et al. The exid dataset: A real-world trajectory dataset of highly interactive highway scenarios in germany
CN104134349B (en) A kind of public transport road conditions disposal system based on traffic multisource data fusion and method
CN110415277B (en) Multi-target tracking method, system and device based on optical flow and Kalman filtering
Anand et al. Data fusion-based traffic density estimation and prediction
Luo et al. Queue length estimation for signalized intersections using license plate recognition data
CN111027430B (en) Traffic scene complexity calculation method for intelligent evaluation of unmanned vehicles
CN103366602A (en) Method of determining parking lot occupancy from digital camera images
Hussein et al. Automated pedestrian safety analysis at a signalized intersection in New York City: Automated data extraction for safety diagnosis and behavioral study
Piccoli et al. Fussi-net: Fusion of spatio-temporal skeletons for intention prediction network
Tageldin et al. Comparison of time-proximity and evasive action conflict measures: Case studies from five cities
CN110268457A (en) For determining medium, controller that the actuated method of collection, computer program product, the computer capacity of at least two vehicles read and including the vehicle of the controller
Filatov et al. Any motion detector: Learning class-agnostic scene dynamics from a sequence of lidar point clouds
Dhouioui et al. Design and implementation of a radar and camera-based obstacle classification system using machine-learning techniques
CN112149471B (en) Loop detection method and device based on semantic point cloud
CN114372503A (en) Cluster vehicle motion trail prediction method
CN113867367B (en) Processing method and device for test scene and computer program product
CN115841080A (en) Multi-view dynamic space-time semantic embedded open pit truck transport time prediction method
CN110021161A (en) A kind of prediction technique and system of traffic direction
CN113361528B (en) Multi-scale target detection method and system
Realpe et al. Towards fault tolerant perception for autonomous vehicles: Local fusion
Shin et al. Image-based learning to measure the stopped delay in an approach of a signalized intersection
Katariya et al. A pov-based highway vehicle trajectory dataset and prediction architecture
Hussein et al. Analysis of road user behavior and safety during New York City’s summer streets program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant