CN108491816A - Method and apparatus for target tracking in video - Google Patents
- Publication number
- CN108491816A CN108491816A CN201810276460.5A CN201810276460A CN108491816A CN 108491816 A CN108491816 A CN 108491816A CN 201810276460 A CN201810276460 A CN 201810276460A CN 108491816 A CN108491816 A CN 108491816A
- Authority
- CN
- China
- Prior art keywords
- target
- candidate
- tracked
- video
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
Embodiments of the present application disclose a method and apparatus for target tracking in video. The method comprises: cropping a candidate region from the current frame of a video based on the position of the target to be tracked in a historical frame of the video; inputting the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets within the feature map; determining, based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets from the feature map; and taking, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame. This embodiment can determine the target to be tracked in the current frame from multiple candidate target regions based on the features of the target itself, which benefits the accuracy of target tracking.
Description
Technical field
The present application relates to the field of image processing, in particular to the field of computer vision, and more particularly to a method and apparatus for target tracking in video.
Background
Target tracking refers to establishing the positional relationship of a tracked object across a continuous video sequence, so as to obtain the object's complete motion trajectory. For example, given the coordinate position of a target in the first frame of an image sequence, target tracking computes the exact position of that target in the next frame.
During motion, the target may exhibit changes in its image appearance, such as changes in pose or shape, changes in scale, occlusion by the background, or changes in illumination. Research on target tracking algorithms, and their concrete applications, revolves around handling these variations.
In the prior art, a variety of target tracking algorithms exist, for example, particle filter methods, optical flow algorithms based on feature points, and tracking algorithms based on correlation filtering.
Summary of the invention
Embodiments of the present application propose a method and apparatus for target tracking in video.
In a first aspect, an embodiment of the present application provides a method for target tracking in video, comprising: cropping a candidate region from the current frame of a video based on the position of the target to be tracked in a historical frame of the video; inputting the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets within the feature map; determining, based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets from the feature map; and taking, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
In some embodiments, before inputting the cropped candidate region into the pre-trained fully convolutional network, the method further comprises a step of training the fully convolutional network, which comprises: establishing an initial fully convolutional network; obtaining a training sample set, where the training sample set includes multiple training sample pairs, and each training sample pair includes two frames of the same video file together with annotation information marking the region occupied by the target object in the two frames; and inputting the training sample set into the initial fully convolutional network and training it based on a preset loss function, to obtain the trained fully convolutional network.
In some embodiments, taking, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame comprises: inputting each candidate target region cropped from the feature map into a preset pooling layer to obtain a candidate feature map corresponding to each candidate target region; for each candidate feature map, computing the similarity between that candidate feature map and the previously obtained feature map of the target to be tracked; and taking the candidate target region corresponding to the candidate feature map with the highest similarity to the previously obtained feature map of the target to be tracked as the target to be tracked in the current frame.
In some embodiments, after taking the candidate target region corresponding to the candidate feature map with the highest similarity to the previously obtained feature map of the target to be tracked as the target to be tracked in the current frame, the method further comprises: taking the candidate feature map with the highest similarity to the previously obtained feature map of the target to be tracked as the new feature map of the target to be tracked.
In some embodiments, the method further comprises: detecting the target to be tracked in the current frame of the video at preset time intervals; and updating the feature map of the target to be tracked based on the detected target.
In some embodiments, the historical frame and the current frame of the video are two adjacent frames of the video.
In a second aspect, an embodiment of the present application further provides an apparatus for target tracking in video, comprising: a cropping unit, configured to crop a candidate region from the current frame of a video based on the position of the target to be tracked in a historical frame of the video; a feature acquisition unit, configured to input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets within the feature map; a candidate target region determination unit, configured to determine, based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets from the feature map; and a target tracking unit, configured to take, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
In some embodiments, the apparatus further includes a training unit, configured, before the feature acquisition unit inputs the cropped candidate region into the pre-trained fully convolutional network, to: establish an initial fully convolutional network; obtain a training sample set, where the training sample set includes multiple training sample pairs, and each training sample pair includes two frames of the same video file together with annotation information marking the region occupied by the target object in the two frames; and input the training sample set into the initial fully convolutional network and train it based on a preset loss function, to obtain the trained fully convolutional network.
In some embodiments, the target tracking unit is further configured to: input each candidate target region cropped from the feature map into a preset pooling layer to obtain a candidate feature map corresponding to each candidate target region; for each candidate feature map, compute the similarity between that candidate feature map and the previously obtained feature map of the target to be tracked; and take the candidate target region corresponding to the candidate feature map with the highest similarity to the previously obtained feature map of the target to be tracked as the target to be tracked in the current frame.
In some embodiments, the apparatus further includes a determination unit, configured, after the target tracking unit takes the candidate target region corresponding to the candidate feature map with the highest similarity to the previously obtained feature map of the target to be tracked as the target to be tracked in the current frame, to take that candidate feature map as the new feature map of the target to be tracked.
In some embodiments, the apparatus further includes: a detection unit, configured to detect the target to be tracked in the current frame of the video at preset time intervals; and an updating unit, configured to update the feature map of the target to be tracked based on the detected target.
In some embodiments, the historical frame and the current frame of the video are two adjacent frames of the video.
In a third aspect, an embodiment of the present application further provides a device, comprising: one or more processors; and a storage apparatus for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any method of the first aspect.
The method and apparatus for target tracking in video provided by the embodiments of the present application crop a candidate region from the current frame of a video based on the position of the target to be tracked in a historical frame of the video, input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, then determine from the feature map, based on the candidate target region information, the candidate target regions in one-to-one correspondence with the candidate targets, and finally take, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame. The target to be tracked in the current frame can thus be determined from multiple candidate target regions based on the features of the target itself, which benefits the accuracy of target tracking.
Description of the drawings
Other features, objects, and advantages of the present application will become more apparent by reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for target tracking in video according to the present application;
Fig. 3A is a schematic diagram of the position of the target to be tracked in a historical frame of a video;
Fig. 3B is a schematic diagram of the candidate region cropped from the current frame of the video;
Fig. 3C is a schematic diagram of the candidate target regions within the candidate region;
Figs. 4A–4D are schematic diagrams of an application scenario of the method for target tracking in video according to the present application;
Fig. 5 is a flowchart of another embodiment of the method for target tracking in video according to the present application;
Fig. 6 is a schematic flowchart of the training method for the fully convolutional network used in the methods for target tracking in video of the embodiments of the present application;
Fig. 7 is a structural diagram of one embodiment of the apparatus for target tracking in video according to the present application;
Fig. 8 is a structural schematic diagram of a computer system suitable for implementing a server of the embodiments of the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, and are not a limitation of that invention. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for target tracking in video, or of the apparatus for target tracking in video, of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting video playback, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background processing server providing support for the videos played on the terminal devices 101, 102, 103. The background processing server may analyze and process received data such as target tracking requests, and feed the processing result (for example, video data in which the region of the target to be tracked is marked in each video frame) back to the terminal devices.
It should be noted that the method for target tracking in video provided by the embodiments of the present application may be executed by the server 105, or alternatively by the terminal devices 101, 102, 103. Correspondingly, the apparatus for target tracking in video may be set in the server 105, or alternatively in the terminal devices 101, 102, 103.
It should be understood that the numbers of terminal devices 101, 102, 103, networks 104, and servers 105 in Fig. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks, and servers.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for target tracking in video according to the present application is shown. The method for target tracking in video includes the following steps:
Step 201: based on the position of the target to be tracked in a historical frame of the video, crop a candidate region from the current frame of the video.
In this embodiment, the executing body of the method for target tracking in video (for example, the server shown in Fig. 1) may perform the target tracking operation in a video in response to receiving a target tracking request sent by a user.
Here, the target tracking request may, for example, include feature information of the target to be tracked. The feature information may be any information that can characterize the features of the target to be tracked. For example, in some application scenarios, the user wishes to track, in a video file obtained by filming a certain basketball match, the first player to score in the match. Then, in this application scenario, "the first player to score" may serve as the feature information of the target to be tracked.
Then, the executing body of the method of this embodiment may perform target detection in the video frames, in the playing order of the video file, according to a certain target detection algorithm. If the target is detected in a certain frame of the video file, that video frame may be used as the historical frame in this step, and the candidate region in which the target to be tracked is likely to appear in subsequently played video frames may be determined based on the position of the target to be tracked in that historical frame.
In some application scenarios, if the target to be tracked lies in the region (x1, y1, x2, y2) in the historical frame of the video, then in the current frame the range of a candidate region may be determined according to the extent of this rectangular region (x1, y1, x2, y2). Here, (x1, y1) and (x2, y2) may be the coordinate values, in a preset planar rectangular coordinate system, of the upper-left and lower-right corners of the rectangular region occupied by the target to be tracked in the historical frame. It can be understood that, in the current frame, the candidate region may or may not contain the rectangular region (x1, y1, x2, y2).
As shown in Fig. 3A, a historical frame 300A is illustrated. From the historical frame 300A, the rectangular region occupied by the target to be tracked 310 has been determined, and the coordinate values of the upper-left and lower-right corners of that rectangular region in the preset rectangular coordinate system (Oxy) are (x1, y1) and (x2, y2), respectively. At this point, a range may be cropped from the current frame as the candidate region.
As shown in Fig. 3B, the range (x3, y3, x4, y4) occupied by the candidate region 320 in the current frame of the video is schematically shown. It is easy to see that the candidate region (x3, y3, x4, y4) cropped in Fig. 3B contains the rectangular region (x1, y1, x2, y2).
The specific position and extent of the candidate region may be set according to prior knowledge and the specific application scenario. For example, they may be set according to the movement speed and movement range of the target to be tracked.
In some application scenarios, the target to be tracked may be a certain motor vehicle. In these scenarios, the position and size of the candidate region may be set according to the speed range of the motor vehicle (for example, 0–100 km/h) and the region occupied in the video frames by the road on which the vehicle travels.
It can be understood that, in the absence of prior knowledge, in order to avoid the target to be tracked not appearing in the candidate region selected from the current frame, in some application scenarios the whole region of the current frame may also be used as the candidate region.
Returning to Fig. 2, the method for target tracking in video of this embodiment further includes:
Step 202: input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets within the feature map.
In this step, the cropped candidate region (for example, the candidate region 320 shown in Fig. 3B) may be input into the pre-trained fully convolutional network to obtain a feature map.
A fully convolutional network (FCN) can receive an input image of arbitrary size. It upsamples the feature map of the last convolutional layer through deconvolution layers, restoring it to the same size as the input image, so that a prediction can be produced for each pixel while preserving the spatial information of the original input image. Finally, pixel-wise classification is performed on the feature map, which has the same size as the input, and the loss is computed with a pixel-by-pixel softmax classification, which is equivalent to treating each pixel as a training sample.
Precisely because the FCN has the above characteristics, no matter the size of the candidate region input to the FCN, the feature information of each pixel in the candidate region can be obtained through the FCN.
The feature map obtained by the FCN may, for example, include classification information and regression information for the pixels in the candidate region. Here, the classification information may indicate the probability that a pixel belongs to the target to be tracked, while the regression information may indicate the probable region in the feature map belonging to the target to be tracked. Specifically, the FCN may output feature maps of multiple channels; these feature maps may include, for each pixel in the candidate region, the probability that the pixel belongs to the target to be tracked and, for pixels that do belong to it, information on the relative position of the pixel with respect to the target to which it belongs.
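One way such multi-channel output could be decoded is sketched below: a single-channel score map gives each pixel's probability of belonging to the target, and a 4-channel regression map gives each pixel's distances to the edges of the box it belongs to. This channel layout is an assumption for illustration; the patent does not fix an exact output format.

```python
import numpy as np

def decode_fcn_output(score_map, box_map, threshold=0.5):
    """Turn per-pixel FCN outputs into candidate boxes.

    score_map: (H, W) probability that each pixel belongs to the target.
    box_map:   (4, H, W) per-pixel distances (left, top, right, bottom)
               from the pixel to the edges of the box it belongs to.
    Returns a list of (x1, y1, x2, y2, score) for pixels above threshold.
    """
    ys, xs = np.where(score_map > threshold)
    boxes = []
    for y, x in zip(ys, xs):
        l, t, r, b = box_map[:, y, x]
        boxes.append((x - l, y - t, x + r, y + b, float(score_map[y, x])))
    return boxes
```

Each surviving pixel thus proposes one bounding box; the next step of the flow groups these per-pixel boxes into candidate target regions.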
Step 203: based on the candidate target region information, determine from the feature map the candidate target regions in one-to-one correspondence with the candidate targets.
Through step 202, the classification information and regression information of each pixel in the candidate region can be obtained. It can be understood that each pixel in the candidate region whose probability of belonging to the target to be tracked exceeds a preset threshold has corresponding information on its relative position with respect to the target to which it belongs. In some application scenarios, this relative position information may be expressed as a bounding box surrounding the pixel. Here, the bounding box may, for example, be a rectangular region of the candidate region containing the pixel. Then, by clustering the bounding boxes determined from all pixels in the candidate region whose probability of belonging to the target to be tracked exceeds the preset threshold, the candidate target regions in one-to-one correspondence with the candidate targets can be determined. Fig. 3C schematically shows four candidate target regions 310a–310d obtained by clustering the bounding boxes in the candidate region 320.
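The clustering of per-pixel bounding boxes can be sketched as follows: boxes that overlap strongly (by intersection-over-union) are greedily grouped, and each group is averaged into one candidate target region. The greedy strategy and the 0.5 IoU threshold are illustrative assumptions; the patent does not specify a particular clustering algorithm.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def cluster_boxes(boxes, iou_thr=0.5):
    """Greedily group per-pixel boxes whose IoU with a cluster's mean box
    exceeds iou_thr, and average each group into one candidate region."""
    clusters = []  # each cluster: [sum_x1, sum_y1, sum_x2, sum_y2, count]
    for box in boxes:
        for c in clusters:
            mean = tuple(v / c[4] for v in c[:4])
            if iou(box, mean) > iou_thr:
                for i in range(4):
                    c[i] += box[i]
                c[4] += 1
                break
        else:
            clusters.append([box[0], box[1], box[2], box[3], 1])
    return [tuple(v / c[4] for v in c[:4]) for c in clusters]
```

Two nearly coincident boxes collapse into one region, while a distant box forms its own cluster, yielding one region per candidate target, as in Fig. 3C.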
Step 204: take, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
By comparing the similarity between each candidate target region and the target to be tracked, the candidate target region closest to the target to be tracked can be determined from the candidate target regions. That candidate target region can be considered the region in the current frame that is most likely to be the target to be tracked.
In some optional implementations, each candidate target region in the candidate region has been determined through step 203 above, and each pixel in a candidate target region has a probability of belonging to the target to be tracked. Therefore, in this step, the probability that a candidate target region is the target to be tracked may be determined based on the probabilities that the pixels in that region belong to the target, and this probability may be used as the similarity between the candidate target region and the target to be tracked. Then, among all candidate target regions, the one with the highest probability may be taken as the target to be tracked in the current frame.
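A minimal sketch of this optional implementation: score each candidate region by the mean of its pixels' target probabilities and take the argmax. Using the mean as the region-level score is an assumption for illustration; other aggregations of the per-pixel probabilities would fit the description equally well.

```python
import numpy as np

def pick_target(score_map, regions):
    """Score each candidate region (x1, y1, x2, y2) by the mean per-pixel
    target probability inside it, and return the highest-scoring region."""
    best, best_score = None, -1.0
    for (x1, y1, x2, y2) in regions:
        score = float(score_map[y1:y2, x1:x2].mean())
        if score > best_score:
            best, best_score = (x1, y1, x2, y2), score
    return best, best_score
```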
The method for target tracking in video of this embodiment crops a candidate region from the current frame of the video based on the position of the target to be tracked in a historical frame, inputs the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, then determines from the feature map, based on the candidate target region information, the candidate target regions in one-to-one correspondence with the candidate targets, and finally takes, among the determined candidate target regions, the one with the highest similarity to the target to be tracked as the target to be tracked in the current frame. It can thus determine the target to be tracked in the current frame from multiple candidate target regions based on the features of the target itself, which benefits the accuracy of target tracking.
Figs. 4A–4D are schematic diagrams of an application scenario of the method for target tracking in video of this embodiment.
In this scenario, suppose Chinese athlete A, Japanese athlete B, and South Korean athlete C participate together in a certain long-distance race, and it is desired to track Chinese athlete A.
Suppose that in some historical frame, Chinese athlete A is at the position shown in Fig. 4B.
Then, based on the position of Chinese athlete A in the historical frame shown in Fig. 4B, a candidate region 410 is determined from the current frame shown in Fig. 4C.
Next, the candidate region 410 is cropped and input into the pre-trained fully convolutional network, yielding three candidate target regions 410a, 410b, and 410c, as shown in Fig. 4D. By computing the similarity between the candidate target regions 410a–410c and the target to be tracked (for example, the features of Chinese athlete A extracted from the historical frame of Fig. 4B), it can be determined that candidate target region 410b among 410a–410c is Chinese athlete A.
Fig. 5 shows a schematic flowchart 500 of another embodiment of the method for target tracking in video of the present application.
The method of this embodiment includes:
Step 501: based on the position of the target to be tracked in a historical frame of the video, crop a candidate region from the current frame of the video.
Step 502: input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets within the feature map.
Step 503: based on the candidate target region information, determine from the feature map the candidate target regions in one-to-one correspondence with the candidate targets.
Steps 501–503 above may be executed in a manner similar to steps 201–203 of the embodiment shown in Fig. 2, and will not be described again here.
Through steps 501–503 above, the candidate target regions in one-to-one correspondence with the candidate targets can be determined from the feature map; for example, the candidate target regions may take the form shown by 310a–310d in Fig. 3C.
Step 504: input each candidate target region cropped from the feature map into a preset pooling layer to obtain a candidate feature map corresponding to each candidate target region.
Here, each candidate target region cropped from the feature map can serve as an ROI (region of interest). By applying a pooling operation to these ROIs, candidate feature maps of identical size, each corresponding to one candidate target region, can be obtained. The pooling operation of this step may, for example, be max pooling, mean pooling, stochastic pooling, or the like.
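The ROI pooling of step 504 can be sketched as follows, using the max-pooling variant: each ROI is divided into a fixed grid of bins and the maximum of each bin is taken, so ROIs of different shapes all yield same-size candidate feature maps. The 2-D single-channel feature map and the fixed 4x4 output size are illustrative assumptions.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(4, 4)):
    """Max-pool one ROI (x1, y1, x2, y2) of a 2-D feature map to out_size."""
    x1, y1, x2, y2 = roi
    patch = feature_map[y1:y2, x1:x2]
    oh, ow = out_size
    h, w = patch.shape
    # Split the ROI into an oh x ow grid of bins and take the max of each bin.
    ys = np.linspace(0, h, oh + 1).astype(int)
    xs = np.linspace(0, w, ow + 1).astype(int)
    out = np.empty((oh, ow), dtype=patch.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = patch[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Because every ROI maps to the same output size, the candidate feature maps can be compared directly against the target's feature map in the next step.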
Step 505: for each candidate feature map, compute the similarity between that candidate feature map and the previously obtained feature map of the target to be tracked.
Here, the feature map of the target to be tracked determined from the historical frame may serve as the previously obtained feature map of the target to be tracked. Then, the previously obtained feature map of the target to be tracked has the same size as each candidate feature map obtained by executing steps 501–504 on the current frame.
In this step, any existing or future way of computing the similarity between a candidate feature map and the previously obtained feature map of the target to be tracked may be used, including but not limited to Euclidean distance, cosine similarity, and the like.
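As one of the similarity measures mentioned, cosine similarity between same-size feature maps can be computed by flattening them to vectors; this is a generic sketch, not the patent's prescribed measure.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two same-size feature maps, flattened."""
    a, b = np.ravel(a).astype(float), np.ravel(b).astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(template, candidates):
    """Index of the candidate feature map most similar to the template."""
    sims = [cosine_similarity(template, c) for c in candidates]
    return int(np.argmax(sims)), sims
```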
Step 506, by the highest candidate feature figure of similarity between the clarification of objective figure to be tracked obtained in advance
Corresponding candidate target region is as the target to be tracked in present frame.
It can be understood that the higher the similarity between a candidate feature map and the pre-obtained feature map of the target to be tracked, the more likely it is that the region corresponding to that candidate feature map is the target to be tracked in the current frame. By comparing the similarity of each candidate feature map against the pre-obtained feature map of the target to be tracked, the region in the current frame most likely to be the target to be tracked can be determined.
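The selection in step 506 amounts to an arg-max over similarity scores, as in the following sketch; the cosine measure and the function name are illustrative assumptions:

```python
import numpy as np

def pick_best_candidate(candidate_maps, template_map):
    """Index of the candidate feature map most similar (cosine) to the template."""
    t = template_map.ravel()
    t = t / np.linalg.norm(t)
    scores = [c.ravel() @ t / np.linalg.norm(c) for c in candidate_maps]
    return int(np.argmax(scores))
```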
In the method for target tracking in video of this embodiment, the candidate feature map of each candidate target region is obtained by ROI pooling, so that all resulting candidate feature maps have the same size. This reduces the amount of computation in the similarity calculation and further improves the accuracy of determining the target to be tracked from among the candidates.
In some optional implementations of this embodiment, after the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked has been determined, that highest-similarity candidate feature map may be taken as the new feature map of the target to be tracked. In this way, when tracking the target in subsequent frames of the video, the new feature map serves as the reference for similarity computation. In application scenarios where the appearance of the target to be tracked gradually changes over time, updating the feature map of the target to be tracked in this manner can further improve tracking accuracy.
In some optional implementations of this embodiment, the feature map of the target to be tracked may also be updated by re-detecting the target to be tracked at regular time intervals.
Specifically, continuing to refer to Fig. 5, in step 507, the target to be tracked may be detected in the current frame of the video at preset time intervals.
Here, the preset time interval can be set according to the application scenario and the motion characteristics of the target to be tracked in the video. The interval may be a fixed value or a variable one. For example, in some application scenarios, detection of the target to be tracked may be performed once every 100 frames. The detection of the target to be tracked may be implemented with any existing object detection algorithm, which is not described in detail here.
Then, in step 508, the feature map of the target to be tracked is updated based on the detected target to be tracked.
In this way, updating the feature map of the target to be tracked avoids the error accumulation in that feature map that repeated similarity operations would otherwise cause, thereby improving tracking accuracy.
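Steps 507 and 508 can be sketched together as the following control loop, in which a hypothetical `detect` callable stands in for the full object detector and `track_one` for the similarity-based tracking step; the names and the interval value are illustrative assumptions, not elements of the disclosure:

```python
def track_with_refresh(frames, detect, track_one, interval=100):
    """Track frame by frame, re-running full detection every `interval` frames
    to refresh the template and curb accumulated similarity-matching drift."""
    template = None
    results = []
    for i, frame in enumerate(frames):
        if template is None or i % interval == 0:
            template = detect(frame)               # full detection refreshes the template
        else:
            template = track_one(frame, template)  # similarity-based tracking step
        results.append(template)
    return results
```

With a fixed interval the refresh cost is amortized over many cheap tracking steps; a variable interval could instead trigger re-detection when similarity scores drop.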
In some optional implementations, the fully convolutional network used in the above embodiments of the present application may be trained in the manner shown in Fig. 6.
Specifically, in step 601, an initial fully convolutional network is established.
Here, an initial fully convolutional network with multiple convolutional layers can be built, and the parameters of this initial network assigned initial values.
Step 602: obtain a training sample set. The training sample set includes multiple training sample pairs; each training sample pair includes two frames of images from the same video file and annotation information for marking the region occupied by the target object in the two frames.
The annotation information may be any information that distinguishes the target object from non-target objects in a video frame. For example, each pixel belonging to the target object in a video frame may be labeled "1", and each pixel of other, non-target objects in the video frame labeled "0".
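The labeling scheme in this example can be sketched as follows; the `(y0, x0, y1, x1)` axis-aligned box layout and the function name are assumptions made for illustration:

```python
import numpy as np

def make_label_mask(height, width, box):
    """Binary annotation mask: 1 for pixels inside the target box, 0 elsewhere.

    box is (y0, x0, y1, x1); an assumed axis-aligned layout for illustration.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    y0, x0, y1, x1 = box
    mask[y0:y1, x0:x1] = 1
    return mask
```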
Step 603: input the training sample set into the initial fully convolutional network and train it based on a preset loss function, obtaining the trained fully convolutional network.
After the training sample set is input into the initial fully convolutional network, the network can output a feature map. The feature map can carry the probability that each pixel belongs to the target object (that is, classification information) and, for a pixel belonging to the target to be tracked, information on that pixel's position relative to the target it belongs to (that is, regression information).
By feeding the classification information, the regression information, and the annotation information into the preset loss function, a loss value can be derived. Back-propagating the loss value through the fully convolutional network allows each parameter of the network to be adjusted.
Thus, by cyclically feeding the training sample set into the fully convolutional network, computing loss values, and back-propagating them, the parameters of the network are adjusted continuously until a training-completion condition is reached (for example, the loss value falls below a preset loss threshold).
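The loop just described can be sketched generically as below. The plain gradient step stands in for back-propagation through the fully convolutional network, and the callable interface is an assumption for illustration, not the disclosed training procedure:

```python
import numpy as np

def train_until(loss_and_grad, params, lr=0.1, loss_threshold=1e-3, max_steps=1000):
    """Feed samples, compute a loss, adjust every parameter via a gradient step,
    and stop once the loss falls below the preset threshold."""
    for step in range(max_steps):
        loss, grad = loss_and_grad(params)
        if loss < loss_threshold:
            return params, loss, step
        params = params - lr * grad  # stand-in for back-propagation update
    return params, loss, max_steps
```

In practice `loss_and_grad` would combine the classification and regression terms against the annotation information; here any callable returning `(loss, gradient)` can be plugged in.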
In some optional implementations of the method for target tracking in video of the embodiments of the present application, the historical frame of the video and the current frame of the video can be two adjacent frames. In this way, the target can be tracked in the video frame by frame, which benefits the continuity of tracking.
With further reference to Fig. 7, as an implementation of the methods shown in the figures above, the present application provides an embodiment of an apparatus for target tracking in video. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can be applied to various electronic devices.
As shown in Fig. 7, the apparatus for target tracking in video of this embodiment may include an interception unit 701, a feature acquiring unit 702, a candidate target region determination unit 703, and a target tracking unit 704.
The interception unit 701 is configured to crop a candidate region from the current frame of the video based on the position of the target to be tracked in a historical frame of the video.
The feature acquiring unit 702 is configured to input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets in the feature map.
The candidate target region determination unit 703 is configured to determine, from the feature map and based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets.
The target tracking unit 704 is configured to take, among the determined candidate target regions, the candidate target region with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
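For illustration only, the four units could be composed as in the following sketch, where each callable is a hypothetical stand-in for the corresponding unit in Fig. 7; none of the names below appear in the disclosure:

```python
class VideoTargetTracker:
    """Minimal sketch of the apparatus: four callables composed into one tracker."""

    def __init__(self, crop_candidate, fcn_features, split_candidates, pick_best):
        self.crop_candidate = crop_candidate      # interception unit 701
        self.fcn_features = fcn_features          # feature acquiring unit 702
        self.split_candidates = split_candidates  # candidate region unit 703
        self.pick_best = pick_best                # target tracking unit 704

    def track(self, current_frame, last_position):
        region = self.crop_candidate(current_frame, last_position)
        feature_map, region_info = self.fcn_features(region)
        candidates = self.split_candidates(feature_map, region_info)
        return self.pick_best(candidates)
```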
In some optional implementations, the apparatus for target tracking in video of this embodiment may further include a training unit (not shown).
In these optional implementations, the training unit is configured to, before the feature acquiring unit inputs the cropped candidate region into the pre-trained fully convolutional network: establish an initial fully convolutional network; obtain a training sample set, where the training sample set includes multiple training sample pairs, each including two frames of images from the same video file and annotation information for marking the region occupied by the target object in the two frames; and input the training sample set into the initial fully convolutional network and train it based on a preset loss function to obtain the trained fully convolutional network.
In some optional implementations, the target tracking unit 704 may be further configured to: input each candidate target region cropped from the feature map into a preset pooling layer to obtain a candidate feature map corresponding to each candidate target region; for each candidate feature map, compute the similarity between that candidate feature map and a pre-obtained feature map of the target to be tracked; and take the candidate target region corresponding to the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the target to be tracked in the current frame.
In some optional implementations, the apparatus for target tracking in video of this embodiment may further include a determination unit (not shown).
In these optional implementations, the determination unit is configured to, after the target tracking unit takes the candidate target region corresponding to the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the target to be tracked in the current frame, take that highest-similarity candidate feature map as the feature map of the target to be tracked.
In some optional implementations, the apparatus for target tracking in video of this embodiment may further include a detection unit (not shown) and an updating unit (not shown).
In these optional implementations, the detection unit is configured to detect the target to be tracked in the current frame of the video at preset time intervals.
The updating unit is configured to update the feature map of the target to be tracked based on the detected target to be tracked.
In some optional implementations, the historical frame of the video and the current frame of the video are two adjacent frames of the video.
Referring now to Fig. 8, it shows a structural schematic diagram of a computer system 800 suitable for implementing a terminal device/server of the embodiments of the present application. The terminal device/server shown in Fig. 8 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 8, the computer system 800 includes a central processing unit (CPU) 801, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage portion 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage portion 808 including a hard disk and the like; and a communications portion 809 including a network interface card such as a LAN card, a modem, and the like. The communications portion 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it may be installed into the storage portion 808 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which comprises a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communications portion 809, and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above-mentioned functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
Computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that shown in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be disposed in a processor, which may, for example, be described as: a processor comprising an interception unit, a feature acquiring unit, a candidate target region determination unit, and a target tracking unit. The names of these units do not, in some cases, limit the units themselves; for example, the interception unit may also be described as "a unit that crops a candidate region from the current frame of a video based on the position of the target to be tracked in a historical frame of the video".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: crop a candidate region from the current frame of a video based on the position of the target to be tracked in a historical frame of the video; input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, where the feature map includes candidate target region information indicating the positions of candidate targets in the feature map; determine, from the feature map and based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets; and take, among the determined candidate target regions, the candidate target region with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features; it should also cover, without departing from the above inventive concept, other technical solutions formed by any combination of the above technical features or their equivalents, for example, technical solutions in which the above features are replaced with (but not limited to) technical features of similar function disclosed in the present application.
Claims (14)
1. A method for target tracking in video, comprising:
cropping a candidate region from a current frame of a video based on the position of a target to be tracked in a historical frame of the video;
inputting the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, wherein the feature map includes candidate target region information indicating the positions of candidate targets in the feature map;
determining, from the feature map and based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets; and
taking, among the determined candidate target regions, the candidate target region with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
2. The method according to claim 1, wherein, before the step of inputting the cropped candidate region into the pre-trained fully convolutional network, the method further comprises a step of training the fully convolutional network, the training comprising:
establishing an initial fully convolutional network;
obtaining a training sample set, the training sample set including multiple training sample pairs, each training sample pair including two frames of images from the same video file and annotation information for marking the region occupied by a target object in the two frames; and
inputting the training sample set into the initial fully convolutional network and training the initial fully convolutional network based on a preset loss function to obtain the trained fully convolutional network.
3. The method according to claim 1, wherein taking, among the determined candidate target regions, the candidate target region with the highest similarity to the target to be tracked as the target to be tracked in the current frame comprises:
inputting each candidate target region cropped from the feature map into a preset pooling layer to obtain a candidate feature map corresponding to each candidate target region;
for each candidate feature map, computing the similarity between the candidate feature map and a pre-obtained feature map of the target to be tracked; and
taking the candidate target region corresponding to the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the target to be tracked in the current frame.
4. The method according to claim 3, wherein, after taking the candidate target region corresponding to the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the target to be tracked in the current frame, the method further comprises:
taking the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the feature map of the target to be tracked.
5. The method according to claim 4, wherein the method further comprises:
detecting the target to be tracked in the current frame of the video at preset time intervals; and
updating the feature map of the target to be tracked based on the detected target to be tracked.
6. The method according to any one of claims 1-5, wherein the historical frame of the video and the current frame of the video are two adjacent frames of the video.
7. An apparatus for target tracking in video, comprising:
an interception unit, configured to crop a candidate region from a current frame of a video based on the position of a target to be tracked in a historical frame of the video;
a feature acquiring unit, configured to input the cropped candidate region into a pre-trained fully convolutional network to obtain a feature map, wherein the feature map includes candidate target region information indicating the positions of candidate targets in the feature map;
a candidate target region determination unit, configured to determine, from the feature map and based on the candidate target region information, candidate target regions in one-to-one correspondence with the candidate targets; and
a target tracking unit, configured to take, among the determined candidate target regions, the candidate target region with the highest similarity to the target to be tracked as the target to be tracked in the current frame.
8. The apparatus according to claim 7, wherein the apparatus further comprises a training unit, the training unit configured to, before the feature acquiring unit inputs the cropped candidate region into the pre-trained fully convolutional network:
establish an initial fully convolutional network;
obtain a training sample set, the training sample set including multiple training sample pairs, each training sample pair including two frames of images from the same video file and annotation information for marking the region occupied by a target object in the two frames; and
input the training sample set into the initial fully convolutional network and train the initial fully convolutional network based on a preset loss function to obtain the trained fully convolutional network.
9. The apparatus according to claim 7, wherein the target tracking unit is further configured to:
input each candidate target region cropped from the feature map into a preset pooling layer to obtain a candidate feature map corresponding to each candidate target region;
for each candidate feature map, compute the similarity between the candidate feature map and a pre-obtained feature map of the target to be tracked; and
take the candidate target region corresponding to the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the target to be tracked in the current frame.
10. The apparatus according to claim 9, wherein the apparatus further comprises a determination unit;
the determination unit is configured to, after the target tracking unit takes the candidate target region corresponding to the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the target to be tracked in the current frame, take the candidate feature map with the highest similarity to the pre-obtained feature map of the target to be tracked as the feature map of the target to be tracked.
11. The apparatus according to claim 10, wherein the apparatus further comprises:
a detection unit, configured to detect the target to be tracked in the current frame of the video at preset time intervals; and
an updating unit, configured to update the feature map of the target to be tracked based on the detected target to be tracked.
12. The apparatus according to any one of claims 7-11, wherein the historical frame of the video and the current frame of the video are two adjacent frames of the video.
13. A device, comprising:
one or more processors; and
a storage apparatus for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-6.
14. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810276460.5A CN108491816A (en) | 2018-03-30 | 2018-03-30 | The method and apparatus for carrying out target following in video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108491816A true CN108491816A (en) | 2018-09-04 |
Family
ID=63317744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810276460.5A Pending CN108491816A (en) | 2018-03-30 | 2018-03-30 | The method and apparatus for carrying out target following in video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108491816A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492579A (en) * | 2018-11-08 | 2019-03-19 | 广东工业大学 | A kind of video object detection method and system based on ST-SIN |
CN110084835A (en) * | 2019-06-06 | 2019-08-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for handling video |
CN110211158A (en) * | 2019-06-04 | 2019-09-06 | 海信集团有限公司 | Candidate region determines method, apparatus and storage medium |
CN110472728A (en) * | 2019-07-30 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Target information determines method, target information determining device, medium and electronic equipment |
CN110490902A (en) * | 2019-08-02 | 2019-11-22 | 西安天和防务技术股份有限公司 | Method for tracking target, device, computer equipment applied to smart city |
CN110930434A (en) * | 2019-11-21 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Target object tracking method and device, storage medium and computer equipment |
CN110955259A (en) * | 2019-11-28 | 2020-04-03 | 上海歌尔泰克机器人有限公司 | Unmanned aerial vehicle, tracking method thereof and computer-readable storage medium |
WO2020093724A1 (en) * | 2018-11-06 | 2020-05-14 | 北京字节跳动网络技术有限公司 | Method and device for generating information |
CN111275741A (en) * | 2020-01-19 | 2020-06-12 | 北京迈格威科技有限公司 | Target tracking method and device, computer equipment and storage medium |
CN111368101A (en) * | 2020-03-05 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Multimedia resource information display method, device, equipment and storage medium |
CN111402294A (en) * | 2020-03-10 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Target tracking method, target tracking device, computer-readable storage medium and computer equipment |
CN111428535A (en) * | 2019-01-09 | 2020-07-17 | 佳能株式会社 | Image processing apparatus and method, and image processing system |
CN111524165A (en) * | 2020-04-22 | 2020-08-11 | 北京百度网讯科技有限公司 | Target tracking method and device |
CN111539991A (en) * | 2020-04-28 | 2020-08-14 | 北京市商汤科技开发有限公司 | Target tracking method and device and storage medium |
CN112241670A (en) * | 2019-07-18 | 2021-01-19 | 杭州海康威视数字技术股份有限公司 | Image processing method and device |
CN112347817A (en) * | 2019-08-08 | 2021-02-09 | 初速度(苏州)科技有限公司 | Video target detection and tracking method and device |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101770648A (en) * | 2009-01-06 | 2010-07-07 | 北京中星微电子有限公司 | Video monitoring based loitering system and method thereof |
CN101867798A (en) * | 2010-05-18 | 2010-10-20 | 武汉大学 | Mean shift moving object tracking method based on compressed domain analysis |
CN102881022A (en) * | 2012-07-20 | 2013-01-16 | 西安电子科技大学 | Concealed-target tracking method based on on-line learning |
CN104484889A (en) * | 2014-12-15 | 2015-04-01 | 三峡大学 | Target tracking method and device |
CN104794733A (en) * | 2014-01-20 | 2015-07-22 | 株式会社理光 | Object tracking method and device |
CN105335986A (en) * | 2015-09-10 | 2016-02-17 | 西安电子科技大学 | Characteristic matching and MeanShift algorithm-based target tracking method |
CN105730336A (en) * | 2014-12-10 | 2016-07-06 | 比亚迪股份有限公司 | Reverse driving assistant and vehicle |
CN106097391A (en) * | 2016-06-13 | 2016-11-09 | 浙江工商大学 | A kind of multi-object tracking method identifying auxiliary based on deep neural network |
WO2017015947A1 (en) * | 2015-07-30 | 2017-02-02 | Xiaogang Wang | A system and a method for object tracking |
CN106650630A (en) * | 2016-11-11 | 2017-05-10 | 纳恩博(北京)科技有限公司 | Target tracking method and electronic equipment |
CN106709936A (en) * | 2016-12-14 | 2017-05-24 | 北京工业大学 | Single target tracking method based on convolution neural network |
CN106909885A (en) * | 2017-01-19 | 2017-06-30 | 博康智能信息技术有限公司上海分公司 | A kind of method for tracking target and device based on target candidate |
CN107330920A (en) * | 2017-06-28 | 2017-11-07 | 华中科技大学 | A kind of monitor video multi-target tracking method based on deep learning |
CN107423707A (en) * | 2017-07-25 | 2017-12-01 | 深圳帕罗人工智能科技有限公司 | A kind of face Emotion identification method based under complex environment |
2018-03-30: Application CN201810276460.5A filed (CN); published as CN108491816A, status Pending
Non-Patent Citations (1)
Title |
---|
Shaoqing Ren et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence * |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020093724A1 (en) * | 2018-11-06 | 2020-05-14 | 北京字节跳动网络技术有限公司 | Method and device for generating information |
CN109492579A (en) * | 2018-11-08 | 2019-03-19 | 广东工业大学 | ST-SIN-based video object detection method and system |
CN109492579B (en) * | 2018-11-08 | 2022-05-10 | 广东工业大学 | ST-SIN-based video object detection method and system |
CN111428535A (en) * | 2019-01-09 | 2020-07-17 | 佳能株式会社 | Image processing apparatus and method, and image processing system |
CN110211158B (en) * | 2019-06-04 | 2023-03-28 | 海信集团有限公司 | Candidate area determination method, device and storage medium |
CN110211158A (en) * | 2019-06-04 | 2019-09-06 | 海信集团有限公司 | Candidate region determination method, device and storage medium |
CN110084835A (en) * | 2019-06-06 | 2019-08-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for processing video |
CN110084835B (en) * | 2019-06-06 | 2020-08-21 | 北京字节跳动网络技术有限公司 | Method and apparatus for processing video |
CN112241670B (en) * | 2019-07-18 | 2024-03-01 | 杭州海康威视数字技术股份有限公司 | Image processing method and device |
CN112241670A (en) * | 2019-07-18 | 2021-01-19 | 杭州海康威视数字技术股份有限公司 | Image processing method and device |
CN110472728A (en) * | 2019-07-30 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Target information determination method and device, medium and electronic device |
CN110490902A (en) * | 2019-08-02 | 2019-11-22 | 西安天和防务技术股份有限公司 | Target tracking method, device and computer equipment for smart cities |
CN112347817A (en) * | 2019-08-08 | 2021-02-09 | 初速度(苏州)科技有限公司 | Video target detection and tracking method and device |
CN112347817B (en) * | 2019-08-08 | 2022-05-17 | 魔门塔(苏州)科技有限公司 | Video target detection and tracking method and device |
CN110930434A (en) * | 2019-11-21 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Target object tracking method and device, storage medium and computer equipment |
CN110930434B (en) * | 2019-11-21 | 2023-05-12 | 腾讯科技(深圳)有限公司 | Target object tracking method, device, storage medium and computer equipment |
CN110955259B (en) * | 2019-11-28 | 2023-08-29 | 上海歌尔泰克机器人有限公司 | Unmanned aerial vehicle, tracking method thereof and computer readable storage medium |
CN110955259A (en) * | 2019-11-28 | 2020-04-03 | 上海歌尔泰克机器人有限公司 | Unmanned aerial vehicle, tracking method thereof and computer-readable storage medium |
CN111275741B (en) * | 2020-01-19 | 2023-09-08 | 北京迈格威科技有限公司 | Target tracking method, device, computer equipment and storage medium |
CN111275741A (en) * | 2020-01-19 | 2020-06-12 | 北京迈格威科技有限公司 | Target tracking method and device, computer equipment and storage medium |
CN111368101A (en) * | 2020-03-05 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Multimedia resource information display method, device, equipment and storage medium |
CN111368101B (en) * | 2020-03-05 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Multimedia resource information display method, device, equipment and storage medium |
CN111402294A (en) * | 2020-03-10 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Target tracking method, target tracking device, computer-readable storage medium and computer equipment |
CN111402294B (en) * | 2020-03-10 | 2022-10-18 | 腾讯科技(深圳)有限公司 | Target tracking method, target tracking device, computer-readable storage medium and computer equipment |
CN111524165B (en) * | 2020-04-22 | 2023-08-25 | 北京百度网讯科技有限公司 | Target tracking method and device |
CN111524165A (en) * | 2020-04-22 | 2020-08-11 | 北京百度网讯科技有限公司 | Target tracking method and device |
CN111539991A (en) * | 2020-04-28 | 2020-08-14 | 北京市商汤科技开发有限公司 | Target tracking method and device and storage medium |
CN111539991B (en) * | 2020-04-28 | 2023-10-20 | 北京市商汤科技开发有限公司 | Target tracking method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108491816A (en) | Method and apparatus for target tracking in video | |
CN108846440B (en) | Image processing method and device, computer readable medium and electronic equipment | |
CN108898086A (en) | Video image processing method and device, computer-readable medium and electronic device | |
CN110399848A (en) | Video cover generation method, device and electronic equipment | |
CN108830235A (en) | Method and apparatus for generating information | |
CN110381368A (en) | Video cover generation method, device and electronic equipment | |
CN112101305B (en) | Multi-path image processing method and device and electronic equipment | |
CN111091166B (en) | Image processing model training method, image processing device, and storage medium | |
CN110033423B (en) | Method and apparatus for processing image | |
CN110059623B (en) | Method and apparatus for generating information | |
CN109377508A (en) | Image processing method and device | |
CN112073748A (en) | Panoramic video processing method and device and storage medium | |
CN110035236A (en) | Image processing method, device and electronic equipment | |
CN109118456A (en) | Image processing method and device | |
CN108446658A (en) | Method and apparatus for recognizing facial images | |
CN109300139A (en) | Method for detecting lane lines and device | |
CN108595211A (en) | Method and apparatus for outputting data | |
CN110288037A (en) | Image processing method, device and electronic equipment | |
CN110287350A (en) | Image search method, device and electronic equipment | |
CN111310595B (en) | Method and device for generating information | |
CN109446379A (en) | Method and apparatus for handling information | |
CN104541304A (en) | Target object angle determination using multiple cameras | |
CN111598923B (en) | Target tracking method and device, computer equipment and storage medium | |
CN108595011A (en) | Information displaying method, device, storage medium and electronic equipment | |
CN110378936B (en) | Optical flow calculation method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-09-04 |