A method for multi-face tracking of competitors in sports video
Technical field:
The invention belongs to the field of video processing and computer vision, and specifically relates to a method for multi-face tracking of competing athletes in sports video.
Background art:
Multi-target tracking refers to locating and tracking multiple targets of interest in a video sequence and inferring the trajectory of each target. As an important topic in computer vision, multi-target tracking is valuable in areas such as video surveillance, target recognition, and video information discovery.
Multi-face tracking in sports video refers to locating the face of each competitor in the video, tracking all of these faces simultaneously, and finally generating a face trajectory for each competitor. As a basic technology, multi-face tracking in sports video can be applied to higher-level tasks such as athlete identification and sports video content analysis, and therefore has significant commercial value.
Compared with multi-target tracking in surveillance video, multi-target tracking in sports video is more challenging. First, a sports video is produced by splicing together shots of the playing field taken from different angles by multiple cameras, and two adjacent shots may be joined by an abrupt cut or a gradual transition. Second, the same competitor exhibits complex changes in pose, illumination, and scale across different shots, which makes face tracking very difficult. Finally, sports video contains face targets with similar appearance, which further increases the difficulty of multi-face tracking.
Existing patents relating to sports video do not include a method for tracking the face of each competitor. The present invention fills this gap by accurately locating and tracking multiple faces in a video and generating a face trajectory for each athlete.
Summary of the invention:
To overcome the deficiencies of the prior art, the invention provides a method for multi-face tracking of competitors in sports video. The method can reliably locate and track the faces of multiple competitors in a video at the same time and generate accurate face trajectories.
To achieve the above purpose, the present invention adopts the following technical solution:
A method for multi-face tracking of competitors in sports video, comprising the following steps:
1) on an offline face dataset containing no fewer than 3000 different face classes, pre-train a convolutional neural network model for face recognition using a supervised method;
2) detect the shot changes in the video, divide the input video into non-overlapping shots, and select all close-up shots;
3) in each close-up shot, apply a face detector to every frame to obtain face detections;
4) in each close-up shot, link highly similar face detections in adjacent frames into tracklets;
5) from the obtained tracklets, generate positive and negative training samples according to spatio-temporal constraints;
6) with the obtained positive and negative training samples as input, fine-tune the convolutional neural network pre-trained in step 1) using a Siamese or Triplet network, thereby learning online face features that are more discriminative and adaptive;
7) use the fine-tuned convolutional neural network to extract the face feature of every image in each tracklet;
8) associate all tracklets hierarchically to generate the final face trajectories.
A further improvement of the present invention is that, in step 1), the convolutional neural network has the structure input layer, convolution and sampling layers, output layer. The input layer takes the input face image, the convolution and sampling layers perform convolution and max pooling, and each neuron of the output layer corresponds to one face class.
A further improvement of the present invention is that, in step 5), a positive training sample consists of two face images from the same tracklet, and a negative training sample consists of two face images taken from two different tracklets, where the two tracklets occur simultaneously in some frame;
The positive and negative training samples are combined into groups of three (triplets): two face images come from the same tracklet and the third face image comes from another tracklet, where the two tracklets occur simultaneously in some frame.
A further improvement of the present invention is that, in step 6), the Siamese network consists of two convolutional neural networks with identical structure and shared weights, takes two face images as input, and uses a contrastive loss function;
The Triplet network consists of three convolutional neural networks with identical structure and shared weights, takes a triplet as input, and uses a Triplet loss function.
A further improvement of the present invention is that, in step 8), the face tracklets are associated in two steps. In the first step, within each shot, a multi-target tracking method associates tracklets according to the motion information of the targets and the learned discriminative face features. In the second step, tracklets from different shots are associated by hierarchical agglomerative clustering using only the learned face features, generating the final face trajectories.
Compared with the prior art, the present invention has the following advantages:
The face-recognition-based multi-target tracking method of the present invention collects training samples online from the video to be tracked and uses them to fine-tune the pre-trained face convolutional neural network. It thereby learns more discriminative face features online and uses these features to perform more effective multi-face tracking.
Description of the drawings:
Fig. 1 is a flow chart of the present invention.
Detailed description of the invention:
The present invention is described in further detail below with reference to the accompanying drawing:
With reference to Fig. 1, the face-recognition-based method for multi-target tracking in sports video according to the present invention comprises the following steps:
1) On an offline face dataset containing a large number of face classes, a convolutional neural network model for face recognition is pre-trained using a supervised method. The structure of the convolutional neural network is "input layer, convolution and sampling layers, output layer": the input layer takes the input face image, the convolution and sampling layers perform convolution and max pooling, and each neuron of the output layer corresponds to one face class.
2) The shot changes in the video are detected, and the input video is divided into non-overlapping shots. All close-up shots are then selected according to the proportion of the whole image occupied by the face and the relationship between the face and reference objects of the playing field (such as grass or court lines).
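The shot segmentation in step 2) can be illustrated with a minimal sketch. The text does not specify how shot changes are detected, so the gray-level histogram difference used here is an assumed, commonly used stand-in, and the names `gray_histogram`, `split_into_shots`, and the `threshold` value are hypothetical. Frames are represented as flat sequences of 8-bit gray values to keep the example self-contained.

```python
from typing import List, Sequence, Tuple


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized histogram of 8-bit gray levels for one flattened frame."""
    hist = [0.0] * bins
    for value in frame:
        hist[min(value * bins // 256, bins - 1)] += 1.0
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]


def split_into_shots(
    frames: Sequence[Sequence[int]], threshold: float = 0.5, bins: int = 16
) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) frame ranges, end exclusive.

    Assumption: a shot change is declared when the histogram difference
    between consecutive frames exceeds `threshold` (hypothetical value).
    """
    if not frames:
        return []
    shots: List[Tuple[int, int]] = []
    start = 0
    prev = gray_histogram(frames[0], bins)
    for idx in range(1, len(frames)):
        cur = gray_histogram(frames[idx], bins)
        # Half the L1 distance between normalized histograms lies in [0, 1].
        diff = 0.5 * sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            shots.append((start, idx))
            start = idx
        prev = cur
    shots.append((start, len(frames)))
    return shots
```

A frame-to-frame comparison of this kind responds to abrupt cuts; gradual transitions, which the background section also mentions, would likely need a comparison over a longer window.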
3) In each close-up shot, a publicly available face detector is applied to every frame to obtain face detections.
4) In each close-up shot, highly similar face detections in adjacent frames are linked into tracklets.
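A minimal sketch of the tracklet linking in step 4) follows. The text only requires that linked detections have high similarity; this sketch assumes bounding-box overlap (IoU) as the similarity measure and a greedy frame-to-frame match, and the names `iou`, `link_tracklets`, and `min_iou` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_tracklets(
    detections: List[List[Box]], min_iou: float = 0.5
) -> List[List[Tuple[int, Box]]]:
    """Greedily link detections in consecutive frames into tracklets.

    `detections[t]` holds the face boxes found in frame t of one shot.
    Each tracklet is a list of (frame index, box) entries.
    """
    finished: List[List[Tuple[int, Box]]] = []
    active: List[List[Tuple[int, Box]]] = []  # tracklets alive in the previous frame
    for frame_idx, boxes in enumerate(detections):
        unmatched = list(boxes)
        still_active = []
        for track in active:
            last_box = track[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                track.append((frame_idx, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no similar detection: tracklet ends
        # Every detection left unmatched starts a new tracklet.
        still_active.extend([[(frame_idx, b)] for b in unmatched])
        active = still_active
    return finished + active
```

Keeping the threshold conservative favours short but pure tracklets, which matters because the next step treats all faces in one tracklet as the same person.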
5) From the obtained tracklets, positive and negative training samples are generated according to spatio-temporal constraints.
A positive training sample consists of two face images from the same tracklet. A negative training sample consists of two face images taken from two different tracklets, where the two tracklets occur simultaneously in some frame. Let $T^i = \{x^i_1, \ldots, x^i_{n_i}\}$ denote a tracklet of length $n_i$, where each $x$ denotes a face detection; the positive training samples are then $P = \{(x^i_k, x^i_l) : k \neq l,\ k, l = 1, \ldots, n_i\}$. If $T^i$ and $T^j$ denote two different tracklets that occur in the same frame, the negative training samples are $N = \{(x^i_k, x^j_l) : k = 1, \ldots, n_i,\ l = 1, \ldots, n_j\}$.
The positive and negative training samples can further be combined into groups of three (triplets): two face images come from the same tracklet and the third face image comes from another tracklet, where the two tracklets occur simultaneously in some frame. Let $T^i$ and $T^j$ denote two different tracklets that occur in the same frame; a training sample $s = (x^i_k, x^i_l, x^j_m)$ can then be generated from $T^i$ and $T^j$, with $x^i_k, x^i_l \in T^i$ and $x^j_m \in T^j$.
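The sample generation in step 5) can be sketched as follows, assuming each tracklet is stored as a mapping from frame index to a face image identifier. This representation and the function names are hypothetical and chosen only to make the spatio-temporal constraint explicit: two tracklets that share a frame index must belong to different people.

```python
from itertools import combinations, product
from typing import Dict, List, Sequence, Tuple

Tracklet = Dict[int, str]  # frame index -> face image identifier (assumed layout)


def co_occur(a: Tracklet, b: Tracklet) -> bool:
    """True if the two tracklets appear together in at least one frame."""
    return not set(a).isdisjoint(b)


def positive_pairs(tracklets: Sequence[Tracklet]) -> List[Tuple[str, str]]:
    """All pairs of distinct faces drawn from the same tracklet."""
    return [pair for t in tracklets for pair in combinations(t.values(), 2)]


def negative_pairs(tracklets: Sequence[Tracklet]) -> List[Tuple[str, str]]:
    """All cross pairs of faces from two tracklets that co-occur in a frame."""
    pairs: List[Tuple[str, str]] = []
    for a, b in combinations(tracklets, 2):
        if co_occur(a, b):
            pairs.extend(product(a.values(), b.values()))
    return pairs


def triplets(tracklets: Sequence[Tracklet]) -> List[Tuple[str, str, str]]:
    """(face, same-tracklet face, other-tracklet face) groups of three."""
    out: List[Tuple[str, str, str]] = []
    for a, b in combinations(tracklets, 2):
        if co_occur(a, b):
            # Each tracklet of the pair supplies the same-person pair in turn.
            for same, other in ((a, b), (b, a)):
                for first, second in combinations(same.values(), 2):
                    out.extend((first, second, neg) for neg in other.values())
    return out
```

Tracklets that never share a frame contribute positive pairs only, since nothing in the video rules out that they show the same person.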
6) With the obtained training samples as input, the convolutional neural network pre-trained in step 1) is fine-tuned using a Siamese or Triplet network, thereby learning online face features that are more discriminative and adaptive.
The Siamese network consists of two convolutional neural networks with identical structure and shared weights, takes two face images as input, and uses a contrastive loss function. In the Siamese network, the extraction of a face feature can be expressed as $f(x) = \mathrm{Conv}(x; W)$, where $\mathrm{Conv}(\cdot)$ is the mapping function, $x \in \mathbb{R}^{227 \times 227 \times 3}$ is the input face image, and $f(x)$ is the extracted feature vector. Let $x_1, x_2$ denote two training images; then $D(f(x_1), f(x_2)) = \|f(x_1) - f(x_2)\|_2^2$ denotes the distance between the two image feature vectors. During training, the following contrastive loss function is used to reduce the distance between two images of the same target while increasing the distance between images of two different targets:
$$L(x_1, x_2, y) = \frac{1}{2}\left( y\,D + (1 - y)\max(0,\ \tau - D) \right)$$
where $D = D(f(x_1), f(x_2))$ and $\tau$ is the margin. $y = 1$ indicates that the two images come from the same target, and $y = 0$ indicates that the two images come from different targets.
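As a small numeric illustration of the contrastive loss, the sketch below assumes the common form $L = \frac{1}{2}\left(y\,D + (1 - y)\max(0, \tau - D)\right)$ with $D$ the squared Euclidean distance between the two feature vectors. The function names and the default margin are hypothetical, and the features are plain Python lists rather than network outputs.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(
    f1: Sequence[float], f2: Sequence[float], y: int, tau: float = 1.0
) -> float:
    """Contrastive loss for one pair (assumed form, see lead-in).

    y = 1: same target, the loss grows with the distance.
    y = 0: different targets, the loss is positive only while the
           distance is still smaller than the margin tau.
    """
    d = squared_distance(f1, f2)
    return 0.5 * (y * d + (1 - y) * max(0.0, tau - d))
```

For a same-target pair the loss vanishes only when the features coincide, while a different-target pair stops contributing once its distance exceeds the margin, so training effort concentrates on pairs that are still confusable.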
The Triplet network consists of three convolutional neural networks with identical structure and shared weights, takes a triplet as input, and uses a Triplet loss function. During training, for an input sample $s = (x^i_k, x^i_l, x^j_m)$, the distance between the positive pair $(x^i_k, x^i_l)$ must be smaller than the distance between the negative pair $(x^i_k, x^j_m)$. The loss function of the Triplet network is:
$$L(s) = \frac{1}{2}\max\left(0,\ D(f(x^i_k), f(x^i_l)) - D(f(x^i_k), f(x^j_m)) + \alpha\right)$$
where $\alpha$ is the distance margin.
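The Triplet loss can be illustrated in the same way. The sketch assumes the common hinge form $L = \frac{1}{2}\max(0, D_{pos} - D_{neg} + \alpha)$ with squared Euclidean distances; the function names and the default margin are hypothetical.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    first: Sequence[float],
    same: Sequence[float],
    other: Sequence[float],
    alpha: float = 1.0,
) -> float:
    """Triplet loss for one sample (assumed form, see lead-in).

    `first` and `same` are features of two faces from one tracklet,
    `other` is the feature of a face from a co-occurring tracklet.
    """
    d_pos = squared_distance(first, same)
    d_neg = squared_distance(first, other)
    # Zero once the negative pair is farther than the positive pair by alpha.
    return 0.5 * max(0.0, d_pos - d_neg + alpha)
```

Unlike the contrastive loss, this objective constrains only the relative ordering of the two distances, not their absolute values.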
7) The fine-tuned convolutional neural network is used to extract the face feature of every face image in each tracklet.
8) All tracklets are associated hierarchically to generate the final face trajectories. The association is performed in two steps. In the first step, within each shot, a conventional multi-target tracking method associates tracklets according to the motion information of the targets and the learned discriminative face features. In the second step, tracklets from different shots are associated by hierarchical agglomerative clustering using only the learned face features, generating the final face trajectories.
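The second association step can be sketched as hierarchical agglomerative clustering over one feature vector per tracklet. The text does not state the linkage criterion or the stopping rule, so average linkage and a distance threshold `max_distance` are assumptions here, as is the idea of summarizing each tracklet by a single vector (for example the mean of its face features). All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def average_linkage(
    c1: List[int], c2: List[int], feats: Sequence[Sequence[float]]
) -> float:
    """Mean pairwise distance between the members of two clusters."""
    total = sum(sq_dist(feats[i], feats[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(
    feats: Sequence[Sequence[float]], max_distance: float
) -> List[List[int]]:
    """Merge the closest clusters until none is closer than max_distance.

    feats[i] is the feature of tracklet i; each returned cluster lists the
    tracklet indices assigned to one face trajectory.
    """
    clusters: List[List[int]] = [[i] for i in range(len(feats))]
    while len(clusters) > 1:
        best: Tuple[float, int, int] = (float("inf"), -1, -1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], feats)
                if d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break  # remaining clusters are treated as different people
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

For example, `agglomerative_cluster(feats, max_distance=1.0)` groups tracklets whose features lie within the chosen threshold. A fuller version would also forbid merging tracklets that co-occur in a frame, in line with the constraint used to build negative samples.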