CN106022220A - Method for performing multi-face tracking on participating athletes in sports video

Info

Publication number: CN106022220A
Application number: CN201610301411.3A
Authority: CN
Inventors: 王进军; 张顺; 姜思宇
Original assignee: Xi'an Brision Information Technology Co Ltd
Current assignee: Beijing Hippo energy Sports Technology Co., Ltd.
Priority date: 2016-05-09
Filing date: 2016-05-09
Publication date: 2016-10-12
Anticipated expiration: 2036-05-09
Also published as: CN106022220B

Abstract

The invention discloses a method for multi-face tracking of competing athletes in sports video. The method comprises the following steps: pre-training a convolutional neural network for face recognition; segmenting the input video into shots and selecting all close-up shots; performing face detection on each frame of a close-up shot to obtain face detection responses; associating the face detection responses into tracklets; generating training samples according to the spatio-temporal constraints between tracklets; fine-tuning the pre-trained convolutional neural network with a Siamese or Triplet network, taking the obtained training samples as input; extracting the features of each face image with the fine-tuned convolutional neural network; and associating all tracklets hierarchically to generate the face trajectories. With this method, training samples are collected online from the video to be tracked, the pre-trained convolutional neural network is fine-tuned, more discriminative face features are learned online, and these features are used to perform more effective multi-face tracking.

Description

A method for multi-face tracking of competing athletes in sports video

Technical field:

The invention belongs to the field of video processing and computer vision, and specifically relates to a method for multi-face tracking of competing athletes in sports video.

Background art:

Multi-target tracking refers to locating and following multiple targets of interest in a video sequence and inferring the trajectory of each target. As an important topic in computer vision, multi-target tracking has significant value in video surveillance, target recognition, video information discovery, and similar applications.

Multi-face tracking in sports video refers to locating the face of every competing athlete in the video, tracking all of them simultaneously, and finally generating the face trajectory of each athlete. As a basic technology, multi-face tracking in sports video can be applied to higher-level tasks such as athlete identification and sports video content analysis, and has considerable commercial value.

Compared with multi-target tracking in surveillance video, multi-target tracking in sports video is more challenging. First, a sports video is spliced together from shots of the competition venue taken by multiple cameras at different angles, and two adjacent shots may be joined by an abrupt cut or a gradual transition. Second, the same athlete undergoes complex changes in pose, illumination, and scale across different shots, which makes face tracking very difficult. Finally, sports video contains face targets with similar appearance, which adds further difficulty to multi-face tracking.

Existing patents related to sports video do not provide a method for tracking the face of each competing athlete. The present invention fills this gap by accurately locating and tracking multiple faces in the video and generating a face trajectory for each athlete.

Summary of the invention:

To overcome the deficiencies of the prior art, the invention provides a method for multi-face tracking of competing athletes in sports video. The method can reliably locate and track the faces of multiple athletes in a video at the same time and generate accurate face trajectories.

To achieve the above purpose, the invention adopts the following technical solution:

A method for multi-face tracking of competing athletes in sports video, comprising the following steps:

1) on an offline face dataset containing no fewer than 3000 different face identities, pre-training a convolutional neural network model for face recognition using a supervised method;

2) dividing the input video into non-overlapping shots by detecting the shot changes in the video, and selecting all close-up shots;

3) in each close-up shot, applying a face detector to every frame to obtain face detection responses;

4) in each close-up shot, associating face detection responses with high similarity in several adjacent frames into tracklets;

5) generating positive and negative training samples from the obtained tracklets according to spatio-temporal constraints;

6) taking the obtained positive and negative training samples as input, fine-tuning the convolutional neural network pre-trained in step 1) with a Siamese or Triplet network, so as to learn more discriminative and adaptive face features online;

7) extracting the face features of each image in each tracklet with the fine-tuned convolutional neural network;

8) associating all tracklets hierarchically to generate the final face trajectories.

A further improvement of the invention is that, in step 1), the structure of the convolutional neural network is input layer - convolution and sampling layers - output layer; the input layer receives the input face image, the convolution and sampling layers comprise convolution and max pooling operations, and each neuron of the output layer corresponds to one face identity.

A further improvement of the invention is that, in step 5), a positive training sample is a pair of face images from the same tracklet, and a negative training sample is a pair of face images taken from two different tracklets, where the two tracklets appear together in some frame;

Positive and negative training samples are combined into groups of three: two face images come from the same tracklet and the third face image comes from another tracklet, where the two tracklets appear together in some frame.

A further improvement of the invention is that, in step 6), the Siamese network consists of two convolutional neural networks with identical structure and shared weights, takes two face images as input, and uses the contrastive loss function;

The Triplet network consists of three convolutional neural networks with identical structure and shared weights, takes groups of three images as input, and uses the Triplet loss function.

A further improvement of the invention is that, in step 8), the face tracklets are associated in two stages. In the first stage, within each shot, a multi-target tracking method associates tracklets according to the motion information of the targets and the learned discriminative face features. In the second stage, using only the learned face features, hierarchical agglomerative clustering associates the tracklets across different shots to generate the final face target trajectories.

Compared with the prior art, the invention has the following advantages:

The face-recognition-based multi-target tracking method of the invention collects training samples online from the video to be tracked and fine-tunes the pre-trained face convolutional neural network, thereby learning more discriminative face features online, and then uses these features to perform more effective multi-face tracking.

Brief description of the drawings:

Fig. 1 is a schematic flow chart of the invention.

Detailed description of the invention:

The invention is described in further detail below with reference to the accompanying drawing:

With reference to Fig. 1, the face-recognition-based method for multi-target tracking in sports video according to the invention comprises the following steps:

1) On an offline face dataset containing a large number of face identities, a convolutional neural network model for face recognition is pre-trained using a supervised method. The structure of the convolutional neural network is "input layer - convolution and sampling layers - output layer": the input layer receives the input face image, the convolution and sampling layers comprise convolution and max pooling operations, and each neuron of the output layer corresponds to one face identity.
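The two operations named for the convolution and sampling layers can be illustrated with a minimal pure-Python sketch. The image and kernel values, the 2×2 pooling window, and the function names `conv2d_valid` and `max_pool` are illustrative assumptions; the source does not specify layer counts, kernel sizes, or strides.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is what CNN libraries usually call 'convolution')."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window (assumed size)."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in cols
        ]
        for r in rows
    ]
```

In a full network these two operations would be stacked and followed by an output layer with one neuron per face identity, as the step describes.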

2) By detecting the shot changes in the video, the input video is divided into non-overlapping shots. All close-up shots are then selected according to the proportion of the whole image occupied by the face and the relation between the face and reference objects of the competition venue (such as grass or court lines).
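The source does not say how shot changes are detected or which thresholds select close-up shots. The following sketch assumes a simple gray-level histogram difference between consecutive frames for cut detection and uses only the face-area ratio as the close-up cue (the venue reference objects are not modeled). The names `split_into_shots`, `is_close_shot`, `cut_threshold`, and `min_ratio` are hypothetical.

```python
def histogram(frame, bins=8):
    """Normalised gray-level histogram of a frame given as a flat list of 0-255 values."""
    counts = [0] * bins
    for v in frame:
        counts[min(v * bins // 256, bins - 1)] += 1
    n = len(frame)
    return [c / n for c in counts]


def split_into_shots(frames, cut_threshold=0.5):
    """Return non-overlapping (start, end) frame index ranges, end exclusive."""
    if not frames:
        return []
    shots, start = [], 0
    prev = histogram(frames[0])
    for t in range(1, len(frames)):
        cur = histogram(frames[t])
        # L1 distance between histograms lies in [0, 2]; a large jump
        # is treated as an abrupt cut (assumed criterion).
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > cut_threshold:
            shots.append((start, t))
            start = t
        prev = cur
    shots.append((start, len(frames)))
    return shots


def is_close_shot(face_areas, frame_area, min_ratio=0.02):
    """Assumed rule: a shot is a close-up if the mean detected face area
    covers at least min_ratio of the frame."""
    if not face_areas:
        return False
    mean_area = sum(face_areas) / len(face_areas)
    return mean_area / frame_area >= min_ratio
```

Gradual transitions, which the background section also mentions, would need a criterion that accumulates differences over several frames; this sketch handles abrupt cuts only.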

3) In each close-up shot, a publicly available face detector is applied to every frame to obtain face detection responses.

4) In each close-up shot, face detection responses with high similarity in several adjacent frames are associated into tracklets.
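The similarity measure used in step 4) is not specified in the source. As one possible reading, the sketch below links detections in consecutive frames greedily by bounding-box overlap (intersection over union). The box format `(x1, y1, x2, y2)`, the threshold `min_iou`, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def link_tracklets(detections_per_frame, min_iou=0.5):
    """Greedily link detections of consecutive frames.

    Each tracklet is a list of (frame_index, box). A tracklet ends as soon
    as no sufficiently overlapping detection exists in the next frame.
    """
    finished, active = [], []
    for t, boxes in enumerate(detections_per_frame):
        unmatched = list(boxes)
        next_active = []
        for track in active:
            last_box = track[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                track.append((t, best))
                unmatched.remove(best)
                next_active.append(track)
            else:
                finished.append(track)
        # Every detection left over starts a new tracklet.
        next_active.extend([(t, b)] for b in unmatched)
        active = next_active
    return finished + active
```

A conservative threshold keeps tracklets short but pure, which matters because step 5) treats all faces within one tracklet as the same person.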

5) From the obtained tracklets, positive and negative training samples are generated according to spatio-temporal constraints.

A positive training sample is a pair of face images from the same tracklet. A negative training sample is a pair of face images taken from two different tracklets, where the two tracklets appear together in some frame. Let T_i = {x_i^1, ..., x_i^{n_i}} denote a tracklet of length n_i, where x denotes a face detection response; the positive training samples are then the pairs (x_i^k, x_i^l) with k ≠ l. If T_i and T_j denote two different tracklets that appear in the same frame, the negative training samples are the pairs (x_i^k, x_j^l).

Positive and negative training samples can further be combined into groups of three (triplets): two face images come from the same tracklet and the third face image comes from another tracklet, where the two tracklets appear together in some frame. Let T_i and T_j denote two different tracklets that appear in the same frame; a training sample s = (x_i^k, x_i^l, x_j^m) can then be generated from T_i and T_j, with x_i^k and x_i^l taken from T_i and x_j^m taken from T_j.
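The sample generation rules of step 5) can be sketched as follows. The data layout is an assumption: each tracklet is a list of `(frame_index, face)` entries, where `face` stands for a face image or its identifier. The function names `co_occur`, `make_pairs`, and `make_triplets` are hypothetical.

```python
from itertools import combinations


def co_occur(track_a, track_b):
    """True if the two tracklets share at least one frame index, which
    means they must belong to different people."""
    return bool({f for f, _ in track_a} & {f for f, _ in track_b})


def make_pairs(tracklets):
    """Positive pairs come from within one tracklet; negative pairs come
    from two tracklets that appear in the same frame."""
    positives, negatives = [], []
    for track in tracklets:
        faces = [x for _, x in track]
        positives.extend(combinations(faces, 2))
    for a, b in combinations(tracklets, 2):
        if co_occur(a, b):
            negatives.extend((xa, xb) for _, xa in a for _, xb in b)
    return positives, negatives


def make_triplets(tracklets):
    """Triplets (x_i^k, x_i^l, x_j^m): two faces from one tracklet and a
    third face from a co-occurring tracklet."""
    triplets = []
    for a, b in combinations(tracklets, 2):
        if not co_occur(a, b):
            continue
        for same, other in ((a, b), (b, a)):
            faces = [x for _, x in same]
            for xk, xl in combinations(faces, 2):
                triplets.extend((xk, xl, xm) for _, xm in other)
    return triplets
```

Tracklets that never share a frame produce no negative samples, because nothing in the video rules out that they show the same athlete.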

6) Taking the obtained training samples as input, the convolutional neural network pre-trained in step 1) is fine-tuned with a Siamese or Triplet network, so as to learn more discriminative and adaptive face features online.

The Siamese network consists of two convolutional neural networks with identical structure and shared weights; it takes two face images as input and uses the contrastive loss function. In the Siamese network, the extraction of face features can be expressed as f(x) = Conv(x; W), where Conv(·) is the mapping function, x ∈ R^{227×227×3} is the input face image, and f(x) is the extracted feature vector. Let x₁, x₂ denote two training sample images; then d_f = ||f(x₁) - f(x₂)||_2 denotes the distance between the two image feature vectors. During training, the following contrastive loss function is used to reduce the distance between two images of the same target while increasing the distance between images of two different targets:

L_{P} = \frac{1}{2} \left( y \cdot d_{f}^{2} + (1 - y) \cdot \max(0, \tau - d_{f}^{2}) \right)

where τ is the margin; y = 1 indicates that the two images come from the same target, and y = 0 indicates that the two images come from different targets.
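The contrastive loss can be evaluated for a single pair as in the sketch below. Feature vectors are plain Python lists, and `embed` stands in for the shared-weight network f(x) = Conv(x; W); it is a hypothetical callable, not the network from the patent. The default `tau=1.0` is likewise an assumed value.

```python
def squared_distance(u, v):
    """Squared Euclidean distance d_f^2 between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(f1, f2, y, tau=1.0):
    """L_P = 1/2 * (y * d_f^2 + (1 - y) * max(0, tau - d_f^2)).

    y = 1: same target, the loss grows with the distance.
    y = 0: different targets, the loss is zero once d_f^2 exceeds tau.
    """
    d2 = squared_distance(f1, f2)
    return 0.5 * (y * d2 + (1 - y) * max(0.0, tau - d2))


def siamese_loss(embed, x1, x2, y, tau=1.0):
    """Apply the same embedding function (shared weights) to both inputs,
    then compute the contrastive loss on the two feature vectors."""
    return contrastive_loss(embed(x1), embed(x2), y, tau)
```

Because both branches call the same `embed`, a gradient step on this loss updates one set of weights, which is what "shared weights" means in the Siamese setting.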

The Triplet network consists of three convolutional neural networks with identical structure and shared weights; it takes groups of three images as input and uses the Triplet loss function. During training, for an input triplet (x_i^k, x_i^l, x_j^m), the distance between the positive pair (x_i^k, x_i^l) must be smaller than the distance between the negative pair (x_i^k, x_j^m). The loss function of the Triplet network is:

L_{t} = \sum_{i, j, k, l} \left[ \| f(x_{i}^{k}) - f(x_{i}^{l}) \|_{2}^{2} - \| f(x_{i}^{k}) - f(x_{j}^{m}) \|_{2}^{2} + \alpha \right]

where α is the distance margin.
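A minimal sketch of the Triplet loss over a batch of feature triplets follows. It reads the square bracket in the formula as a hinge, max(0, ·), which is the usual form of the triplet loss; this reading is an assumption, since the source formula shows only plain brackets. The default `alpha=0.2` is also an assumed value.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, alpha=0.2):
    """Sum of hinge terms over (anchor, positive, negative) feature triplets.

    Each term is zero once the negative is farther from the anchor than
    the positive by at least the margin alpha.
    """
    total = 0.0
    for f_anchor, f_pos, f_neg in triplets:
        term = sq_dist(f_anchor, f_pos) - sq_dist(f_anchor, f_neg) + alpha
        total += max(0.0, term)  # hinge: assumed reading of [.]
    return total
```

Here the anchor and positive play the roles of f(x_i^k) and f(x_i^l), and the negative plays the role of f(x_j^m).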

7) The fine-tuned convolutional neural network is used to extract the face features of every face image in each tracklet.

8) The face tracklets are associated in two stages. In the first stage, within each shot, a conventional multi-target tracking method associates tracklets according to the motion information of the targets and the learned discriminative face features. In the second stage, using only the learned face features, hierarchical agglomerative clustering associates the tracklets across different shots to generate the final face target trajectories.
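The second stage can be sketched as a small agglomerative clustering over tracklet features. Several details are assumptions not given in the source: each tracklet is summarized by the mean of its face features, clusters are compared by average linkage on squared Euclidean distance, and merging stops at a hypothetical threshold `max_dist`.

```python
def mean_feature(features):
    """Component-wise mean of a list of feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def agglomerative_merge(tracklet_features, max_dist):
    """Merge tracklets bottom-up; returns clusters as lists of tracklet indices.

    tracklet_features: one list of feature vectors per tracklet.
    """
    centers = [mean_feature(f) for f in tracklet_features]
    clusters = [[i] for i in range(len(centers))]

    def cluster_dist(ca, cb):
        # Average linkage over the tracklet mean features.
        total = sum(sq_dist(centers[i], centers[j]) for i in ca for j in cb)
        return total / (len(ca) * len(cb))

    while len(clusters) > 1:
        d, a, b = min(
            (cluster_dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > max_dist:
            break  # remaining clusters are treated as different athletes
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Each resulting cluster corresponds to one final face trajectory spanning the shots in which its tracklets occur.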

Claims

1. A method for multi-face tracking of competing athletes in sports video, characterized by comprising the following steps:

8) associating all tracklets hierarchically to generate the final face trajectories.

2. The method for multi-face tracking of competing athletes in sports video according to claim 1, characterized in that, in step 1), the structure of the convolutional neural network is input layer - convolution and sampling layers - output layer; the input layer receives the input face image, the convolution and sampling layers comprise convolution and max pooling operations, and each neuron of the output layer corresponds to one face identity.

3. The method for multi-face tracking of competing athletes in sports video according to claim 1, characterized in that, in step 5), a positive training sample is a pair of face images from the same tracklet, and a negative training sample is a pair of face images taken from two different tracklets, where the two tracklets appear together in some frame;

4. The method for multi-face tracking of competing athletes in sports video according to claim 1, characterized in that, in step 6), the Siamese network consists of two convolutional neural networks with identical structure and shared weights, takes two face images as input, and uses the contrastive loss function;

5. The method for multi-face tracking of competing athletes in sports video according to claim 1, characterized in that, in step 8), the face tracklets are associated in two stages: in the first stage, within each shot, a multi-target tracking method associates tracklets according to the motion information of the targets and the learned discriminative face features; in the second stage, using only the learned face features, hierarchical agglomerative clustering associates the tracklets across different shots to generate the final face target trajectories.