CN105760472A - Video retrieval method and system - Google Patents

Video retrieval method and system

Info

Publication number
CN105760472A
CN105760472A (application CN201610084093.XA)
Authority
CN
China
Prior art keywords
facial image
camera lens
image
similarity
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610084093.XA
Other languages
Chinese (zh)
Inventor
杨颖
李丹阳
贾静丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University filed Critical China Agricultural University
Priority to CN201610084093.XA priority Critical patent/CN105760472A/en
Publication of CN105760472A publication Critical patent/CN105760472A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video retrieval method and system. The method comprises the following steps: when a search keyword is received, the video to be retrieved is segmented into multiple shots; the first N frames of each shot are extracted and the extracted frames are checked for face images, where N is an integer greater than or equal to 1; all face images are then detected in those shots whose first N frames contain a face image; according to the search keyword, the sample set corresponding to the keyword is compared with the detected face images, and the similarity between each face image and the sample set is calculated; the face images whose similarity exceeds a first preset value are integrated into the shots they belong to, and the integrated shots are concatenated to obtain the target video. The method addresses the difficulty, in the prior art, of locating the segments a user is interested in, and increases video retrieval speed, thereby improving the user's viewing experience.

Description

Video retrieval method and system
Technical field
The present invention relates to the field of multimedia technology, and in particular to a video retrieval method and system.
Background art
In 2010, Google's smart-TV initiative formally opened the era of intelligent television, and users' demands on video have since developed toward personalization and user-friendliness.
In everyday video search, a user is often interested only in the video segments featuring one or a few particular people. Even when a video resource contains such segments, the resource itself is usually long, so the user has to watch the whole video to find the segments of interest, or misses some of them because they cannot be located precisely. As a result, retrieving the parts of interest is difficult, the user spends a long time searching, and the viewing experience is greatly degraded.
Summary of the invention
In view of the above defects in the prior art, the present invention provides a video retrieval method and system to solve the problem that the segments of interest are difficult to find.
In a first aspect, the present invention provides a video retrieval method, comprising:
when a search keyword is received, segmenting the video to be retrieved into multiple shots;
extracting the first N frames of each shot and detecting whether a face image exists in the extracted frames, N being an integer greater than or equal to 1;
detecting all face images in the shots whose first N frames contain a face image;
according to the search keyword, comparing the sample set corresponding to the keyword with the detected face images, and calculating the similarity between each face image and the sample set;
integrating the face images whose similarity exceeds a first preset value into the shots they belong to, and concatenating the integrated shots to obtain the target video.
Preferably, segmenting the video to be retrieved into multiple shots comprises:
extracting visual features of the video to be retrieved;
measuring the similarity between adjacent frames according to the visual features;
when the similarity is smaller than a second preset value, splitting the adjacent frames into two shots.
Preferably, detecting all face images in the shots whose first N frames contain a face image comprises:
using a cascade classifier to detect all face images in those shots.
Preferably, comparing the sample set corresponding to the search keyword with the face images and calculating the similarity between each face image and the sample set comprises:
according to the search keyword, extracting from a face sample database the sample set related to the keyword, the sample set being multiple face sample images of the same person;
expressing each face image as a linear combination of the face sample images;
calculating the similarity between the image and the sample set from the coefficients of the linear combination.
Preferably, integrating the face images whose similarity exceeds the first preset value into the shots they belong to comprises:
clustering the face images whose similarity exceeds the first preset value within the shot each image belongs to;
associating the clustered face images with their corresponding time and audio information to regenerate the shot containing those face images.
In a second aspect, the present invention provides a video retrieval system, comprising:
a video shot segmentation module, configured to segment the video to be retrieved into multiple shots when a search keyword is received;
a shot detection module, configured to extract the first N frames of each shot and detect whether a face image exists in the extracted frames, N being an integer greater than or equal to 1;
a face image detection module, configured to detect all face images in the shots whose first N frames contain a face image;
a face image retrieval module, configured to compare, according to the search keyword, the sample set corresponding to the keyword with the detected face images, and calculate the similarity between each face image and the sample set;
a target video generation module, configured to integrate the face images whose similarity exceeds a first preset value into the shots they belong to, and concatenate the integrated shots to obtain the target video.
Preferably, the video shot segmentation module is specifically configured to:
extract visual features of the video to be retrieved;
measure the similarity between adjacent frames according to the visual features;
and split the adjacent frames into two shots when the similarity is smaller than a second preset value.
Preferably, the face image detection module is specifically configured to use a cascade classifier to detect all face images in the shots whose first N frames contain a face image.
Preferably, the face image retrieval module is specifically configured to:
extract, according to the search keyword, the sample set related to the keyword from a face sample database, the sample set being multiple face sample images of the same person;
express each face image as a linear combination of the face sample images;
and calculate the similarity between the image and the sample set from the coefficients of the linear combination.
Preferably, the target video generation module is specifically configured to:
cluster the face images whose similarity exceeds the first preset value within the shot each image belongs to;
and associate the clustered face images with their corresponding time and audio information to regenerate the shot containing those face images.
As can be seen from the above technical solution, the video retrieval method and system of the present invention segment the video to be retrieved into multiple shots, perform face detection on the shots whose first N frames contain a face, calculate the similarity between the face images and the sample set corresponding to the search keyword, integrate the face images whose similarity exceeds the first preset value into the shots they belong to, and finally concatenate the integrated shots to obtain the target video. This effectively increases video retrieval speed and improves the user's viewing experience.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the video retrieval method provided by one embodiment of the present invention;
Fig. 2 is a schematic flowchart of the video retrieval method provided by another embodiment of the present invention;
Fig. 3 is a schematic diagram of the feature templates provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the video retrieval system provided by one embodiment of the present invention.
Detailed description of embodiments
To make the purpose, technical solution and advantages of the embodiments of the present invention clearer, the technical solution is described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the protection scope of the present invention.
Fig. 1 shows a schematic flowchart of the video retrieval method provided by one embodiment of the present invention. As shown in Fig. 1, the method of this embodiment is as follows.
101. When a search keyword is received, segment the video to be retrieved into multiple shots.
It should be understood that a broadcast video is usually produced as a sequence of shots, and the scene and content within each shot are continuous. Shot boundaries can therefore be identified by measuring the difference between adjacent frames, so that the video to be retrieved is segmented into multiple independent shots.
In practice, step 101 includes sub-steps 1011 to 1013, not shown in the figure.
1011. Extract visual features of the video to be retrieved.
For example, the colour histogram or the pixel values of the video can be extracted as its visual features.
1012. Measure the similarity between adjacent frames according to the visual features.
For example, a positional similarity can be used as the similarity between adjacent frames, expressed as:
S = Σ_{i=1}^{N} W_i × [(x_i′ − x_i) + (y_i′ − y_i)]
where (x_i, y_i) is the coordinate of a point in the i-th frame, (x_i′, y_i′) is the coordinate of the corresponding point in the next frame, and W_i is the weight of that point.
1013. When the similarity is smaller than the second preset value, split the adjacent frames into two shots.
In the above manner, when the positional similarity S falls below the second preset value, the current position is taken as a shot boundary. The second preset value is an empirical value, and this embodiment does not limit its concrete value.
Traversing the whole video in this way segments the video to be retrieved into multiple shots.
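As a sketch of sub-steps 1011 to 1013, the following pure-Python fragment segments a frame sequence into shots by thresholding the similarity of adjacent colour histograms (one of the visual features mentioned in sub-step 1011). The histogram-intersection measure and the function names are illustrative assumptions, not the weighted positional similarity defined above.

```python
def histogram_similarity(h1, h2):
    """Similarity of two normalised colour histograms (histogram intersection)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def segment_shots(histograms, threshold=0.5):
    """Split a frame sequence into shots: a shot boundary is declared
    wherever adjacent-frame similarity drops below the second preset value."""
    boundaries = [0]
    for i in range(1, len(histograms)):
        if histogram_similarity(histograms[i - 1], histograms[i]) < threshold:
            boundaries.append(i)
    ends = boundaries[1:] + [len(histograms)]
    # (start, end) frame index pairs, end exclusive
    return list(zip(boundaries, ends))
```

The threshold plays the role of the second preset value and, as the text notes, would be tuned empirically.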
102. Extract the first N frames of each shot, and detect whether a face image exists in the extracted frames.
Here N is an integer greater than or equal to 1. The frames of a broadcast video are complex and diverse in content, so to improve retrieval efficiency the first N frames of each shot are treated as its key frames and examined. If the first N frames contain a face image, the shot is retained as a shot to be retrieved; otherwise the shot is discarded and undergoes no further detection or retrieval.
103. Detect all face images in the shots whose first N frames contain a face image.
In practice, a cascade classifier can be used to detect all face images in such a shot.
Specifically, several weak classifiers are cascaded into a strong classifier, and several strong classifiers are in turn cascaded to form the cascade classifier. When a frame is tested, it first passes through the first strong classifier; if that classifier judges it to be a face image, the frame is passed to the second strong classifier, and so on until all strong classifiers have been applied. As soon as any stage in the cascade judges the frame to be a non-face image, the remaining stages are skipped and the frame is classified as non-face. In this way a large number of non-target windows are eliminated early, most non-face frames are filtered out, and detection speed is greatly improved.
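The early-exit behaviour described above can be sketched as follows; the stage functions are placeholders standing in for trained strong classifiers, not the actual trained stages.

```python
def cascade_detect(window, stages):
    """Pass a candidate window through the cascade stages in order.
    The first stage that rejects stops the evaluation, so most
    non-face windows are discarded after only a few cheap tests."""
    for stage in stages:
        if not stage(window):
            return False  # rejected: classified as non-face, later stages skipped
    return True  # accepted by every stage: classified as a face window
```

With stages ordered from cheapest to most discriminative, the average cost per window stays low even though the full cascade is deep.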
104. According to the search keyword, compare the sample set corresponding to the keyword with the detected face images, and calculate the similarity between each face image and the sample set.
It should be noted that the sample set is a set of multiple images of a given person, showing the same person under different illumination, from different angles, or with different facial expressions.
In practice, the images in the sample set are usually denoised, scaled to 100 × 100 pixels, and labelled with the person's name. During retrieval, the sample set can be looked up with the person's name as the search keyword, and the similarity between each face image and the sample set is then calculated.
105. Integrate the face images whose similarity exceeds the first preset value into the shots they belong to, and concatenate the integrated shots to obtain the target video.
Specifically, step 105 includes sub-steps 1051 and 1052, not shown in the figure.
1051. Cluster the face images whose similarity exceeds the first preset value within the shot each image belongs to.
For example, a suitable threshold can be chosen and the face images whose similarity exceeds it clustered; the face images within one cluster are then similar in scene, close in content and temporally continuous. Clustering the face images first therefore effectively improves the efficiency of the subsequent integration.
1052. Associate the clustered face images with their corresponding time and audio information to regenerate the shot containing those face images.
It should be understood that every frame in a video corresponds to unique time and audio information on the timeline. Only by re-attaching this time and audio information to the clustered frames can each segment of the original video be reconstructed; the regenerated shots are then concatenated to produce a target video containing only the content related to the search keyword.
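Sub-step 1051's temporal clustering can be sketched as grouping matched frame indices into contiguous runs; the gap parameter below is an assumed stand-in for the clustering threshold.

```python
def cluster_frames(frame_indices, max_gap=5):
    """Group matched face frames into temporal clusters: consecutive
    frames closer than max_gap fall into the same run, and each run is
    later re-associated with its time span and audio (sub-step 1052)."""
    runs, current = [], []
    for f in sorted(frame_indices):
        if current and f - current[-1] > max_gap:
            runs.append(current)  # gap too large: close the current run
            current = []
        current.append(f)
    if current:
        runs.append(current)
    return runs
```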
The video retrieval method of this embodiment segments the video to be retrieved into multiple shots, performs face detection on the shots whose first N frames contain a face, calculates the similarity between the face images and the sample set corresponding to the search keyword, integrates the face images whose similarity exceeds the first preset value into the shots they belong to, and finally concatenates the integrated shots to obtain the target video. This effectively increases video retrieval speed and improves the user's viewing experience.
Fig. 2 shows a schematic flowchart of the video retrieval method provided by another embodiment of the present invention. As shown in Fig. 2, the method of this embodiment is as follows.
201. When a search keyword is received, segment the video to be retrieved into multiple shots.
202. Extract the first N frames of each shot, and detect whether a face image exists in the extracted frames.
203. Use a cascade classifier to detect all face images in the shots whose first N frames contain a face image.
In one practicable manner, the cascade classifier of this embodiment can be trained as follows.
First, N images are chosen as training samples, smoothed and denoised, and scaled to 24 × 24. The facial features of these samples are then computed to construct weak classifiers.
Specifically, the feature templates shown in Fig. 3 can be used to obtain the rectangular features of a face image. Note that each template can be scaled to a detection window of any size. For a template of scale s × t, the number of detection windows obtained is:
([24/s] + [23/s] + … + [1/s]) × ([24/t] + [23/t] + … + [1/t])
where [·] denotes the floor function.
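The sum-of-floors count can be evaluated directly; for instance a 1 × 1 base template in a 24 × 24 frame yields 300 × 300 = 90 000 windows. A sketch (the function name is illustrative):

```python
def window_count(s, t, size=24):
    """Number of detection windows for an s×t feature template inside a
    size×size training image, per the sum-of-floors formula above."""
    horiz = sum(k // s for k in range(1, size + 1))  # [size/s] + ... + [1/s]
    vert = sum(k // t for k in range(1, size + 1))   # [size/t] + ... + [1/t]
    return horiz * vert
```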
Next, the feature value of each detection window is calculated. The feature value of a window is the sum of all pixels inside its black rectangles minus the sum of all pixels inside its white rectangles, and it can be computed efficiently with an integral image.
For example, let i(m, n) be the pixel value at point (m, n), S(m, n) the cumulative sum along the current row, and A(m, n) the integral image, i.e. the sum of all pixels above and to the left of (m, n). Scanning the image row by row and computing recursively:
S(m, n) = S(m, n − 1) + i(m, n)
A(m, n) = A(m − 1, n) + S(m, n)
yields the feature value of every detection window.
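The row-sum recursion above can be sketched in a few lines of pure Python, with lists of lists standing in for image arrays:

```python
def integral_image(img):
    """Compute A(m, n), the sum of all pixels above and to the left of
    (m, n) inclusive, via S(m,n) = S(m,n-1) + i(m,n) and
    A(m,n) = A(m-1,n) + S(m,n), scanning the image row by row."""
    rows, cols = len(img), len(img[0])
    A = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        S = 0  # cumulative sum along the current row, S(m, n)
        for n in range(cols):
            S += img[m][n]
            A[m][n] = (A[m - 1][n] if m > 0 else 0) + S
    return A
```

With A precomputed, the pixel sum of any rectangle, and hence any rectangular feature value, costs only a handful of lookups.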
Since different samples X_i (i = 1, 2, …, N) have different feature values f_j(X_i) in different detection windows K_j (j = 1, 2, …, M), a suitable threshold is chosen for each feature to decide whether an image is a face image or a non-face image.
For example, the weak classifier wh_j(X) corresponding to detection window K_j can take the form wh_j(X) = 1 if p × f_j(X) < p × θ_j, and 0 otherwise.
Here wh_j(X_i) = 1 means the image is a face image; otherwise it is a non-face image. p indicates the direction of the inequality and takes the value ±1: when the mean of the j-th feature over all samples is smaller than the threshold θ_j, p is −1, otherwise p is 1. θ_j is the optimal threshold of the j-th feature over all samples.
The threshold θ_j is determined as follows: for each feature, compute the feature value f_j(X_i) of every sample X_i and sort the values in ascending order. Let T⁺ be the total proportion of face sample images and T⁻ the total proportion of non-face sample images, and let S⁺ and S⁻ be the proportions of face and non-face samples ranked before f_j(X_i). The classification error e at the current f_j(X_i) is then:
e = min(S⁺ + (T⁻ − S⁻), S⁻ + (T⁺ − S⁺))
The optimal threshold θ_j, together with j and f_j(X_i), is the one minimising e. The features with the smallest classification error are selected as weak classifiers.
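The threshold search can be sketched as a decision-stump sweep over sorted feature values; equal sample weights and the return convention below are assumptions.

```python
def best_stump(values, labels):
    """Sweep candidate thresholds over one feature's sorted values and
    return (error, threshold, polarity p) minimising
    e = min(S+ + (T- - S-), S- + (T+ - S+)), with equal sample weights."""
    n = len(values)
    w = 1.0 / n
    T_pos = sum(w for lab in labels if lab == 1)  # total face-sample weight
    T_neg = 1.0 - T_pos                           # total non-face-sample weight
    S_pos = S_neg = 0.0                           # weight ranked before the candidate
    best = (float("inf"), None, 0)
    for i in sorted(range(n), key=lambda k: values[k]):
        e1 = S_pos + (T_neg - S_neg)  # misclassified if faces score above threshold
        e2 = S_neg + (T_pos - S_pos)  # misclassified if faces score below threshold
        if e1 < best[0]:
            best = (e1, values[i], -1)
        if e2 < best[0]:
            best = (e2, values[i], 1)
        if labels[i] == 1:
            S_pos += w
        else:
            S_neg += w
    return best
```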
Then, the strong classifier for face detection is constructed. All samples are initially given the same weight; the first weak classifier classifies the N samples, after which the weights of the misclassified samples are increased and the weights of the correctly classified samples are decreased. The second weak classifier is then trained on the re-weighted samples, and the weights are updated in the same way. After P iterations, P weak classifiers have been generated.
The P weak classifiers are then combined with appropriate weights to obtain a strong classifier, and several strong classifiers are assembled into the cascade classifier used to detect the face image frames described above.
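One boosting round — weighted error, classifier vote, sample re-weighting — can be sketched as follows. The log-based vote alpha follows the standard AdaBoost rule, an assumption here since the text only speaks of "certain weights".

```python
import math

def adaboost_round(weights, predictions, labels):
    """One boosting iteration: weighted error of the current weak
    classifier, its vote alpha, and the updated (renormalised) sample
    weights -- misclassified samples gain weight, correct ones lose it."""
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - err) / err)
    updated = [w * math.exp(alpha if p != y else -alpha)
               for w, p, y in zip(weights, predictions, labels)]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```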
204. According to the search keyword, extract the sample set related to the keyword from the face sample database.
The sample set consists of multiple face sample images of the same person.
In practice, the face sample images can be classified by person. For example, with k persons the samples fall into k classes, and the full collection can be written as [d_{11}, d_{12}, …, d_{1n}, d_{21}, d_{22}, …, d_{2n}, …, d_{k1}, d_{k2}, …, d_{kn}], where each column vector d_{ij} (i = 1, …, k; j = 1, …, n) is one sample image of one person. Further, let D_i = [d_{i1}, d_{i2}, …, d_{in}]; then D = (D_1, D_2, …, D_k) is the face sample image database composed of the k sample sets.
205. Express the face image as a linear combination of the face sample images.
From the structure of the sample set it can be seen that each face image can be expressed as a linear combination of the face sample images, for example as Y = DA, where A is a sparse coefficient matrix. When the image shows one of the persons in the database, it can be written as Y = a_{i1} × d_{i1} + a_{i2} × d_{i2} + … + a_{in} × d_{in}, where a_{i1}, a_{i2}, …, a_{in} are the sparse coefficients, i.e. one column of A.
206. Calculate the similarity between the image and the sample set from the coefficients of the linear combination.
Specifically, the coefficient sum ΣA_i = a_{i1} + a_{i2} + … + a_{in} can be used as the similarity between the image frame Y and the sample set.
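Given a solved coefficient vector A, the per-person coefficient sums ΣA_i fall out of a block-wise sum over the dictionary layout D = (D_1, …, D_k); the function name and the flat-list layout are assumptions.

```python
def person_similarity(coeffs, n):
    """Sum the sparse coefficients block-wise: block i of length n holds
    the coefficients on person i's sample images, so its sum is the
    similarity between the query frame Y and that person's sample set."""
    return [sum(coeffs[i:i + n]) for i in range(0, len(coeffs), n)]
```

The person with the largest block sum is the best match for the query frame.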
207. Integrate the face images whose similarity exceeds the first preset value into the shots they belong to, and concatenate the integrated shots to obtain the target video.
For example, if the sample set contains n face sample images, then an image whose coefficient sum satisfies ΣA_i > 0.8n is taken as a retrieved face image. The retrieved face images are integrated into the shots they belong to, and the integrated shots are concatenated to obtain the target video.
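Step 207's decision and integration can be sketched as keeping the shots that contain at least one frame passing the 0.8n test, then returning them in time order ready for concatenation; the names and data shapes are assumptions.

```python
def select_shots(shots, frame_scores, n_samples, factor=0.8):
    """Keep each (start, end) shot containing at least one frame whose
    coefficient sum exceeds factor * n_samples; concatenating the
    returned shots in order yields the target video."""
    hits = {f for f, score in frame_scores.items() if score > factor * n_samples}
    return [(a, b) for a, b in shots if any(a <= f < b for f in hits)]
```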
The video retrieval method of this embodiment segments the video to be retrieved into multiple shots, performs face detection on the shots whose first N frames contain a face, calculates the similarity between the face images and the sample set corresponding to the search keyword, integrates the face images whose similarity exceeds the first preset value into the shots they belong to, and finally concatenates the integrated shots to obtain the target video. This effectively increases video retrieval speed and improves the user's viewing experience.
Fig. 4 shows the video retrieval system provided by one embodiment of the present invention. As shown in Fig. 4, the system of this embodiment includes: a video shot segmentation module 41, a shot detection module 42, a face image detection module 43, a face image retrieval module 44 and a target video generation module 45.
The video shot segmentation module 41 is configured to segment the video to be retrieved into multiple shots when a search keyword is received.
The shot detection module 42 is configured to extract the first N frames of each shot and detect whether a face image exists in the extracted frames, N being an integer greater than or equal to 1.
The face image detection module 43 is configured to detect all face images in the shots whose first N frames contain a face image.
The face image retrieval module 44 is configured to compare, according to the search keyword, the sample set corresponding to the keyword with the detected face images, and calculate the similarity between each face image and the sample set.
The target video generation module 45 is configured to integrate the face images whose similarity exceeds a first preset value into the shots they belong to, and concatenate the integrated shots to obtain the target video.
Preferably, the video shot segmentation module 41 is specifically configured to extract visual features of the video to be retrieved, measure the similarity between adjacent frames according to the visual features, and split the adjacent frames into two shots when the similarity is smaller than a second preset value.
Preferably, the face image detection module 43 is specifically configured to use a cascade classifier to detect all face images in the shots whose first N frames contain a face image.
Preferably, the face image retrieval module 44 is specifically configured to extract, according to the search keyword, the sample set related to the keyword from a face sample database, the sample set being multiple face sample images of the same person; express each face image as a linear combination of the face sample images; and calculate the similarity between the image and the sample set from the coefficients of the linear combination.
Preferably, the target video generation module 45 is specifically configured to cluster the face images whose similarity exceeds the first preset value within the shot each image belongs to, and associate the clustered face images with their corresponding time and audio information to regenerate the shot containing those face images.
The video retrieval system of this embodiment can be used to execute the technical solution of the method embodiments shown in Fig. 1 or Fig. 2; its implementation principle and technical effect are similar and are not repeated here.
The video retrieval system of this embodiment segments the video to be retrieved into multiple shots, performs face detection on the shots whose first N frames contain a face, calculates the similarity between the face images and the sample set corresponding to the search keyword, integrates the face images whose similarity exceeds the first preset value into the shots they belong to, and finally concatenates the integrated shots to obtain the target video. This effectively increases video retrieval speed and improves the user's viewing experience.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described therein can still be modified, or some or all of their technical features can be replaced by equivalents, without such modifications or replacements departing from the scope of the claims of the present invention.

Claims (10)

1. A video retrieval method, characterised in that the method comprises:
when a search keyword is received, segmenting the video to be retrieved into multiple shots;
extracting the first N frames of each shot and detecting whether a face image exists in the extracted frames, N being an integer greater than or equal to 1;
detecting all face images in the shots whose first N frames contain a face image;
according to the search keyword, comparing the sample set corresponding to the keyword with the detected face images, and calculating the similarity between each face image and the sample set;
integrating the face images whose similarity exceeds a first preset value into the shots they belong to, and concatenating the integrated shots to obtain the target video.
2. The method according to claim 1, characterized in that segmenting the video to be retrieved into multiple shots comprises:
extracting visual features of the video to be retrieved;
measuring the similarity between adjacent frames according to the visual features;
when the similarity is less than a second preset value, dividing the adjacent frames into two shots.
3. The method according to claim 1, characterized in that detecting all facial images in the shots whose first N frames contain a facial image comprises:
using a cascade classifier to detect all facial images in the shots whose first N frames contain a facial image.
4. The method according to claim 1, characterized in that comparing the sample set corresponding to the search keyword with the facial images and calculating the similarity between each facial image and the sample set comprises:
according to the search keyword, extracting a sample set related to the search keyword from a face sample database, the sample set being multiple face sample images of the same person;
expressing the facial image as a linear combination of the face sample images;
calculating the similarity between the facial image and the sample set according to the coefficients of the linear combination.
5. The method according to claim 1, characterized in that integrating the facial images whose similarity exceeds the first preset value into the shots to which they belong comprises:
clustering the facial images whose similarity exceeds the first preset value within the shots to which they belong;
associating the clustered facial images with the corresponding temporal information and audio information to generate a shot containing the facial image.
6. A video retrieval system, characterized in that the system comprises:
a video shot segmentation module, configured to segment a video to be retrieved into multiple shots when a search keyword is received;
a shot detection module, configured to extract the first N frames of each shot and detect whether a facial image is present in the extracted frames, N being an integer greater than or equal to 1;
a facial image detection module, configured to detect all facial images in the shots whose first N frames contain a facial image;
a facial image retrieval module, configured to compare, according to the search keyword, the sample set corresponding to the search keyword with the facial images, and to calculate the similarity between each facial image and the sample set;
a target video generation module, configured to integrate the facial images whose similarity exceeds a first preset value into the shots to which they belong, and to concatenate the integrated shots to obtain a target video.
7. The system according to claim 6, characterized in that the video shot segmentation module is specifically configured to:
extract visual features of the video to be retrieved;
measure the similarity between adjacent frames according to the visual features; and
when the similarity is less than a second preset value, divide the adjacent frames into two shots.
8. The system according to claim 6, characterized in that the facial image detection module is specifically configured to:
use a cascade classifier to detect all facial images in the shots whose first N frames contain a facial image.
9. The system according to claim 6, characterized in that the facial image retrieval module is specifically configured to:
extract, according to the search keyword, a sample set related to the search keyword from a face sample database, the sample set being multiple face sample images of the same person;
express the facial image as a linear combination of the face sample images; and
calculate the similarity between the facial image and the sample set according to the coefficients of the linear combination.
10. The system according to claim 6, characterized in that the target video generation module is specifically configured to:
cluster the facial images whose similarity exceeds the first preset value within the shots to which they belong; and
associate the clustered facial images with the corresponding temporal information and audio information to generate a shot containing the facial image.
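The similarity calculation of claims 4 and 9 can be sketched as follows. The claims leave the mapping from linear-combination coefficients to a similarity score unspecified; this minimal sketch assumes least-squares coefficients and a reconstruction-residual score (a common choice in sparse-representation face matching), with all function names hypothetical:

```python
import numpy as np

def similarity_to_sample_set(query, samples):
    """Express the query face (flattened to a vector) as a linear combination of
    the sample images (columns of A) via least squares, then score by how well
    that combination reconstructs the query: small residual -> high similarity."""
    A = np.stack([s.ravel().astype(float) for s in samples], axis=1)
    y = query.ravel().astype(float)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = np.linalg.norm(A @ coeffs - y)
    return 1.0 / (1.0 + residual)  # in (0, 1]; 1 means exact reconstruction

rng = np.random.default_rng(0)
samples = [rng.random((4, 4)) for _ in range(3)]   # tiny stand-ins for face samples
inside = 0.5 * samples[0] + 0.5 * samples[1]       # lies in the span of the samples
outside = rng.random((4, 4)) * 10                  # far from the span
print(similarity_to_sample_set(inside, samples))   # close to 1.0
print(similarity_to_sample_set(outside, samples))  # much smaller
```

A detected face would then be kept when this score exceeds the first preset value, as in claims 1 and 5.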
CN201610084093.XA 2016-02-06 2016-02-06 Video retrieval method and system Pending CN105760472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610084093.XA CN105760472A (en) 2016-02-06 2016-02-06 Video retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610084093.XA CN105760472A (en) 2016-02-06 2016-02-06 Video retrieval method and system

Publications (1)

Publication Number Publication Date
CN105760472A true CN105760472A (en) 2016-07-13

Family

ID=56330041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610084093.XA Pending CN105760472A (en) 2016-02-06 2016-02-06 Video retrieval method and system

Country Status (1)

Country Link
CN (1) CN105760472A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169071A (en) * 2017-05-08 2017-09-15 浙江大华技术股份有限公司 A kind of video searching method and device
WO2018033152A1 (en) * 2016-08-19 2018-02-22 中兴通讯股份有限公司 Video playing method and apparatus
CN107948730A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture generation video
CN108764067A (en) * 2018-05-08 2018-11-06 北京大米科技有限公司 Video intercepting method, terminal, equipment and readable medium based on recognition of face
CN108881813A (en) * 2017-07-20 2018-11-23 北京旷视科技有限公司 A kind of video data handling procedure and device, monitoring system
CN109034174A (en) * 2017-06-08 2018-12-18 北京君正集成电路股份有限公司 A kind of cascade classifier training method and device
CN110545443A (en) * 2018-05-29 2019-12-06 优酷网络技术(北京)有限公司 Video clip acquisition method and device
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device
CN110866148A (en) * 2018-08-28 2020-03-06 富士施乐株式会社 Information processing system, information processing apparatus, and storage medium
CN113837022A (en) * 2021-09-02 2021-12-24 北京新橙智慧科技发展有限公司 Method for rapidly searching video pedestrian
US11995556B2 (en) 2018-05-18 2024-05-28 Cambricon Technologies Corporation Limited Video retrieval method, and method and apparatus for generating video retrieval mapping relationship

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040218814A1 (en) * 1993-10-20 2004-11-04 Takafumi Miyatake Video retrieval method and apparatus
CN101650740A (en) * 2009-08-27 2010-02-17 中国科学技术大学 Method and device for detecting television advertisements
CN103530652A (en) * 2013-10-23 2014-01-22 北京中视广信科技有限公司 Face clustering based video categorization method and retrieval method as well as systems thereof
CN103761284A (en) * 2014-01-13 2014-04-30 中国农业大学 Video retrieval method and video retrieval system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040218814A1 (en) * 1993-10-20 2004-11-04 Takafumi Miyatake Video retrieval method and apparatus
CN101650740A (en) * 2009-08-27 2010-02-17 中国科学技术大学 Method and device for detecting television advertisements
CN103530652A (en) * 2013-10-23 2014-01-22 北京中视广信科技有限公司 Face clustering based video categorization method and retrieval method as well as systems thereof
CN103761284A (en) * 2014-01-13 2014-04-30 中国农业大学 Video retrieval method and video retrieval system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HE Wei: "Research on similar-shot retrieval technology based on multiple key frames", Wanfang Data *
LIU Bo et al.: "A fast and scalable subspace clustering algorithm", Pattern Recognition and Artificial Intelligence *
ZHUO Jing: "The influence and application of film shooting and editing techniques in animation shot language", Pin Yi Chang Lang *
PENG Yuxin et al.: "A method for video retrieval by video clips", Journal of Software *
JIAN Cairen et al.: "Gene expression data clustering based on projection least squares regression subspace segmentation", Pattern Recognition and Artificial Intelligence *
CHEN Lizhen et al.: "Face image retrieval in video based on subspace incremental learning", Journal of Computer-Aided Design & Computer Graphics *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018033152A1 (en) * 2016-08-19 2018-02-22 中兴通讯股份有限公司 Video playing method and apparatus
CN107770528A (en) * 2016-08-19 2018-03-06 中兴通讯股份有限公司 Video broadcasting method and device
CN107770528B (en) * 2016-08-19 2023-08-25 中兴通讯股份有限公司 Video playing method and device
CN107169071A (en) * 2017-05-08 2017-09-15 浙江大华技术股份有限公司 A kind of video searching method and device
CN107169071B (en) * 2017-05-08 2020-02-14 浙江大华技术股份有限公司 Video searching method and device
CN109034174A (en) * 2017-06-08 2018-12-18 北京君正集成电路股份有限公司 A kind of cascade classifier training method and device
CN109034174B (en) * 2017-06-08 2021-07-09 北京君正集成电路股份有限公司 Cascade classifier training method and device
CN108881813A (en) * 2017-07-20 2018-11-23 北京旷视科技有限公司 A kind of video data handling procedure and device, monitoring system
CN107948730B (en) * 2017-10-30 2020-11-20 百度在线网络技术(北京)有限公司 Method, device and equipment for generating video based on picture and storage medium
CN107948730A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture generation video
CN108764067A (en) * 2018-05-08 2018-11-06 北京大米科技有限公司 Video intercepting method, terminal, equipment and readable medium based on recognition of face
US11995556B2 (en) 2018-05-18 2024-05-28 Cambricon Technologies Corporation Limited Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device
CN110545443A (en) * 2018-05-29 2019-12-06 优酷网络技术(北京)有限公司 Video clip acquisition method and device
CN110866148A (en) * 2018-08-28 2020-03-06 富士施乐株式会社 Information processing system, information processing apparatus, and storage medium
CN113837022A (en) * 2021-09-02 2021-12-24 北京新橙智慧科技发展有限公司 Method for rapidly searching video pedestrian

Similar Documents

Publication Publication Date Title
CN105760472A (en) Video retrieval method and system
CN109993160B (en) Image correction and text and position identification method and system
US11062123B2 (en) Method, terminal, and storage medium for tracking facial critical area
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
Zhang et al. Probabilistic graphlet transfer for photo cropping
CN109635686B (en) Two-stage pedestrian searching method combining human face and appearance
CN111985621A (en) Method for building neural network model for real-time detection of mask wearing and implementation system
CN113779308B (en) Short video detection and multi-classification method, device and storage medium
CN107358141B (en) Data identification method and device
CN110738262B (en) Text recognition method and related product
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN104463232A (en) Density crowd counting method based on HOG characteristic and color histogram characteristic
Nag et al. A new unified method for detecting text from marathon runners and sports players in video (PR-D-19-01078R2)
CN108268875A (en) A kind of image meaning automatic marking method and device based on data smoothing
Elharrouss et al. FSC-set: counting, localization of football supporters crowd in the stadiums
CN111753923A (en) Intelligent photo album clustering method, system, equipment and storage medium based on human face
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN104866826A (en) Static gesture language identification method based on KNN algorithm and pixel ratio gradient features
Agrawal et al. Redundancy removal for isolated gesture in Indian sign language and recognition using multi-class support vector machine
CN110728214B (en) Weak and small figure target detection method based on scale matching
CN110458203B (en) Advertisement image material detection method
Rakowski et al. Hand shape recognition using very deep convolutional neural networks
CN117011932A (en) Running behavior detection method, electronic device and storage medium
Wan et al. Face detection method based on skin color and adaboost algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160713