Summary of the invention
The objective of the invention is to overcome the poor robustness, low accuracy, high computational complexity, and narrow applicability of existing news anchor detection methods, and thereby to provide a news anchor detection method of wide applicability, high accuracy, and low computational complexity.
To achieve this goal, the invention provides a news anchor detection method based on spatio-temporal slice pattern analysis, carried out in the following order:
Step 10), take N consecutive frames from the edited news video as one group, and extract a horizontal spatio-temporal slice and a vertical spatio-temporal slice from the intercepted group of consecutive frames;
Step 20), extract the image features of the horizontal spatio-temporal slice and of the vertical spatio-temporal slice, respectively, to obtain the corresponding feature vectors;
Step 30), cluster the feature vectors of the horizontal spatio-temporal slices and of the vertical spatio-temporal slices separately by a clustering method, and within each class merge the temporally continuous horizontal or vertical slices into new elements of the class, obtaining the final horizontal and vertical clustering results;
Step 40), fuse the class containing the most elements in the horizontal clustering result with the class containing the most elements in the vertical clustering result, and detect the news anchor shots according to the fusion result.
In the above technical scheme, in step 10), extracting the horizontal spatio-temporal slice means: extracting the same row of pixels (along the horizontal direction) from each successive frame in the group, and splicing the extracted rows of pixels into a new image; the resulting image is the horizontal spatio-temporal slice. The length of the horizontal spatio-temporal slice is the number of frames in the group of consecutive frames, and its width is the width of one frame.
In the above technical scheme, in step 10), extracting the vertical spatio-temporal slice means: extracting the same column of pixels (along the vertical direction) from each successive frame in the group, and splicing the extracted columns of pixels into a new image; the resulting image is the vertical spatio-temporal slice. The length of the vertical spatio-temporal slice is the number of frames in the group of consecutive frames, and its width is the height of one frame.
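The slice extraction described above can be sketched as follows; this is a minimal illustration assuming the frame group is available as a NumPy array, with the middle row and middle column chosen as suggested later in the text (function and parameter names are ours):

```python
import numpy as np

def extract_slices(frames, row=None, col=None):
    """Extract horizontal and vertical spatio-temporal slices from a frame group.

    frames: array of shape (N, H, W) or (N, H, W, C); by default the
    middle row and middle column are sampled.
    """
    frames = np.asarray(frames)
    n, h, w = frames.shape[:3]
    row = h // 2 if row is None else row
    col = w // 2 if col is None else col
    # Horizontal slice: the same row from every frame, stacked over time
    # -> shape (N, W[, C]); its length is the frame count and its width
    # the frame width, as described in the text.
    horizontal = frames[:, row, :]
    # Vertical slice: the same column from every frame, stacked over time
    # -> shape (N, H[, C]); its width is the frame height.
    vertical = frames[:, :, col]
    return horizontal, vertical
```

The slice is thus a new image whose rows index time, so stationary content (an anchor in a fixed studio) produces near-constant stripes along the time axis.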
In the above technical scheme, in step 10), the value of N is an integer multiple of 25.
In the above technical scheme, in step 20), the image features are color features and texture features.
When extracting the color features of an image, this can be done in the color space RGB (Red, Green, Blue), HSV (Hue, Saturation, Value), HSI (Hue, Saturation, Intensity), YUV (Y: luminance signal; U and V: chrominance signals), or Lab (L: luminance; a and b: chrominance).
Extracting color features in the HSV color space comprises the following steps:
Step 21-1), convert the RGB values of the image into hue, saturation, and value (brightness);
Step 21-2), quantize the hue, saturation, and value of the image into levels, respectively;
Step 21-3), give extra weight to hue, and convert the quantized three-dimensional vector into a single integer by a linear combination, each value representing one color segment;
Step 21-4), after every pixel of the image has been quantized, extract the color features of the image.
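The quantization steps above can be illustrated as follows. The text does not specify the level counts or the weights of the linear combination; this sketch assumes 8 hue levels and 2 levels each for saturation and value (so hue receives the extra weight), giving 8*2*2 = 32 color segments:

```python
import colorsys

def quantize_hsv(r, g, b):
    """Map an RGB pixel (0-255 per channel) to one of 32 color segments.

    Assumed scheme: 8 hue levels, 2 saturation levels, 2 value levels,
    combined linearly as index = 4*h + 2*s + v (an integer in [0, 31]).
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hq = min(int(h * 8), 7)   # 8 hue levels: hue gets the finest quantization
    sq = min(int(s * 2), 1)   # 2 saturation levels
    vq = min(int(v * 2), 1)   # 2 value levels
    return 4 * hq + 2 * sq + vq
```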
In step 21-4), extracting the color features of the image means: first dividing the entire image evenly into 4*4 small blocks, then combining the small blocks into 5 large blocks corresponding to the top, bottom, left, right, and middle parts. The color features of the different blocks are extracted as follows:
for each of the four surrounding blocks, extract three-dimensional color moments, namely the first, second, and third moments of the color;
for the middle block, extract a histogram quantized to 32 levels;
in addition, extract the third color moment of the entire image, which describes its overall color characteristics.
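Assuming each slice has already been quantized to 32 color segments, the block-wise color feature described above might be sketched like this; the exact grouping of the 16 small blocks into 5 regions and the histogram normalization are our reading of the description, and all names are illustrative:

```python
import numpy as np

def color_moments(values):
    """First three color moments: mean, standard deviation, and the
    signed cube root of the third central moment."""
    mean = values.mean()
    std = values.std()
    third = ((values - mean) ** 3).mean()
    skew = np.sign(third) * np.abs(third) ** (1.0 / 3.0)
    return [mean, std, skew]

def color_feature(segments):
    """Color feature of one slice, given its quantized segment image.

    segments: 2-D array of color-segment indices in [0, 31]. The image
    is split on a 4x4 grid; the top/left/bottom/right border regions
    give 3 color moments each, the middle 2x2 region a 32-bin
    histogram, and the whole image one further third moment.
    """
    seg = np.asarray(segments, dtype=float)
    h, w = seg.shape
    rows = [0, h // 4, h // 2, 3 * h // 4, h]
    cols = [0, w // 4, w // 2, 3 * w // 4, w]
    top    = seg[rows[0]:rows[1], :]
    bottom = seg[rows[3]:rows[4], :]
    left   = seg[rows[1]:rows[3], cols[0]:cols[1]]
    right  = seg[rows[1]:rows[3], cols[3]:cols[4]]
    centre = seg[rows[1]:rows[3], cols[1]:cols[3]]
    feat = []
    for region in (top, left, bottom, right):
        feat.extend(color_moments(region))
    hist, _ = np.histogram(centre, bins=32, range=(0, 32))
    feat.extend(hist / max(hist.sum(), 1))   # normalized 32-bin histogram
    feat.append(color_moments(seg)[2])       # global third moment
    return np.array(feat)                    # 4*3 + 32 + 1 = 45 dimensions
```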
Extracting the texture features of the image means: describing the texture by extracting an edge histogram from the entire image.
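A simplified edge-histogram sketch in the spirit of the local edge histogram descriptor cited later in the embodiment; the filter values, the 2x2 sub-block averaging, and the edge threshold are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def edge_histogram(gray, block=4):
    """Simplified edge-histogram texture descriptor.

    Each block x block cell of the grayscale image is assigned one of
    five edge types (vertical, horizontal, 45 deg, 135 deg,
    non-directional) by the strongest of four 2x2 directional filters.
    """
    filters = [
        np.array([[1, -1], [1, -1]]),                      # vertical edge
        np.array([[1, 1], [-1, -1]]),                      # horizontal edge
        np.array([[np.sqrt(2), 0], [0, -np.sqrt(2)]]),     # 45-degree edge
        np.array([[0, np.sqrt(2)], [-np.sqrt(2), 0]]),     # 135-degree edge
    ]
    hist = np.zeros(5)
    g = np.asarray(gray, dtype=float)
    for y in range(0, g.shape[0] - block + 1, block):
        for x in range(0, g.shape[1] - block + 1, block):
            cell = g[y:y + block, x:x + block]
            half = block // 2
            # average the cell down to 2x2 sub-blocks before filtering
            sub = cell.reshape(2, half, 2, half).mean(axis=(1, 3))
            strengths = [abs((sub * f).sum()) for f in filters]
            best = int(np.argmax(strengths))
            if strengths[best] > 1.0:   # illustrative edge threshold
                hist[best] += 1
            else:
                hist[4] += 1            # non-directional / flat cell
    return hist / max(hist.sum(), 1.0)
```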
In the above technical scheme, in step 30), the clustering method is the K-means clustering method.
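A minimal K-means sketch over the slice feature vectors; the number of classes k, the initialization, and the iteration budget are not fixed by the text and are chosen here for illustration:

```python
import numpy as np

def kmeans(features, k, iters=50, seed=0):
    """Minimal K-means over slice feature vectors.

    features: (n_slices, dim) array; returns one class label per slice.
    Initial centres are random samples; iteration count is fixed.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)
    centres = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each feature vector to its nearest centre
        d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its members
        for j in range(k):
            if (labels == j).any():
                centres[j] = x[labels == j].mean(axis=0)
    return labels
```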
In the above technical scheme, step 40) specifically comprises the following steps:
Step 41), calculate the similarities of corresponding shots between the class containing the most elements in the horizontal clustering result and the class containing the most elements in the vertical clustering result;
Step 42), according to the shot similarity results, determine whether the similarity of a shot from the horizontal cluster and a shot from the vertical cluster is greater than a preset first threshold; if so, the two shots are corresponding elements of the classes, otherwise they are non-corresponding elements;
Step 43), calculate the similarity between the class containing the most elements in the horizontal clustering result and the class containing the most elements in the vertical clustering result;
Step 44), according to the class similarity computed in step 43), judge whether the similarity of the two classes is greater than a preset second threshold; if so, fuse the two classes into the class corresponding to the news anchor shots; if the similarity of the two classes is less than or equal to the preset second threshold, the video has no obvious news-program structure and no news anchor shots need be extracted.
In step 41), calculating the similarity of corresponding shots means: calculating the ratio of the length of the intersection of the two shots' time spans to their total time length; if the computed similarity is less than zero, it is set to zero.
In step 43), calculating the similarity of the classes means: calculating the similarity of the two classes by summing the similarities between the elements of the two classes.
In step 44), fusing the two classes means: if an element of one class has no corresponding element (in the sense of step 42)) in the other class, that element becomes a new element of the fused class; if the two classes contain corresponding elements, the corresponding elements are merged, with the earliest start time and the latest end time of the two corresponding elements taken as the start and end times of the merged element.
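The fusion rule above can be sketched as follows, with shots represented as (start, end) time pairs and the correspondence test of step 42) passed in as a predicate; all names are illustrative:

```python
def fuse_classes(class_h, class_v, corresponds):
    """Fuse the largest horizontal and vertical classes into one.

    class_h, class_v: lists of (start, end) shots; corresponds(a, b)
    decides whether two shots are "corresponding" elements.
    Corresponding pairs are merged to (earliest start, latest end);
    unmatched shots from either class are kept as new elements.
    """
    fused, matched_v = [], set()
    for a in class_h:
        partner = None
        for j, b in enumerate(class_v):
            if j not in matched_v and corresponds(a, b):
                partner = j
                break
        if partner is None:
            fused.append(a)                 # no correspondence: keep as-is
        else:
            b = class_v[partner]
            matched_v.add(partner)
            fused.append((min(a[0], b[0]), max(a[1], b[1])))
    # vertical shots that matched nothing also become new elements
    fused.extend(b for j, b in enumerate(class_v) if j not in matched_v)
    return fused
```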
The value of the first threshold is 0.5, and the value of the second threshold is 0.5.
The advantage of the present invention is that it detects anchors in all kinds of news video with high accuracy, wide applicability, and low computational complexity, avoiding the shortcoming of existing methods of relying too heavily on accurate shot segmentation and on other modality information.
Embodiment
The present invention is described in further detail below in conjunction with the drawings and a specific embodiment:
As shown in Figure 1, the news anchor detection method based on spatio-temporal slice pattern analysis of the present invention comprises the following steps:
Step 10, take N consecutive frames from the edited news video as one group, and extract a horizontal spatio-temporal slice and a vertical spatio-temporal slice from each group. In this step, when extracting the horizontal spatio-temporal slice, as shown in Fig. 2(a) and Fig. 2(b), the same row of pixels is extracted from each consecutive frame in the group, and the extracted rows of pixels are spliced into a new image; the resulting image is the horizontal spatio-temporal slice. The length of the horizontal spatio-temporal slice is the number of frames in the group, and its width is the width of one video frame. When extracting the vertical spatio-temporal slice, as shown in Fig. 2(a) and Fig. 2(c), the same column of pixels is extracted from each consecutive frame in the group, and the extracted columns of pixels are spliced into a new image; the resulting image is the vertical spatio-temporal slice. Its length is the number of frames in the group, and its width is the height of one video frame.
When extracting the horizontal and vertical spatio-temporal slices described above, the pixels of the same row or the same column should be extracted from every frame in the group of consecutive frames, but the position of the extracted pixel row or column can be chosen as required; in this embodiment, the middle row or middle column of the frame is selected.
In this step, the value N of the intercepted consecutive frames can be set as required; in this embodiment, N is 50. By intercepting groups of consecutive frames, this step avoids depending on accurate shot segmentation. After the edited news video has passed through this step, a series of horizontal and vertical spatio-temporal slices is obtained.
Step 20, extract the image features of the horizontal and vertical spatio-temporal slices. In a news anchor shot, the anchor and the studio background change little compared with the foreground and background of a news story shot, so the color and texture of the spatio-temporal slices corresponding to anchor shots vary only slightly, while the color and texture of the slices corresponding to news story units vary markedly; characterizing the images by their color and texture features therefore aids the subsequent clustering. As is clear from step 10 above, the horizontal and vertical spatio-temporal slices are newly spliced images, and extracting image features from them specifically comprises the following steps:
Step 21, extract the color features of the image. Color features can be extracted in different color spaces, such as RGB, HSV, HSI, YUV, or Lab. In this embodiment, the HSV (Hue, Saturation, Value) space is taken as an example to realize the extraction of color features, which comprises:
Step 21-1, convert the RGB values of the image into hue, saturation, and value;
Step 21-2, quantize the hue, saturation, and value of the image into levels, respectively. In the HSV space each pixel corresponds to a three-dimensional vector representing the hue, saturation, and value of that pixel, but the units and ranges of the three values differ, so they must be quantized separately.
Step 21-3, give extra weight to hue, following the results of vision research, and convert the quantized three-dimensional vector into an integer between 0 and 31 by a linear combination, each value representing one color segment.
Step 21-4, after every pixel has been quantized, extract the color features of the image. The specific implementation is shown in Figure 3: the image is divided evenly into 4*4 small blocks, which are then combined into 5 large blocks as shown in Figure 3, corresponding to the top, bottom, left, right, and middle parts (A, B, C, and D correspond to the top, left, bottom, and right parts respectively, E to the middle part, and the thick lines mark the borders of the large blocks). The color features of the different blocks are extracted as follows:
each of the four surrounding blocks yields 3-dimensional color moments (the first, second, and third moments of the color);
the middle block yields a histogram quantized to 32 levels;
in addition, the third color moment of the entire image is extracted to describe its overall color characteristics.
Step 22, extract the texture features of the image. The texture is described by extracting an edge histogram from the entire image.
The specific implementations of the color feature extraction in step 21-4 and the texture feature extraction in step 22 are mature prior art; the extraction of texture features is described in detail in the reference "D. K. Park, Y. S. Jeon, C. S. Won, and S.-J. Park, Efficient use of local edge histogram descriptor, Proc. of the ACM Workshops on Multimedia, Los Angeles, CA, Nov. 2000".
Step 23, combine all the image features of each spatio-temporal slice into a feature vector that characterizes that slice.
After the above operations of this step, each spatio-temporal slice has formed a high-dimensional vector representing its features, and a news video forms one feature vector per spatio-temporal slice extracted in step 10.
Step 30, cluster the feature vectors of the horizontal and vertical spatio-temporal slices separately by a clustering method, and within each class merge the temporally continuous horizontal or vertical slices into new elements of the class.
Generally speaking, a news video has the following structural features:
1. news anchor shots usually appear periodically in the news video;
2. the news anchor shots usually have very high similarity to one another;
3. the class corresponding to the news anchor shots should contain the most elements, because anchor shots usually have very high visual similarity, whereas other shots are similar only to temporally adjacent shots within the same story unit.
According to the above structural features of news video, the present invention adopts the following rule: the news anchor shot is the only shot class in the video that has, throughout the whole program, many shots with similar visual content. From this rule it follows that the class containing the most elements in the horizontal and in the vertical clustering result is, in each case, the class that contains the news anchor shots, where each spatio-temporal slice corresponds to one shot of the video. Let Cluster_max^H and Cluster_max^V denote the classes containing the most elements in the horizontal and vertical clustering results respectively; they can be written as
Cluster_max^H = {Shot_1^H, Shot_2^H, ..., Shot_R^H},
Cluster_max^V = {Shot_1^V, Shot_2^V, ..., Shot_S^V},
where R and S are the numbers of elements of the largest classes in the horizontal and vertical clustering results respectively.
In addition, a news story shot sometimes contains a person who stays in the shot for a long time with little change; to prevent such cases from being confused with news anchor shots, this step merges the temporally continuous horizontal or vertical slices within each class into new elements of the class.
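The within-class merging of temporally continuous slices can be sketched as follows, with each cluster element as a (start_frame, end_frame) pair; treating ranges whose gap is at most one frame as continuous is our choice:

```python
def merge_continuous(elements, gap=1):
    """Merge time-continuous slices inside one cluster (step 30):
    sorted frame ranges that touch or overlap (within `gap` frames)
    become one element spanning the earliest start to the latest end."""
    merged = []
    for start, end in sorted(elements):
        if merged and start <= merged[-1][1] + gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```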
The clustering operation of this step can use the K-means clustering method.
Step 40, fuse the class containing the most elements in the horizontal clustering result obtained in step 30 with the class containing the most elements in the vertical clustering result, and detect the news anchor shots according to the fusion result. In the following description, Cluster_max^H denotes the class containing the most elements in the horizontal clustering result and Cluster_max^V the class containing the most elements in the vertical clustering result. The concrete operations of this step are as follows:
Step 41, calculate the similarities of corresponding shots in Cluster_max^H and Cluster_max^V. The similarity of two shots Shot_i^H and Shot_j^V is calculated as the ratio of the length of the intersection of their time spans to their total time length:
Sim(Shot_i^H, Shot_j^V) = [Min(T_End^H, T_End^V) - Max(T_Start^H, T_Start^V)] / [Max(T_End^H, T_End^V) - Min(T_Start^H, T_Start^V)],
where T_Start^H, T_End^H, T_Start^V, and T_End^V denote the start and end times of Shot_i^H and Shot_j^V respectively, and Min and Max denote the minimum and maximum operations. If the computed similarity is less than 0, it is set to 0.
Step 42, according to the shot similarity results obtained in step 41, judge whether the similarity Sim(Shot_i^H, Shot_j^V) of two shots Shot_i^H and Shot_j^V is greater than the preset threshold Th_1; if so, the two shots are called "corresponding" elements of the classes, otherwise "non-corresponding" elements.
Step 43, calculate the similarity of Cluster_max^H and Cluster_max^V; the similarity of the two classes is obtained by summing the similarities between the elements of the two classes.
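A sketch of the class similarity of step 43. The text states only that element similarities are summed; the normalization by the two element counts below is our assumption, made so that the result stays in [0, 1] and is comparable with a threshold of 0.5:

```python
def class_similarity(class_h, class_v, shot_sim):
    """Similarity of the two largest classes: sum of pairwise element
    similarities, normalized (our assumption) by the product of the
    element counts R*S so the value stays in [0, 1]."""
    total = sum(shot_sim(a, b) for a in class_h for b in class_v)
    return total / (len(class_h) * len(class_v))
```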
Step 44, according to the similarity of Cluster_max^H and Cluster_max^V obtained in step 43, judge whether Sim(Cluster_max^H, Cluster_max^V) is greater than the preset threshold Th_2. If so, the classes Cluster_max^H and Cluster_max^V are fused into the class corresponding to the final news anchor shots; if Sim(Cluster_max^H, Cluster_max^V) is less than or equal to the preset threshold Th_2, the video has no obvious news-program structure, and no news anchor shots need be extracted.
In the above fusion process, if an element of Cluster_max^H or Cluster_max^V has no "corresponding" element (as obtained in step 42) in the other class, that element becomes a new element of the final class; if the two classes contain "corresponding" elements, the corresponding elements are merged, with the earliest start time and the latest end time of the two corresponding elements taken as the start and end times of the new element.
The threshold value Th that in step 42, is adopted
1Value desirable 0.5, the value of the threshold value Th2 that is adopted in step 44 is also desirable 0.5, but can do suitable adjustment according to these two threshold values of actual conditions.
Finally, it should be noted that the above embodiment is intended only to illustrate, not to limit, the technical scheme of the present invention. Although the present invention has been described in detail with reference to an embodiment, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical scheme of the present invention that do not depart from its spirit and scope should all be covered by the scope of the claims of the present invention.