CN112288047A

CN112288047A - Broadcast television news stripping method based on probability distribution transformation clustering

Info

Publication number: CN112288047A
Application number: CN202011555578.5A
Authority: CN
Inventors: 陈锋; 温序铭; 张�诚; 杨瀚; 彭军
Original assignee: Chengdu Sobey Digital Technology Co Ltd
Current assignee: Chengdu Sobey Digital Technology Co Ltd
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-01-29
Anticipated expiration: 2040-12-25
Also published as: CN112288047B

Abstract

The invention discloses a broadcast television news stripping method based on probability distribution transformation clustering, which comprises the following steps: s1, converting the news program video into data, and extracting characteristic data; s2, calculating the importance ratio of each feature data, and multiplying each feature data by the importance ratio of the feature as the new data of the feature; s3, normalizing each data with the weight characteristics; s4, clustering the in-point class and the non-in-point class in the normalized feature data by using probability distribution transformation clustering; s5, segmenting news stories and the like according to the point-in type data and the non-point-in type data obtained by clustering; the method solves the problems of large error and low accuracy of the traditional clustering algorithm in the splitting of the broadcast television news, and has very important significance for improving the accuracy of the clustering algorithm in the splitting application of the television news program.

Description

Broadcast television news stripping method based on probability distribution transformation clustering

Technical Field

The invention relates to the field of broadcast television news stripping, in particular to a broadcast television news stripping method based on probability distribution transformation clustering.

Background

In recent years, with the blowout development of the broadcast television news industry, the television news program has the characteristics of durability, instantaneity and the like of 7 × 24 hours. These news programs typically contain multiple news stories, and audience members such as television editors, viewers, etc. are usually only interested in a small portion of the news stories, so that it is necessary to split a continuous entire news program into multiple independent news stories. The traditional method for manually splitting the news stories is time-consuming and labor-consuming. Therefore, it is necessary to find a method for automatically stripping tv news, which intercepts news stories from the entire news material.

In conventional engineering applications, the problem of splitting news stories is generally regarded as a labeling type problem, and segments of news stories are labeled as bs (begin scene), ms (middle scene), es (end scene), ss (single scene), and then labeled by using a labeling algorithm, so as to complete the splitting. However, the labeling algorithm applied in the traditional labeling thought is a supervised learning algorithm, and a large amount of labels labeled manually are needed, so that the rapid application of the labeling algorithm is restricted.

Clustering algorithms, an unsupervised learning algorithm, are typically used in the absence of data tags. The essence of the news story split bar is to find the in point of each news story from the television news program material, and the news story is naturally determined as long as the in point of the news story is found. Therefore, the in-points can be considered as one class, and the non-in-points can be considered as another class, so that the breaking of news stories is considered as a binary clustering problem.

However, in the practical engineering application of news story slitting, the effect of the traditional clustering algorithm is limited to a certain extent, mainly because the traditional clustering algorithm directly carries out clustering analysis in the original data space. For example, the Kmeans algorithm determines the class of data by continuously iterating with euclidean distance directly in the original data space. When the distribution of the in-point data and the non-in-point data in the original data space is not clear, the clustering effect is not good due to direct clustering, and therefore large errors occur in the in-out points of the news story slivers.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a broadcast television news stripping method based on probability distribution transformation clustering, solves the problems of large error and low accuracy of the traditional clustering algorithm in the broadcast television news stripping, and has very important significance for improving the accuracy of the clustering algorithm in the application of television news program stripping.

The purpose of the invention is realized by the following scheme:

a broadcast television news splitting method based on probability distribution transformation clustering comprises the following steps:

s1, extracting characteristic data in the news program video data;

s2, calculating the importance ratio of each extracted feature data, and then multiplying each feature data by the importance ratio of the feature data to obtain data with weighted features;

s3, normalizing each data with the weight characteristics;

s4, clustering the point-in data and the non-point-in data in the normalized feature data by probability distribution conversion clustering;

and S5, segmenting the news story according to the data of the in-point class and the non-in-point class obtained by clustering.

Further, step S1 includes the steps of:

s101, cutting a video from audio pause points in a news program to obtain a plurality of cut segments, wherein all the audio pause points are used as candidate cut points of a news story;

s102, extracting visual feature data of each cutting segment according to the video information of each cutting segment; the visual feature data includes: judging result data of whether the current cutting segment has a studio or not, judging result data of whether the front cutting segment and the rear cutting segment of the current cutting segment contain the studio or not, data of the number of faces appearing in the studio, judging result data of whether the current cutting segment is a continuous studio or not and judging result data of whether the film and flower information appears or not;

s103, extracting audio characteristic data of each cutting segment according to the audio information of each cutting segment; the audio feature data includes: judging result data of whether music appears in the current cutting segment, judging result data of whether the front cutting segment and the rear cutting segment of the current cutting segment contain music and ASR (auto-regressive) speech character information data of the current cutting segment;

s104, manually judging whether each current cutting segment and the previous cutting segment belong to different news stories, setting 1 if the current cutting segment and the previous cutting segment belong to different news stories, and setting 0 if the current cutting segment and the previous cutting segment do not belong to different news stories; the result of the manual judgment will be used in calculating the feature importance in the subsequent step S2 as a true result.

Further, step S2 includes the steps of:

s201, numbering according to a time sequence based on the feature data extracted in the step S1, then taking a certain feature according to the number sequence, taking the certain feature as a current feature, judging whether the current feature is a continuous feature or a discrete feature, and if the current feature is the continuous feature, discretizing a continuous value into n boxes by using an equal frequency binning method, wherein 2< = n < = 5;

s202, selecting a certain box of the current characteristic data, recording the box as an i box, and counting the number of the cutting types of the i box, namely the data of the set 1 in the step S104

And counting the number of non-slitting type data of the set 0 in step S104

Then, respectively calculating the number of the cutting classes of the current feature current box to account for the total number of the cutting classes of the current feature

The ratio of the current feature to the total number of the non-slitting classes of the current feature

The quotient of the ratio, the logarithm, and the notation

；

S203, calculating the difference between the ratio of the cutting class and the ratio of the non-cutting class of the current box of the current feature in the step S202, and multiplying the difference by the ratio

It is recorded as

；

S204, repeating the steps S202-S203, completing the calculation of all boxes, and then completing the calculation of all boxes, namely n boxes

Adding to obtain current characteristic dataIVA value;

s205, of all characteristic dataIVAfter the values are calculated, the value of each feature is calculatedIVValue of the sumIVThe ratio of the value sum; and taking the ratio as the weight of the current characteristic data, and multiplying the value of the current characteristic data by the ratio as the data with the weighted characteristic of the current characteristic data.

Further, in step S3, the min-max algorithm is selected as the normalization method of the weighted feature data, and the calculation formula is as follows:

where j represents the data index,

、

respectively represent data before and after normalization, and X represents all data of a certain characteristic.

Further, step S4 includes the steps of:

s401, taking the certain normalized news program material data in the step S3 as basic data X, randomly selecting two segments in the basic data X as initial center segments, and taking the two initial center segments as current optimal center segments;

s402, taking each line in the basic data X as a segment, calculating Euclidean distances between the segments and the two initial central segments in the step S401 respectively, dividing the segments into which initial central segment class when the segments are closer to which initial central segment, and respectively recording the two initial central segment classes as a class and a class b;

s403, solving the data transfer matrix A to enable the edge probability of the data of a and b

And

the difference of (a) is minimal; calculating a data transfer matrix A by using the MMD distances of the two classes; x_a，X_bRespectively representing basic data of a and b;

s404, using a Gaussian kernel function to perform dimension increasing on the basic data X to obtain dimension increasing data

(ii) a The gaussian kernel function is given by:

s405, calculating

Data in new data space after upscaling

；

S406, for the data of the new data space after the dimension increasing

Using the kmeans algorithm, the data is divided

Clustering as new

Recording indexes of data of the two types;

s407: according to step S406

The indexes of the two types of data find the corresponding data of the basic data X as new types a and b of the basic data X;

s408: respectively calculating new a and b clustering center segments according to the new a and b types of the basic data X, and comparing the new a and b clustering center segments with the current optimal clustering center segment; if the new a and b cluster center segments and the current optimal cluster center segment are moved in comparison, which indicates that the clustering has a space for iterative optimization, then go to step S403; if the new a and b cluster center segments are not moved in comparison with the current optimal cluster center segment, the optimal cluster is found through iteration, and the algorithm is ended.

Further, in step S403, the MMD distance, i.e. the distance between the class centers of the two classes a and b, is calculated as follows:

n and m respectively represent data volumes of a and b, and i and j respectively represent data indexes of a and b;

then, the MMD distance is converted, and the constraint condition that the variance of the data of a and b is unchanged before and after conversion is considered; meanwhile, overfitting is prevented, and a regular term is added; in summary, the objective function is as follows:

where tr () is the trace of the matrix, M is an MMD matrix, H is the center matrix, I is the identity matrix,

is a regular term coefficient;

then, solving the target function by using a Lagrange method, and then obtaining a transformation matrix A; the solving formula is as follows:

wherein,

is the lagrange multiplier and X is the underlying data.

Further, in step S5, the number of studios appearing in each category is counted according to the two categories of clustering results a and b, and the category with the largest number of studios appearing is taken as the cut category, and the category with the smallest number of studios appearing is taken as the non-cut category, so as to obtain the final news program cut result.

Further, in step S1, the audio pause point is used as a candidate point for news story segmentation.

Further, comprising the steps of:

extracting the topic distribution of each segment by using a multi-label topic classification model according to the ASR speech character information extracted in the step S103, then calculating the topic cosine similarity of the current segment and the two segments before and after the current segment and calculating the jaccard similarity of the current segment and the two segments before and after the current segment by using the topic distribution of the current segment and the two segments before and after the current segment;

extracting keywords of each segment according to the ASR speech character information extracted in the step S103, and then calculating the average value, the maximum value, the minimum value and the variance of the similarity of the keywords of the current segment and the keywords of the front segment and the rear segment by using the keywords of the current segment and the keywords of the front segment and the rear segment and combining a word2vect model;

and extracting the entity time of each segment according to the ASR speech character information extracted in the step S103, and judging whether the entity time can be extracted or not.

The invention has the beneficial effects that:

(1) the invention provides a new clustering method and a news stripping method, which solve the problems of large error and low accuracy of the traditional clustering algorithm in the broadcast television news stripping; specifically, the method for splitting the news of the broadcast television based on the probability distribution transformation clustering comprises the steps of firstly transforming an original data space so as to enable the distribution difference of point-in data and non-point-in data to be larger, and then clustering again by using data after transforming the data space so as to achieve the purpose of distinguishing the point-in data from the non-point-in data; the method can convert the original data space according to the distribution of the point-in data and the non-point-in data, so that the difference of the data of the same type is smaller, and then the category of the data is calculated by using a clustering method, so that the point-in class and the non-point-in class are found, and the method has very important significance for improving the accuracy of a clustering algorithm in the application of splitting the television news program.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a probability distribution transform clustering algorithm;

FIG. 2 is a flow chart of the method steps of the present invention.

Detailed Description

All of the features disclosed in the specification for all of the embodiments (including any accompanying claims, abstract and drawings), or all of the steps of a method or process so disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.

As shown in fig. 1 and 2, the method for splitting news items of broadcast television based on probability distribution transformation clustering includes the steps:

s1, extracting characteristic data in the news program video data;

s3, normalizing each data with the weight characteristics;

Further, step S1 includes the steps of:

Further, step S2 includes the steps of:

And counting the number of non-slitting type data of the set 0 in step S104

The quotient of the ratio, the logarithm, and the notation

；

It is recorded as

；

Adding to obtain current characteristic dataIVA value;

where j represents the data index,

、

Further, step S4 includes the steps of:

And

(ii) a The gaussian kernel function is given by:

s405, calculating

Data in new data space after upscaling

；

S406, for the data of the new data space after the dimension increasing

Using the kmeans algorithm, the data is divided

Clustering as new

Recording indexes of data of the two types;

s407: according to step S406

is a regular term coefficient;

wherein,

is the lagrange multiplier and X is the underlying data.

Further, comprising the steps of:

In other embodiments of the present invention, a method for breaking news items of broadcast television based on probability distribution transformation clustering is found, which comprises the following steps:

the method comprises the following steps: and (5) video datamation of the news program. More than 50 news program videos are obtained, and feature data (such as whether a studio, semantic similarity before and after, keyword similarity before and after, and the like) are extracted according to the news program videos.

Step two: and calculating the weight of the news program characteristic data. The importance ratio of each feature is calculated using the Information Value algorithm, and then each feature data is multiplied by the importance ratio of the feature as new data for the feature.

Step three: and normalizing the news program characteristic data. The data for each feature was normalized to between 0-1 using the min-max method.

Step four: and clustering news program characteristic data. And clustering the in-point class and the non-in-point class in the characteristic data by using a probability distribution transformation clustering algorithm.

Step five: a news story is segmented. And cutting out news stories according to the point-in type data and the non-point-in type data obtained by clustering.

In other embodiments of the present invention, a method for splitting news of broadcast television based on probability distribution transformation clustering is provided, where fig. 1 shows the whole process steps of extracting video data of broadcast television news into segments by using a clustering algorithm, and the method includes the following steps:

the method comprises the following steps: video digitization of news programs;

step two: calculating the weight of the feature data of the news program;

step three: normalizing the weight characteristic data of the news program;

step four: clustering news program characteristic data;

step five: and segmenting the news stories based on the clustering point-in data.

In the first step of the above scheme, the video digitization of the news program refers to obtaining historical video material of the news program from a plurality of television channel programs. Considering that short audio pause occurs when switching between different news stories, the scheme of the embodiment adopts the audio pause point as a candidate point for segmenting the news stories. The nature of the news story ticker is to find the true news story cut points from these candidate audio cut points.

In view of the above considerations, the specific implementation steps of step one are as follows:

step 101: the video is first cut from audio pause points in the news program, all of which are candidate cut points for news stories.

Step 102: and extracting the visual feature data of the segments according to the video information of each cut segment. The visual feature data includes: whether the current cutting segment appears in a studio; whether the front and rear cutting segments of the current cutting segment contain a studio or not; the number of faces appearing in the studio; whether it is a continuous studio; whether the chipping information appears or not.

Step 103: and extracting the audio characteristic data of each cut segment according to the audio information of each cut segment. The audio feature data includes: whether music appears in the current cutting segment; whether the front and rear cut sections of the current cut section contain music; ASR phonetic text information.

Step 104: and extracting the theme distribution of each segment by using a theme model according to the ASR speech character information extracted in the step 103, and then calculating the theme cosine and jaccard similarity of the current segment and the two segments by using the theme distribution of the current segment and the two segments.

Step 105: extracting the keywords of each segment by using the keyword model according to the ASR speech character information extracted in the step 103, and then calculating the average value, the maximum value, the minimum value and the variance of the similarity of the keywords of the current segment and the keywords of the front segment and the back segment by using the keywords of the current segment and the keywords of the front segment and the back segment in combination with the word2 vent model.

Step 106: and (4) extracting the entity time of each segment by using an entity recognition model according to the ASR speech character information extracted in the step 103, and judging whether the entity time can be extracted or not.

Step 107: and manually judging whether each current clip and the previous clip belong to different news stories, if so, setting 1, and otherwise, setting 0. And the manual judgment result is used as a real result in the subsequent step II for calculating the feature importance.

In step two of the above scheme, the feature weight extracted in step one needs to be calculated. The purpose of extracting feature weights is to enlarge the contribution of important features and reduce the contribution of unimportant features when computing feature distances. The present invention calculates the degree of importance of each feature using the Information Value method. The specific calculation formula and the calculation process are as follows:

step 201: judging whether the current feature is a continuous feature or a discrete feature, and if the current feature is the continuous feature, discretizing a continuous value into n boxes by using an equal frequency binning method, wherein 2< = n < = 5.

Step 202: selecting a certain box with current characteristics, recording the box as i, and counting the number of the box slitting classes (namely the data of the 1 in the step 107) ((I))

) And the number of non-slitted classes (i.e. data set 0 in step 107) ((ii))

) Then respectively calculating the current characteristicsThe number of the cutting types of the front box accounts for the total number of the cutting types of the current characteristics (

) The ratio of (a) and the number of non-slice classes of the current box of the current feature in the total number of non-slice classes of the current feature ((

) The quotient of the ratio, the logarithm, and the notation

。

Step 203: the difference between the slice class fraction and the non-slice class fraction of the current bin of the current feature in step 202 is calculated and then multiplied by this difference

It is recorded as

。

Step 204: the 202-203 steps are repeated to directly complete the calculation of all the bins, and then all the bins (n bins in total) are calculated

And adding to obtain the Information Value of the feature.

Step 205: after the Information values of all the features are calculated, the ratio of the Information Value of each feature to the sum of the total Information values is calculated. And taking the ratio as the weight of the feature, and multiplying the value of the feature by the weight to obtain new data of the feature.

In the third step of the above scheme, the weight feature data of the news program calculated in the second step needs to be normalized. The purpose of normalization is to scale the numerical value of each feature within the interval of 0-1, and avoid large deviation of the calculated distance caused by different features due to inconsistent dimensions. Selecting a min-max algorithm as a normalization method of the characteristic data with the weight, wherein the calculation method comprises the following steps:

where j represents the data index,

、

In the fourth step of the scheme, data clustering is carried out by using a probability distribution transform clustering algorithm according to the data normalized in the third step, so that the clustering distinguishes two types, namely a slitting type and a non-slitting type. The concrete implementation steps of the step four are divided into the following steps.

Step 401: taking the normalized news program material data in the third step as basic data X, randomly selecting two segments in the basic data X as initial central segments, and taking the two initial central segments as current optimal central segments.

Step 402: the euclidean distances are calculated for the segments in the basis data from the two initial center segments in step 401, and the closer the basis data is to which initial center segment, the segments are classified as the center segment class. These two classes are referred to as a and b, respectively.

Step 403: solving the data transfer matrix A to make the data of a and bEdge probability

And

as close as possible. The data transfer matrix a is calculated using the MMD distances of the two classes. The MMD distance is essentially the distance to compute the class center for the two classes a, b, which is defined as follows:

wherein n and m represent data volumes of a and b, and i and j represent data indexes of a and b.

The MMD distance is then transformed using mathematical changes, the mathematical reasoning being as follows:

where M is an MMD matrix, as follows:

this MMD matrix represents the meaning: when two fragment data belong to class a at the same time, then

When two fragment data belong to class b at the same time, then

When the two fragments belong to different classes respectively,

。

although the data of the two classes a and b are as close as possible in the transformed data space, the misclassified data can be closer to the correct class, but the risk that the correctly clustered data is misclassified is also increased. In order to resist the risk, the constraint condition that the respective variances of the data of the a and the b are not changed before and after the transformation data space is added. Meanwhile, in order to prevent overfitting, a regularization term needs to be added.

In summary, the objective function of the algorithm of the present embodiment is as follows:

wherein, in the constraint condition,

is a central matrix of which the center is,

is a matrix of units, and is,

is a regular term coefficient.

Then, the lagrangian method is used to solve the target function with the constraint, and then the transformation matrix a can be obtained. The solution is:

wherein,

is the lagrange multiplier and X is the underlying data.

Step 404: using a Gaussian kernel function to carry out dimensionality raising on the basic data X to obtain dimensionality-raised data

. The gaussian kernel function is given by:

step 405: the up-dimensional data K is obtained from step 404, and calculation is performed

Data in new data space after upscaling

And K represents the ascending dimension data.

Step 406: data for new data space after upscaling

Using the kmeans algorithm, the data is divided

Clustering as new

And recording indexes of the data to which the two types belong.

Step 407: according to step 406

And the indexes of the data belonging to the two types find the corresponding data of the basic data X as new types a and b of the basic data X.

Step 408: and respectively calculating the clustering center segments of the new a and b types according to the new a and b types of the basic data X. Step 409: comparing the new a and b cluster center segments with the current optimal cluster center segment; if the new a and b cluster center segments and the current optimal cluster center segment are moved in comparison, which indicates that the clustering has a space for iterative optimization, then go to step S403; if the new a and b cluster center segments are not moved in comparison with the current optimal cluster center segment, the optimal cluster is found through iteration, and the algorithm is ended.

Step five: and counting the number of the studios in each category according to the clustering results a and b, taking the studios with more studios as the slitting categories, and taking the studios with less studios as the non-slitting categories.

And (5) extracting data from the news program, and obtaining a slitting class and a non-slitting class by using a probability distribution conversion clustering algorithm so as to obtain a final news program slitting result.

The functionality of the present invention, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium, and all or part of the steps of the method according to the embodiments of the present invention are executed in a computer device (which may be a personal computer, a server, or a network device) and corresponding software. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, or an optical disk, exist in a read-only Memory (RAM), a Random Access Memory (RAM), and the like, for performing a test or actual data in a program implementation.

Claims

1. A broadcast television news splitting method based on probability distribution transformation clustering is characterized by comprising the following steps:

s1, extracting characteristic data in the news program video data;

s3, normalizing each data with the weight characteristics;

2. The method for breaking news items in broadcast television based on probability distribution transformation clustering as claimed in claim 1, wherein the step S1 includes the steps of:

3. The method for breaking news items of broadcast television based on probability distribution transformation clustering as claimed in any one of claims 1 or 2, wherein the step S2 comprises the steps of:

And counting the number of non-slitting type data of the set 0 in step S104

The quotient of the ratio, the logarithm, and the notation

；

It is recorded as

；

Adding to obtain current characteristic dataIVA value;

4. The method for breaking news in broadcasting TV based on probability distribution transformation clustering as claimed in claim 1, wherein in step S3, the min-max algorithm is selected as the normalization method of weighted feature data, and the calculation formula is as follows:

where j represents the data index,

、

5. The method for breaking news items in broadcast television based on probability distribution transformation clustering as claimed in claim 1, wherein the step S4 comprises the steps of:

And

(ii) a The gaussian kernel function is given by:

s405, calculating

Data in new data space after upscaling

；

S406, for the data of the new data space after the dimension increasing

Using the kmeans algorithm, the data is divided

Clustering as new

Recording indexes of data of the two types;

s407: according to step S406

6. The method of claim 5, wherein the broadcast TV news ticker-splitting method based on probability distribution transformation clustering,

in step S403, the MMD distance, i.e. the distance between the class centers of the two classes a and b, is calculated as follows:

is a regular term coefficient;

wherein,

is the lagrange multiplier and X is the underlying data.

7. The method for breaking news in broadcasting TV based on probability distribution transformation clustering as claimed in claim 5, wherein in step S5, according to the clustering results a and b, the number of studios appearing in each category is counted, the studios with the larger number of studios appearing are used as the slicing categories, and the studios with the smaller number of studios appearing are used as the non-slicing categories, so as to obtain the final news program slicing result.

8. The broadcast television news ticker method based on probability distribution transform clustering of claim 1, wherein in step S1, audio pause points are used as candidate points for news story segmentation.