CN112288047A - Broadcast television news stripping method based on probability distribution transformation clustering - Google Patents
- Publication number
- CN112288047A (application CN202011555578.5A, filed as CN202011555578A)
- Authority
- CN
- China
- Prior art keywords
- data
- current
- feature
- clustering
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a broadcast television news stripping method based on probability distribution transformation clustering, which comprises the following steps: S1, converting the news program video into data and extracting feature data; S2, calculating the importance ratio of each feature and multiplying each feature's data by its importance ratio to obtain the feature's new data; S3, normalizing each item of weighted feature data; S4, clustering the normalized feature data into an in-point class and a non-in-point class using probability distribution transformation clustering; S5, segmenting news stories according to the in-point and non-in-point data obtained by clustering. The method solves the problems of large error and low accuracy that traditional clustering algorithms exhibit when splitting broadcast television news, and is of great significance for improving the accuracy of clustering algorithms in television news program splitting applications.
Description
Technical Field
The invention relates to the field of broadcast television news stripping, in particular to a broadcast television news stripping method based on probability distribution transformation clustering.
Background
In recent years, with the rapid development of the broadcast television news industry, television news programming has become a continuous, around-the-clock (7 × 24 hour) stream. These news programs typically contain multiple news stories, and audiences such as television editors and viewers are usually interested in only a small portion of them, so it is necessary to split a continuous news program into multiple independent news stories. The traditional approach of manually splitting news stories is time-consuming and labor-intensive. It is therefore necessary to find a method for automatically stripping television news, which extracts individual news stories from the complete news material.
In conventional engineering applications, the problem of splitting news stories is generally treated as a sequence-labelling problem: segments of news stories are labelled BS (begin scene), MS (middle scene), ES (end scene) or SS (single scene), and a labelling algorithm then completes the split. However, the labelling algorithms used in this approach are supervised learning algorithms that require a large number of manually produced labels, which restricts their rapid deployment.
Clustering algorithms, as unsupervised learning methods, are typically used when data labels are absent. The essence of news story splitting is to find the in-point of each news story in the television news material; once the in-points are found, the stories are naturally determined. The in-points can therefore be treated as one class and the non-in-points as another, reducing news story splitting to a two-class clustering problem.
However, in practical engineering applications of news story splitting, the effect of traditional clustering algorithms is limited, mainly because they perform cluster analysis directly in the original data space. For example, the K-means algorithm determines the class of each data point by iterating with Euclidean distances computed directly in the original space. When the distributions of in-point and non-in-point data are not well separated in the original space, direct clustering performs poorly, producing large errors in the in- and out-points of the resulting news story segments.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a broadcast television news stripping method based on probability distribution transformation clustering, solves the problems of large error and low accuracy of the traditional clustering algorithm in the broadcast television news stripping, and has very important significance for improving the accuracy of the clustering algorithm in the application of television news program stripping.
The purpose of the invention is realized by the following scheme:
A broadcast television news splitting method based on probability distribution transformation clustering comprises the following steps:
S1, extracting feature data from the news program video data;
S2, calculating the importance ratio of each extracted feature, and multiplying each feature's data by its importance ratio to obtain weighted feature data;
S3, normalizing each item of weighted feature data;
S4, clustering the normalized feature data into an in-point class and a non-in-point class using probability distribution transformation clustering;
S5, segmenting news stories according to the in-point and non-in-point data obtained by clustering.
Further, step S1 includes the steps of:
S101, cutting the video at the audio pause points in the news program to obtain a number of cut segments, all audio pause points serving as candidate cut points of news stories;
S102, extracting visual feature data for each cut segment from its video information; the visual feature data include: whether the current cut segment shows a studio, whether the segments before and after the current cut segment contain a studio, the number of faces appearing in the studio, whether the current cut segment is a continuous studio shot, and whether bumper/trailer graphics appear;
S103, extracting audio feature data for each cut segment from its audio information; the audio feature data include: whether music appears in the current cut segment, whether the segments before and after the current cut segment contain music, and the ASR (automatic speech recognition) text of the current cut segment;
S104, manually judging whether each current cut segment and its predecessor belong to different news stories, setting the label to 1 if they do and 0 otherwise; the manually judged labels serve as the ground truth when computing feature importance in the subsequent step S2.
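The visual and audio features above can be assembled into one numeric row per cut segment. A minimal Python sketch with hypothetical field names (the key names are illustrative and not given by the patent):

```python
def segment_feature_row(seg):
    """Assemble one feature row for a cut segment (steps S102-S103).

    `seg` is a hypothetical dict of per-segment detector outputs;
    the key names are assumptions for illustration.
    """
    return [
        int(seg["has_studio"]),            # current segment shows a studio shot
        int(seg["prev_has_studio"]),       # preceding segment contains a studio
        int(seg["next_has_studio"]),       # following segment contains a studio
        seg["num_faces"],                  # faces detected in the studio shot
        int(seg["continuous_studio"]),     # segment is a continuous studio shot
        int(seg["has_bumper"]),            # bumper/trailer graphics detected
        int(seg["has_music"]),             # music detected in current segment
        int(seg["neighbors_have_music"]),  # music in front/rear segments
    ]
```

Each row then becomes one line of the data matrix that the later weighting, normalization and clustering steps operate on.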
Further, step S2 includes the steps of:
S201, numbering the feature data extracted in step S1 in time order, then taking the features one by one in numbered order as the current feature and judging whether it is continuous or discrete; if it is continuous, discretizing its values into n bins by equal-frequency binning, where 2 <= n <= 5;
S202, selecting a bin of the current feature, denoted bin i; counting the number n_cut,i of cut-class data in bin i (the data set to 1 in step S104) and the number n_noncut,i of non-cut-class data (the data set to 0 in step S104); then computing the proportion p_cut,i = n_cut,i / N_cut of the cut-class count of the current bin relative to the feature's total cut-class count N_cut, and likewise the proportion p_noncut,i = n_noncut,i / N_noncut relative to the total non-cut-class count N_noncut; taking the logarithm of their quotient gives the weight of evidence of the bin, WoE_i = ln(p_cut,i / p_noncut,i);
S203, computing the difference between the cut-class proportion and the non-cut-class proportion of the current bin from step S202 and multiplying it by WoE_i, recorded as IV_i = (p_cut,i − p_noncut,i) × WoE_i;
S204, repeating steps S202–S203 until all bins are processed, then summing the IV_i of all n bins to obtain the IV value of the current feature: IV = Σ_{i=1..n} IV_i;
S205, after the IV values of all features have been computed, calculating for each feature the ratio of its IV value to the sum of all IV values; this ratio is taken as the weight of the feature, and the feature's data are multiplied by it to obtain the weighted feature data.
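The weight-of-evidence / Information Value computation of steps S202–S205 can be sketched in a few lines of Python (a minimal illustration, not the patent's implementation):

```python
import math

def information_value(bin_counts):
    """Information Value (IV) of one feature from its bins (steps S202-S204).

    bin_counts: list of (n_cut, n_noncut) pairs, one per bin, counting
    segments labelled 1 (cut class) and 0 (non-cut class) in step S104.
    """
    total_cut = sum(c for c, _ in bin_counts)
    total_noncut = sum(n for _, n in bin_counts)
    iv = 0.0
    for n_cut, n_noncut in bin_counts:
        p_cut = n_cut / total_cut          # cut-class share of this bin
        p_noncut = n_noncut / total_noncut # non-cut-class share of this bin
        woe = math.log(p_cut / p_noncut)   # weight of evidence of the bin
        iv += (p_cut - p_noncut) * woe     # bin's IV contribution
    return iv

def feature_weights(ivs):
    """Normalise per-feature IV values into importance ratios (step S205)."""
    total = sum(ivs)
    return [iv / total for iv in ivs]
```

Note the sketch assumes every bin contains at least one sample of each class; a production version would smooth zero counts before taking the logarithm.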
Further, in step S3, the min-max algorithm is selected as the normalization method for the weighted feature data, with the formula:
x'_j = (x_j − min(X)) / (max(X) − min(X))
where j is the data index, x_j and x'_j are the data before and after normalization, and X is all the data of the feature.
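A one-function Python sketch of the min-max step (assuming the feature's values are not all identical, so the denominator is non-zero):

```python
def min_max_normalize(values):
    """Rescale one feature's values to [0, 1] (step S3)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```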
Further, step S4 includes the steps of:
S401, taking the normalized news program material data from step S3 as the basic data X, randomly selecting two segments of X as the initial center segments, and taking these two segments as the current optimal center segments;
S402, treating each row of the basic data X as a segment, computing the Euclidean distance from each segment to each of the two center segments of step S401, and assigning each segment to the class of the nearer center; the two classes are recorded as class a and class b;
S403, solving for the data transformation matrix A that minimizes the difference between the marginal probabilities P(AᵀXa) and P(AᵀXb) of the transformed data of the two classes; the matrix A is computed from the MMD distance between the two classes; Xa and Xb denote the basic data of classes a and b respectively;
S404, lifting the basic data X into a higher-dimensional space with a Gaussian kernel to obtain the lifted data φ(X); the Gaussian kernel is k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²));
S406, applying the k-means algorithm to the transformed data in the new, higher-dimensional data space, clustering it into two new classes and recording the indexes of the data in each class;
S407: using the indexes of the two classes from step S406 to locate the corresponding rows of the basic data X, which become the new classes a and b of X;
S408: computing the new cluster-center segments of classes a and b from the new classes of the basic data X and comparing them with the current optimal cluster-center segments; if the new centers have moved relative to the current optimal centers, the clustering still has room for iterative optimization, so return to step S403; if they have not moved, the optimal clustering has been found by iteration and the algorithm ends.
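The kernel lift and the two-class k-means inner loop of these steps can be sketched in NumPy. This is a minimal illustration under stated assumptions: the patent picks the two initial centers at random (S401), whereas the sketch uses the first and last rows so it is deterministic, and it clusters directly in the lifted space rather than after the learned transform:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Pairwise Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def two_class_kmeans(Z, iters=100):
    """Cluster rows of Z into two classes, iterating until the centers
    stop moving (the convergence test of step S408)."""
    centers = np.stack([Z[0], Z[-1]])  # deterministic stand-in for S401's random pick
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        # Euclidean distance of every row to both centers (step S402)
        d = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.stack([Z[labels == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new_centers, centers):  # centers unmoved: optimum found
            break
        centers = new_centers
    return labels
```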
Further, in step S403, the MMD distance, i.e. the distance between the class centers of the two classes a and b, is computed as:
MMD(Xa, Xb) = ‖ (1/n) Σ_{i=1..n} φ(x_i^a) − (1/m) Σ_{j=1..m} φ(x_j^b) ‖²
where n and m are the data volumes of classes a and b, and i and j are the data indexes within a and b respectively;
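A minimal sketch of the MMD distance between the two classes, assuming the identity feature map φ(x) = x for illustration (the patent computes it in the kernel-lifted space):

```python
import numpy as np

def mmd_squared(Xa, Xb):
    """Squared MMD between two samples: squared distance between their
    empirical means, here with phi(x) = x (an illustrative assumption)."""
    return float(np.sum((Xa.mean(axis=0) - Xb.mean(axis=0)) ** 2))
```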
the MMD distance is then rewritten in matrix form, subject to the constraint that the variance of the data of a and b is unchanged before and after the transformation; a regularization term is also added to prevent overfitting; in summary, the objective function is:
min_A tr(AᵀXMXᵀA) + λ tr(AᵀA)  subject to  AᵀXHXᵀA = I
where tr() is the trace of a matrix, M is the MMD matrix, H = I − (1/(n+m))·11ᵀ is the centering matrix, I is the identity matrix, and λ is the regularization coefficient;
the objective function is then solved with the Lagrange multiplier method to obtain the transformation matrix A; the solution reduces to the eigenproblem of (XMXᵀ + λI)⁻¹ XHXᵀ, and A is formed from its leading eigenvectors.
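Solving for A can be sketched in NumPy, assuming the standard MMD-transform objective min_A tr(AᵀXMXᵀA) + λ·tr(AᵀA) subject to AᵀXHXᵀA = I (a reconstruction; the patent text as extracted does not reproduce its exact formula):

```python
import numpy as np

def solve_transform(X, M, H, lam=1.0, dim=2):
    """Solve for the d x dim transformation matrix A.

    X: d x (n+m) data matrix (columns are segments), M: MMD matrix,
    H: centering matrix, lam: regularization coefficient.
    The Lagrangian yields the eigenproblem of (X M X^T + lam I)^-1 X H X^T;
    A is built from the leading eigenvectors. A sketch, not the patent's
    exact numerics.
    """
    d = X.shape[0]
    lhs = X @ M @ X.T + lam * np.eye(d)          # regularized MMD term
    target = np.linalg.solve(lhs, X @ H @ X.T)   # (lhs)^-1 X H X^T
    vals, vecs = np.linalg.eig(target)
    order = np.argsort(-vals.real)[:dim]         # keep leading eigenvectors
    return vecs[:, order].real
```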
Further, in step S5, according to the two-class clustering result a and b, the number of studio appearances in each class is counted; the class with more studio appearances is taken as the cut class and the other as the non-cut class, yielding the final news program split result.
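The studio-count rule of step S5 is a few lines of Python (a minimal sketch):

```python
def assign_cut_class(labels, has_studio):
    """Step S5: the cluster containing more studio segments is the cut class.

    labels: cluster index (0 or 1) per segment; has_studio: bool per segment.
    Returns 1 for segments in the cut class (story in-points), else 0.
    """
    studio_counts = [0, 0]
    for lab, studio in zip(labels, has_studio):
        studio_counts[lab] += int(studio)
    cut = 0 if studio_counts[0] >= studio_counts[1] else 1
    return [int(lab == cut) for lab in labels]
```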
Further, in step S1, the audio pause point is used as a candidate point for news story segmentation.
Further, the method comprises the steps of:
extracting the topic distribution of each segment with a multi-label topic classification model from the ASR text extracted in step S103, then using the topic distributions of the current segment and of the segments before and after it to compute their topic cosine similarities and Jaccard similarities;
extracting the keywords of each segment from the ASR text extracted in step S103, then using the keywords of the current segment and of the segments before and after it, combined with a word2vec model, to compute the average, maximum, minimum and variance of the keyword similarities;
and extracting the time entities of each segment from the ASR text extracted in step S103 and judging whether a time entity can be extracted.
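The two similarity measures used for the textual features above can be sketched directly (a minimal illustration; the topic and keyword models themselves are outside this snippet):

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity of two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Jaccard similarity of two keyword (or topic-label) sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```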
The invention has the beneficial effects that:
(1) the invention provides a new clustering method and a news stripping method, solving the problems of large error and low accuracy of traditional clustering algorithms in broadcast television news stripping. Specifically, the method first transforms the original data space so that the distributions of the in-point and non-in-point data differ more strongly, and then clusters again in the transformed space in order to distinguish in-points from non-in-points. Because the transformation is driven by the distributions of the in-point and non-in-point data, data of the same class become more similar after the transformation; a clustering method then determines each datum's class, finding the in-point and non-in-point classes. This is of great significance for improving the accuracy of clustering algorithms in television news program splitting applications.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a probability distribution transform clustering algorithm;
FIG. 2 is a flow chart of the method steps of the present invention.
Detailed Description
All of the features disclosed in the specification for all of the embodiments (including any accompanying claims, abstract and drawings), or all of the steps of a method or process so disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
As shown in fig. 1 and 2, the method for splitting news items of broadcast television based on probability distribution transformation clustering includes the steps:
s1, extracting characteristic data in the news program video data;
s2, calculating the importance ratio of each extracted feature data, and then multiplying each feature data by the importance ratio of the feature data to obtain data with weighted features;
s3, normalizing each data with the weight characteristics;
s4, clustering the point-in data and the non-point-in data in the normalized feature data by probability distribution conversion clustering;
and S5, segmenting the news story according to the data of the in-point class and the non-in-point class obtained by clustering.
In other embodiments of the present invention, a broadcast television news stripping method based on probability distribution transformation clustering is provided, comprising the following steps:
Step one: converting the news program videos into data. More than 50 news program videos are obtained, and feature data (such as whether a studio appears, semantic similarity with the preceding and following segments, keyword similarity with the preceding and following segments, and so on) are extracted from them.
Step two: calculating the weights of the news program feature data. The importance ratio of each feature is computed with the Information Value algorithm, and each feature's data are multiplied by the feature's importance ratio to obtain its new data.
Step three: normalizing the news program feature data. The data of each feature are normalized to the range 0–1 with the min-max method.
Step four: clustering the news program feature data. The in-point class and the non-in-point class in the feature data are clustered with the probability distribution transformation clustering algorithm.
Step five: segmenting the news stories. News stories are cut out according to the in-point and non-in-point data obtained by clustering.
In other embodiments of the present invention, a broadcast television news splitting method based on probability distribution transformation clustering is provided; fig. 1 shows the overall process by which broadcast television news video data are split into segments with the clustering algorithm. The method comprises the following steps:
the method comprises the following steps: video digitization of news programs;
step two: calculating the weight of the feature data of the news program;
step three: normalizing the weight characteristic data of the news program;
step four: clustering news program characteristic data;
step five: and segmenting the news stories based on the clustering point-in data.
In step one of the above scheme, video digitization of the news program refers to obtaining historical video material of news programs from multiple television channels. Considering that a short audio pause occurs when switching between different news stories, this embodiment adopts the audio pause points as candidate points for segmenting news stories. The essence of news story splitting is then to find the true news story cut points among these candidate audio cut points.
In view of the above considerations, the specific implementation steps of step one are as follows:
step 101: the video is first cut at the audio pause points in the news program; all of these pause points are candidate cut points for news stories.
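The audio-pause cutting of step 101 can be sketched with a simple energy-based silence detector. This is an illustrative stand-in, not the patent's implementation; the window length, RMS threshold, and minimum pause duration below are assumed parameters:

```python
import numpy as np

def find_pause_points(samples, rate, win=0.05, thresh=0.01, min_pause=0.3):
    """Return candidate cut times (seconds): midpoints of spans where the
    RMS energy stays below `thresh` for at least `min_pause` seconds."""
    hop = int(win * rate)
    n = len(samples) // hop
    rms = np.array([np.sqrt(np.mean(samples[i * hop:(i + 1) * hop] ** 2))
                    for i in range(n)])
    quiet = rms < thresh
    cuts, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                               # a pause begins
        elif not q and start is not None:
            if (i - start) * win >= min_pause:
                cuts.append((start + i) / 2 * win)  # midpoint of the pause
            start = None
    if start is not None and (n - start) * win >= min_pause:
        cuts.append((start + n) / 2 * win)          # pause running to the end
    return cuts
```

Every returned time is only a candidate cut point; the rest of the method decides which candidates are true story boundaries.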
Step 102: extract the visual feature data of each cut segment from its video information. The visual feature data includes: whether the current cut segment shows a studio; whether the cut segments before and after the current one contain a studio; the number of faces appearing in the studio; whether it is a continuous studio shot; and whether teaser (trailer) information appears.
Step 103: extract the audio feature data of each cut segment from its audio information. The audio feature data includes: whether music appears in the current cut segment; whether the cut segments before and after the current one contain music; and the ASR speech-to-text information.
Step 104: using the ASR text extracted in step 103, extract the topic distribution of each segment with a topic model, then compute the cosine and Jaccard similarities between the topic distribution of the current segment and those of the two adjacent segments.
Step 105: using the ASR text extracted in step 103, extract the keywords of each segment with a keyword model; then, combining the keywords of the current segment and of the preceding and following segments with a word2vec model, compute the mean, maximum, minimum and variance of the keyword similarities between the current segment and its neighbors.
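The topic and keyword similarity features of steps 104-105 reduce to pairwise cosine and Jaccard similarities plus summary statistics. A minimal sketch follows; the function names and the returned statistics dictionary are our own choices, and in the method the vectors would come from the topic model and word2vec:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard(a, b):
    """Jaccard similarity between two keyword/topic sets (step 104)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def keyword_similarity_stats(cur_vecs, neigh_vecs):
    """Mean / max / min / variance of pairwise cosine similarities between
    the keyword embeddings of the current segment and a neighbor (step 105)."""
    s = np.array([cosine(u, v) for u in cur_vecs for v in neigh_vecs])
    return {"mean": float(s.mean()), "max": float(s.max()),
            "min": float(s.min()), "var": float(s.var())}
```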
Step 106: using the ASR text extracted in step 103, extract the time entities of each segment with an entity recognition model and record whether a time entity can be extracted.
Step 107: manually judge whether each current segment and its preceding segment belong to different news stories; set 1 if they do, otherwise set 0. This manual judgment serves as the ground truth for computing feature importance in the subsequent step two.
In step two of the above scheme, the weights of the features extracted in step one are calculated. The purpose of feature weighting is to enlarge the contribution of important features and reduce that of unimportant ones when computing feature distances. The present invention measures the importance of each feature with the Information Value (IV) method. The calculation proceeds as follows:
step 201: judge whether the current feature is continuous or discrete; if continuous, discretize its values into n bins with equal-frequency binning, where 2 <= n <= 5.
Step 202: select a bin of the current feature, denoted bin i. Count the number of cut-class samples in the bin (the data set to 1 in step 107), denoted cut_i, and the number of non-cut-class samples (the data set to 0 in step 107), denoted noncut_i. Then compute the proportion p_i of the bin's cut-class count in the total cut-class count of the current feature, and the proportion q_i of the bin's non-cut-class count in the total non-cut-class count; take the logarithm of their quotient and record it as WOE_i = ln(p_i / q_i).
Step 203: compute the difference between the cut-class proportion and the non-cut-class proportion of the current bin from step 202, and multiply it by WOE_i, recording the product as IV_i = (p_i − q_i) × WOE_i.
Step 204: repeat steps 202-203 until all bins are processed, then sum the IV_i of all n bins to obtain the Information Value of the feature: IV = IV_1 + IV_2 + ... + IV_n.
Step 205: after the Information Values of all features are computed, compute the ratio of each feature's IV to the sum of all IVs. This ratio is taken as the weight of the feature, and multiplying the feature's values by the weight yields the new data of the feature.
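Steps 201-205 can be sketched as follows. The `eps` smoothing term is our own addition to keep the logarithm finite on bins that contain only one class; it is not part of the patent's formulation:

```python
import numpy as np

def information_value(feature, label, n_bins=5, eps=1e-6):
    """IV of one feature against the 0/1 cut label (steps 201-204)."""
    # Step 201: equal-frequency bin edges via quantiles
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    cut_total = (label == 1).sum()
    non_total = (label == 0).sum()
    iv = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        p = ((label[in_bin] == 1).sum() + eps) / (cut_total + eps)  # cut-class share
        q = ((label[in_bin] == 0).sum() + eps) / (non_total + eps)  # non-cut share
        iv += (p - q) * np.log(p / q)       # steps 202-203: (p - q) * WOE
    return iv

def iv_weights(X, label):
    """Step 205: each feature's IV divided by the sum of all IVs."""
    ivs = np.array([information_value(X[:, j], label) for j in range(X.shape[1])])
    return ivs / ivs.sum()
```

A feature that separates the cut and non-cut classes well gets a large IV and hence a large weight; multiplying each column of X by its weight yields the weighted feature data.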
In step three of the above scheme, the weighted feature data of the news program calculated in step two is normalized. Normalization scales every feature into the interval [0, 1], so that features with inconsistent dimensions do not distort the computed distances. The min-max algorithm is selected as the normalization method for the weighted feature data:
x'_j = (x_j − min(X)) / (max(X) − min(X))
where j is the data index, x_j and x'_j are the data before and after normalization, and X is all the data of one feature.
In step four of the scheme, the data normalized in step three is clustered with the probability distribution transformation clustering algorithm, so that the clustering distinguishes two classes: the cut class and the non-cut class. Step four is implemented as follows.
Step 401: take the normalized news program material data from step three as the basic data X, randomly select two segments of X as initial center segments, and take these two segments as the current optimal center segments.
Step 402: compute the Euclidean distance from every segment in the basic data to each of the two initial center segments of step 401, and assign each segment to the class of the nearer center segment. The two classes are denoted a and b.
Step 403: solve for the data transfer matrix A so that the marginal probability distributions P(A^T X_a) and P(A^T X_b) of the data of a and b are as close as possible. The matrix A is computed using the MMD distance between the two classes. The MMD distance is essentially the distance between the class centers of a and b, defined as:
MMD(a, b) = || (1/n) * sum_i A^T x_i − (1/m) * sum_j A^T x_j ||^2
where n and m are the data volumes of a and b, and i and j index the data of a and b.
The MMD distance is then rewritten algebraically as MMD = tr(A^T X M X^T A), where M is the MMD matrix. The entries of M have the following meaning: when two segments both belong to class a, M_ij = 1/n^2; when both belong to class b, M_ij = 1/m^2; and when the two segments belong to different classes, M_ij = −1/(nm).
Although bringing the data of classes a and b as close as possible in the transformed data space lets misclassified data move closer to the correct class, it also increases the risk that correctly clustered data is misassigned. To resist this risk, a constraint is added that the variances of the data of a and b are unchanged before and after the transformation. A regularization term is also added to prevent overfitting.
In summary, the objective function of the algorithm of this embodiment is:
min_A tr(A^T X M X^T A) + λ||A||_F^2, subject to A^T X H X^T A = I
where, in the constraint, H is the centering matrix, I is the identity matrix, and λ is the regularization coefficient.
The constrained objective is then solved with the Lagrange multiplier method to obtain the transformation matrix A. The solution is that the columns of A are the leading eigenvectors of (X M X^T + λI)^{-1} X H X^T.
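This solution has the same form as transfer component analysis (TCA). The sketch below is written under that assumption, with the columns of X holding segments; the regularization weight `lam` and output dimension `dim` are illustrative parameters, not values from the patent:

```python
import numpy as np

def transfer_matrix(Xa, Xb, lam=0.1, dim=2):
    """Solve for A minimizing tr(A^T X M X^T A) + lam*||A||^2 subject to
    A^T X H X^T A = I (step 403), taking the leading eigenvectors of
    (X M X^T + lam*I)^{-1} X H X^T as in TCA."""
    X = np.hstack([Xa, Xb])
    n, m = Xa.shape[1], Xb.shape[1]
    N = n + m
    # MMD matrix M: 1/n^2 within a, 1/m^2 within b, -1/(n*m) across classes
    e = np.vstack([np.full((n, 1), 1.0 / n), np.full((m, 1), -1.0 / m)])
    M = e @ e.T
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    lhs = X @ M @ X.T + lam * np.eye(X.shape[0])
    rhs = X @ H @ X.T
    vals, vecs = np.linalg.eig(np.linalg.solve(lhs, rhs))
    order = np.argsort(-vals.real)           # keep the leading directions
    return vecs[:, order[:dim]].real
```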
Step 404: up-dimension the basic data X with a Gaussian kernel function to obtain the up-dimensioned data K. The Gaussian kernel function is:
K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2))
step 405: with the up-dimensioned data K obtained in step 404, compute the data in the new data space after the transformation, Z = A^T K, where K denotes the up-dimensioned data.
Step 406: apply the kmeans algorithm to the data Z of the new data space, cluster it into new classes a and b, and record the indexes of the data belonging to each class.
Step 407: using the indexes of the two classes recorded in step 406, locate the corresponding data in the basic data X; these form the new classes a and b of X.
Step 408: from the new classes a and b of the basic data X, compute the new cluster center segment of each class.
Step 409: compare the new a and b cluster center segments with the current optimal cluster center segments. If the centers have moved, the clustering still has room for iterative optimization, so go to step 403; if they have not moved, the optimal clustering has been found by iteration and the algorithm ends.
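Steps 404-409 can be sketched as the loop below. For brevity the transfer matrix A defaults to the identity here (in the method A is re-solved via step 403 on every outer iteration); `sigma`, the initial label split, and the assumption that neither class empties are our own:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Step 404: up-dimension the basic data X (columns are segments)."""
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2 * sigma ** 2))

def cluster_segments(X, A=None, max_iter=50, sigma=1.0):
    """Steps 404-409: project the up-dimensioned data, run 2-class kmeans,
    map labels back to X, and iterate until the class centers of X stop
    moving. Assumes both classes stay non-empty."""
    K = gaussian_kernel(X, sigma)                      # step 404
    A = np.eye(K.shape[0]) if A is None else A
    Z = A.T @ K                                        # step 405: new data space
    labels = np.zeros(X.shape[1], dtype=int)
    labels[: X.shape[1] // 2] = 1                      # arbitrary initial split
    best = None
    for _ in range(max_iter):
        # step 408: class center segments in the original space X
        centers = np.stack([X[:, labels == k].mean(axis=1) for k in (0, 1)], axis=1)
        if best is not None and np.allclose(centers, best):
            break                                      # step 409: centers stopped moving
        best = centers
        # step 406: kmeans-style assignment in the transformed space Z
        zc = np.stack([Z[:, labels == k].mean(axis=1) for k in (0, 1)], axis=1)
        d = ((Z[:, :, None] - zc[:, None, :]) ** 2).sum(axis=0)
        labels = d.argmin(axis=1)                      # step 407: indexes map back to X
    return labels
```

On well-separated data the loop converges in a couple of iterations, returning per-segment class labels whose indexes map straight back to the basic data X.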
Step five: according to the clustering results a and b, count the number of studio segments in each class; the class containing more studio segments is taken as the cut class and the other as the non-cut class.
In this way, data is extracted from the news program, and the cut and non-cut classes obtained with the probability distribution transformation clustering algorithm yield the final news program splitting result.
The functionality of the present invention, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied as a software product stored in a storage medium, where all or part of the steps of the method of the embodiments are executed by a computer device (which may be a personal computer, a server, or a network device) together with the corresponding software. The aforementioned storage medium includes media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), or an optical disk.
Claims (8)
1. A broadcast television news splitting method based on probability distribution transformation clustering is characterized by comprising the following steps:
s1, extracting characteristic data in the news program video data;
s2, calculating the importance ratio of each extracted feature data, and then multiplying each feature data by the importance ratio of the feature data to obtain data with weighted features;
s3, normalizing each data with the weight characteristics;
s4, clustering the normalized feature data into a cut-point class and a non-cut-point class by probability distribution conversion clustering;
and S5, segmenting the news stories according to the cut-point class and non-cut-point class data obtained by clustering.
2. The method for breaking news items in broadcast television based on probability distribution transformation clustering as claimed in claim 1, wherein the step S1 includes the steps of:
s101, cutting a video from audio pause points in a news program to obtain a plurality of cut segments, wherein all the audio pause points are used as candidate cut points of a news story;
s102, extracting visual feature data of each cut segment according to the video information of each cut segment; the visual feature data includes: judgment result data of whether a studio appears in the current cut segment, judgment result data of whether the cut segments before and after the current one contain a studio, data of the number of faces appearing in the studio, judgment result data of whether the segment is a continuous studio shot, and judgment result data of whether teaser (trailer) information appears;
s103, extracting audio feature data of each cut segment according to the audio information of each cut segment; the audio feature data includes: judgment result data of whether music appears in the current cut segment, judgment result data of whether the cut segments before and after the current one contain music, and ASR (automatic speech recognition) speech text information data of the current cut segment;
s104, manually judging whether each current cut segment and its preceding cut segment belong to different news stories, setting 1 if they do and 0 if they do not; the manual judgment result is used as the ground truth when computing feature importance in the subsequent step S2.
3. The method for splitting broadcast television news based on probability distribution transformation clustering as claimed in claim 1 or 2, wherein the step S2 comprises the steps of:
s201, numbering the feature data extracted in step S1 in time order, then taking one feature in number order as the current feature; judging whether the current feature is continuous or discrete, and if continuous, discretizing its values into n bins with equal-frequency binning, where 2 <= n <= 5;
s202, selecting a bin of the current feature data, denoted bin i; counting the number of cut-class samples in the bin, i.e. the data set to 1 in step S104, denoted cut_i, and the number of non-cut-class samples, i.e. the data set to 0 in step S104, denoted noncut_i; then computing the proportion p_i of the bin's cut-class count in the total cut-class count of the current feature and the proportion q_i of the bin's non-cut-class count in the total non-cut-class count, taking the logarithm of their quotient and recording it as WOE_i = ln(p_i / q_i);
S203, computing the difference between the cut-class proportion and the non-cut-class proportion of the current bin from step S202 and multiplying it by WOE_i, recording the product as IV_i = (p_i − q_i) × WOE_i;
S204, repeating steps S202-S203 until all bins are processed, then summing the IV_i of all n bins to obtain the IV value of the current feature data;
S205, after the IV values of all feature data are computed, computing the ratio of each feature's IV value to the sum of all IV values; taking this ratio as the weight of the current feature data, and multiplying the values of the current feature data by the ratio to obtain the weighted data of the feature.
4. The method for splitting broadcast television news based on probability distribution transformation clustering as claimed in claim 1, wherein in step S3 the min-max algorithm is selected as the normalization method for the weighted feature data, computed as: x'_j = (x_j − min(X)) / (max(X) − min(X)), where x_j and x'_j are the data before and after normalization and X is all the data of one feature.
5. The method for breaking news items in broadcast television based on probability distribution transformation clustering as claimed in claim 1, wherein the step S4 comprises the steps of:
s401, taking the normalized news program material data from step S3 as the basic data X, randomly selecting two segments of X as initial center segments, and taking these two initial center segments as the current optimal center segments;
s402, taking each row of the basic data X as a segment, computing the Euclidean distance from each segment to the two initial center segments of step S401, assigning each segment to the class of the nearer initial center segment, and denoting the two classes a and b respectively;
s403, solving for the data transfer matrix A so that the difference between the marginal probability distributions P(A^T X_a) and P(A^T X_b) of the data of a and b is minimal; the data transfer matrix A is computed using the MMD distance between the two classes; X_a and X_b respectively denote the basic data of a and b;
s404, up-dimensioning the basic data X with a Gaussian kernel function to obtain the up-dimensioned data K; the Gaussian kernel function is: K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2));
S406, applying the kmeans algorithm to the data Z = A^T K of the new data space after up-dimensioning, clustering it into new classes a and b, and recording the indexes of the data of the two classes;
s407: using the indexes of the two classes from step S406, locating the corresponding data in the basic data X as the new classes a and b of X;
s408: computing the new a and b cluster center segments from the new classes a and b of the basic data X, and comparing them with the current optimal cluster center segments; if the new centers have moved relative to the current optimal centers, the clustering still has room for iterative optimization, and the method goes to step S403; if they have not moved, the optimal clustering has been found by iteration and the algorithm ends.
6. The broadcast television news splitting method based on probability distribution transformation clustering as claimed in claim 5, wherein
in step S403, the MMD distance, i.e. the distance between the class centers of the two classes a and b, is computed as:
MMD(a, b) = || (1/n) * sum_i A^T x_i − (1/m) * sum_j A^T x_j ||^2
where n and m respectively denote the data volumes of a and b, and i and j respectively index the data of a and b;
then the MMD distance is converted, subject to the constraint that the variances of the data of a and b are unchanged before and after the conversion; meanwhile, a regularization term is added to prevent overfitting; in summary, the objective function is:
min_A tr(A^T X M X^T A) + λ||A||_F^2, subject to A^T X H X^T A = I
where tr() is the trace of a matrix, M is the MMD matrix, H is the centering matrix, I is the identity matrix, and λ is the regularization coefficient;
then the objective function is solved with the Lagrange multiplier method to obtain the transformation matrix A; the solution is that the columns of A are the leading eigenvectors of (X M X^T + λI)^{-1} X H X^T.
7. The method for splitting broadcast television news based on probability distribution transformation clustering as claimed in claim 5, wherein in step S5, according to the clustering results a and b, the number of studio segments appearing in each class is counted; the class with more studio segments is taken as the cut class and the class with fewer as the non-cut class, thereby obtaining the final news program splitting result.
8. The broadcast television news splitting method based on probability distribution transformation clustering as claimed in claim 1, wherein in step S1, audio pause points are used as candidate points for news story segmentation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011555578.5A CN112288047B (en) | 2020-12-25 | 2020-12-25 | Broadcast television news stripping method based on probability distribution transformation clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112288047A true CN112288047A (en) | 2021-01-29 |
CN112288047B CN112288047B (en) | 2021-04-09 |
Family
ID=74426352
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113997989A (en) * | 2021-11-29 | 2022-02-01 | 中国人民解放军国防科技大学 | Safety detection method, device, equipment and medium for single-point suspension system of maglev train |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007036888A2 (en) * | 2005-09-29 | 2007-04-05 | Koninklijke Philips Electronics N.V. | A method and apparatus for segmenting a content item |
CN102523536A (en) * | 2011-12-15 | 2012-06-27 | 清华大学 | Video semantic visualization method |
CN102547139A (en) * | 2010-12-30 | 2012-07-04 | 北京新岸线网络技术有限公司 | Method for splitting news video program, and method and system for cataloging news videos |
KR101382904B1 (en) * | 2012-12-14 | 2014-04-08 | 포항공과대학교 산학협력단 | Methods of online video segmentation and apparatus for performing the same |
CN104182421A (en) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Video clustering method and detecting method |
CN104780388A (en) * | 2015-03-31 | 2015-07-15 | 北京奇艺世纪科技有限公司 | Video data partitioning method and device |
CN107341429A (en) * | 2016-04-28 | 2017-11-10 | 富士通株式会社 | Cutting method, cutting device and the electronic equipment of hand-written adhesion character string |
CN108710860A (en) * | 2018-05-23 | 2018-10-26 | 北京奇艺世纪科技有限公司 | A kind of news-video dividing method and device |
CN109086830A (en) * | 2018-08-14 | 2018-12-25 | 江苏大学 | Typical association analysis based on sample punishment closely repeats video detecting method |
CN110110739A (en) * | 2019-03-25 | 2019-08-09 | 中山大学 | A kind of domain self-adaptive reduced-dimensions method based on samples selection |
CN111126126A (en) * | 2019-10-21 | 2020-05-08 | 武汉大学 | Intelligent video strip splitting method based on graph convolution neural network |
CN111160099A (en) * | 2019-11-28 | 2020-05-15 | 福建省星云大数据应用服务有限公司 | Intelligent segmentation method for video image target |
CN111222499A (en) * | 2020-04-22 | 2020-06-02 | 成都索贝数码科技股份有限公司 | News automatic bar-splitting conditional random field algorithm prediction result back-flow training method |
CN111242110A (en) * | 2020-04-28 | 2020-06-05 | 成都索贝数码科技股份有限公司 | Training method of self-adaptive conditional random field algorithm for automatically breaking news items |
US20200320306A1 (en) * | 2019-04-08 | 2020-10-08 | Baidu.Com Times Technology (Beijing) Co., Ltd. | Method and apparatus for generating information |
Non-Patent Citations (5)
Title |
---|
BOGDAN MOCANU 等: "Automatic Segmentation of TV News into Stories Using Visual and Temporal Information", 《INTERNATIONAL CONFERENCE ON ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEM(ACIVS)》 * |
RAGHVENDRA KANNAO 等: "A system for semantic segmentation of TV news broadcast videos", 《MULTIMEDIA TOOLS AND APPLICATIONS》 * |
WINSTON H.HSU 等: "Discovery and fusion of salient multimodal features toward news story segmentation", 《THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING》 * |
刘智康: "基于语义关系图的新闻事件聚类算法研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
李晨杰 等: "基于音视频特征的新闻拆条算法", 《微型电脑应用》 * |
Also Published As
Publication number | Publication date |
---|---|
CN112288047B (en) | 2021-04-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||