CN101616264B - Method and system for cataloging news video

Info

Publication number: CN101616264B
Application number: CN2008101157870A
Authority: CN
Inventors: 陈众; 张树武; 曾智; 杨武夷
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2008-06-27
Filing date: 2008-06-27
Publication date: 2011-03-30
Anticipated expiration: 2028-06-27
Also published as: CN101616264A

Abstract

The invention relates to a method and a system for cataloging news video. The method automatically catalogs news video based on the caption bars, anchor shots and audio silence points in a news program, and comprises the following steps: separating the news video stream into audio and video data, and matching the audio data against opening theme music to determine the effective time range of the news program within the file; determining the audio silence points, the anchor frames and the times at which caption frames appear within that range, and analyzing them jointly to determine the segmentation time points of the news items; and recognizing the caption text in the video, associating it with the segmentation result, and using the associated caption text as the semantic information of the catalog. The system comprises a segmentation module and an export module connected to a news video segmentation result database, as well as a browse module, a play module and a correction module connected in parallel between the client and that database. The method and the system solve the problems of automatic news story segmentation and automatic semantic labeling of news items, and realize automatic cataloging of news programs with high efficiency and low cost.

Description

Method and system for cataloging news video

Technical field

The invention belongs to the field of video structure analysis and, more specifically, relates to techniques for structuring news video.

Background technology

Video structuring means recovering the structural information that a video acquired when it was shot, such as shots and scenes, and using this information to build indexes for the video so that it is easier to manage and use. A video program can be cut manually into multiple segments according to content, and these segments can be annotated for users to index and use. However, the manual approach costs a great deal of time and labor and is inefficient. Manual annotation is also subjectively inconsistent: for the same video program, different annotators and users understand it differently, so the annotations cannot objectively reflect the true content of the video, which complicates the management of video content.

Because structuring video by hand is impractical, automated methods are used to process video instead. The computing power of a computer is used to structure video and complete work that is difficult to do manually.

Summary of the invention

The present invention addresses the problems of existing manual news cataloging: low efficiency, high cost, and strong dependence on the subjective judgment of the cataloging staff. To this end, the invention provides a method and a system for automatically cataloging news video.

To achieve this purpose, a first aspect of the present invention provides a news video cataloging method that automatically catalogs news video based on the caption bars, anchor shots and audio silence points that appear in a news program. The steps are as follows:

Step 1: Separate the news video stream into audio data and video data.

Step 2: Match the audio data against opening theme music to determine the effective time range of the news program within the file; perform silence detection on the audio data within that range to obtain an audio silence point sequence; perform key frame extraction, anchor frame detection and caption frame detection on the video data within that range, obtaining the silence times, anchor appearance times and caption appearance times within the time range of the news program.

Step 3: Jointly analyze the audio silence point sequence, the anchor appearance times and the caption appearance times to obtain the segmentation time points of the news items; at the same time, recognize and extract the text that appears in the video.

Step 4: Associate the segmentation result of the news program with the recognized text to obtain a news catalog result that carries semantic information.

The processing of the video data comprises the following steps:

Step S2B1: Take the video data produced by the audio/video separation.

Step S2B2: Extract key frames from the video data, to be used for detecting anchor frames and caption frames.

Step S2B3: Detect the time points at which anchor frames appear, based on local feature matching and on the temporal distribution of anchor shots; this yields information that helps determine the start time of each news item.

Step S2B4: Run detection over the key frame set to obtain the caption frames, which are used to determine the number of news items in the news program.

The segmentation time points of the news items are determined as follows:

Step 31: Merge the anchor frames and the caption frames in chronological order into a single mixed sequence M.

Step 32: Use the two kinds of time points in M (anchor and caption), together with the information in the silence point sequence V, to determine the time points at which the news is segmented.

The segmentation time points are determined with rule 1 and rule 2, or with rule 1 and rule 3. The rules are:

Rule 1: A caption frame represents one news item, and the start time of that news item is at or before the point where the caption frame appears.

Rule 2: In the mixed sequence M, if the entry immediately before the current caption frame is an anchor key frame, the current caption frame and that anchor frame belong to the same news item, and the anchor shot is the lead-in shot of that item. The silence point in V that precedes this anchor frame and is closest to it is taken as the start time of the current news item.

Rule 3: In the mixed sequence M, if the entry immediately before the current caption frame is also a caption frame, the two caption frames belong to different news items. The silence point in V that precedes the current caption frame and is closest to it is taken as the start time of the current news item.
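The three rules can be sketched in a few lines of Python. This is only an illustrative reading of the rules, not the patented implementation: the function name `segment_start_times`, the use of plain numbers (seconds) for time points, and the fallback used when no earlier silence point exists are all assumptions made for this sketch.

```python
from bisect import bisect_left


def segment_start_times(anchor_times, caption_times, silence_times):
    """Return one start time per caption frame, following rules 1 to 3."""
    # Mixed sequence M: (time, kind) pairs in chronological order.
    mixed = sorted([(t, "anchor") for t in anchor_times] +
                   [(t, "caption") for t in caption_times])
    silences = sorted(silence_times)  # silence point sequence V

    def nearest_silence_before(t):
        # Latest silence point strictly before t. Falling back to t itself
        # when none exists is an assumption of this sketch.
        i = bisect_left(silences, t)
        return silences[i - 1] if i > 0 else t

    starts = []
    for idx, (t, kind) in enumerate(mixed):
        if kind != "caption":
            continue  # Rule 1: only caption frames define news items.
        prev = mixed[idx - 1] if idx > 0 else None
        if prev is not None and prev[1] == "anchor":
            # Rule 2: the preceding anchor shot introduces this item.
            starts.append(nearest_silence_before(prev[0]))
        else:
            # Rule 3: the preceding entry is another caption frame
            # (or nothing precedes this one).
            starts.append(nearest_silence_before(t))
    return starts
```

For example, with anchors at 10 s and 100 s, captions at 15 s, 60 s and 105 s, and silence points at 8, 55, 58, 97 and 150 s, the sketch yields start times 8, 58 and 97 s for the three news items.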

Key frame extraction operates on the I frames of the video. A window three frames wide slides along the I frame sequence. Within the window, the similarity of the frame-difference target region between the first and second frames, and between the second and third frames, is computed and denoted sim(n, n+1) and sim(n+1, n+2) respectively. Similarity is computed by histogram intersection. Let the color histograms of the frame-difference target region in the three I frames in the window be H_n(k), H_{n+1}(k) and H_{n+2}(k); the similarities are then:

sim(n, n+1) = \frac{\sum_{k=0}^{N-1} \min(H_{n}(k), H_{n+1}(k))}{\sum_{k=0}^{N-1} H_{n}(k)}

sim(n+1, n+2) = \frac{\sum_{k=0}^{N-1} \min(H_{n+1}(k), H_{n+2}(k))}{\sum_{k=0}^{N-1} H_{n+1}(k)}

Here N is the number of color bins in the color histogram. Against a preset I frame similarity threshold T, the following comparison is made: if sim(n, n+1) < T and sim(n+1, n+2) > T, that is, I frame n is dissimilar to I frame n+1 while I frame n+1 is similar to I frame n+2, then I frame n+1 is extracted as a key frame; otherwise I frame n+1 is not a key frame. The window is then moved forward by one frame and the similarity calculation and comparison are repeated. Once the window has traversed the entire I frame sequence, the extracted frames form the key frame set that may contain anchor shots or captions.

Anchor frame detection is performed on the extracted key frame set. Face detection is used to filter the extracted key frames, and the key frames that contain a face form a new face key frame set. After visual features have been extracted from certain regions of the face key frames, a local feature point detection algorithm detects local feature points within a specific region of each face key frame. Taking one key frame as the reference, its local feature points are matched against those of the other key frames, and the groups of key frames that match a sufficient number of local feature points are kept as candidate anchor key frame groups. Before local feature point matching is applied to two face key frames, color histogram similarity is used to check whether the two frames could be similar; if histogram matching finds them dissimilar, local feature point matching is skipped for that pair. Based on the temporal distribution of anchor key frames across the whole program, a group of key frames whose time span in the video exceeds a certain threshold is regarded as a candidate group of anchor key frames; otherwise the group is regarded as not being anchor key frames and is discarded. Finally, the candidate groups that contain a single anchor and the groups that contain two anchors are considered together to decide which frames are anchor key frames.

To achieve this purpose, a second aspect of the present invention provides a news video cataloging system, comprising the following.

The output of the segmentation module is connected to the input of the news video segmentation result database and outputs the segmentation result obtained by fusing audio and video features.

The output of the news video segmentation result database is connected to the input of the export module. The export module receives the segmentation result and writes the news video catalog result to XML files outside the system, so that these XML files can be loaded into other systems, which thereby obtain the catalog result.

The browse module, play module and correction module are connected in parallel between the user side and the news video segmentation result database.

The browse module receives the number of the news item that the user wants to browse, retrieves the catalog information of that item from the database, and presents it to the user, including the item's segmentation time points, headline and content description.

The play module receives the number of the news item that the user wants to play, retrieves the file path and time range of that item from the database, and plays its picture and sound to the user.

The correction module receives the number of the news item that the user wants to correct, retrieves the existing catalog information of that item from the database, shows it to the user, and writes the corrected catalog information back to the database.

The segmentation module comprises an audio/video data separation unit whose output is connected to the input of an audio/video feature fusion unit. The separation unit receives the news video stream, separates it into audio data and video data, and outputs them. The feature fusion unit receives the audio data and the video data, generates the segmentation result from them, and outputs it.

The audio/video data separation unit further comprises:

An audio data subunit, which has an opening theme music matching part and a silence point detection part connected in parallel; and a video data subunit, which has an anchor frame detection part, a caption bar frame detection part and a caption text recognition part connected in parallel.

The browse module comprises a text title browsing subunit and a key frame image browsing subunit connected in parallel, which present the news catalog result to the user in different forms.

The correction module comprises a news item split/merge subunit, a news item time point correction subunit and a news item text correction subunit connected in parallel, which correct, from different angles, the problems that may arise during automatic news cataloging.

Beneficial effects of the present invention: the invention automatically catalogs news programs using the silence point information, anchor information and caption information in the program. It solves the problems of automatic news story segmentation and automatic semantic labeling of news items, and realizes automatic cataloging of news programs with high efficiency and low cost. The scheme also uses XML as an intermediate medium to exchange data and share information between the cataloging system and other video-on-demand systems.
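The patent does not specify the layout of the exported XML. As a rough illustration of the export step, the sketch below serializes a list of cataloged news items with Python's standard `xml.etree.ElementTree`. The element and attribute names (`newsCatalog`, `newsItem`, `start`, `end`, `title`) and the function name `export_catalog` are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET


def export_catalog(items, source_file):
    """Serialize catalog items to an XML string (hypothetical schema)."""
    root = ET.Element("newsCatalog", {"source": source_file})
    for number, item in enumerate(items, start=1):
        node = ET.SubElement(root, "newsItem", {"id": str(number)})
        # Segmentation time points in seconds from the start of the file.
        ET.SubElement(node, "start").text = f"{item['start']:.2f}"
        ET.SubElement(node, "end").text = f"{item['end']:.2f}"
        # Headline text recognized from the caption bar.
        ET.SubElement(node, "title").text = item["title"]
    return ET.tostring(root, encoding="unicode")
```

A receiving system would parse the same file with `ET.fromstring` or `ET.parse` and read the time points and titles back, which is the kind of data exchange the XML intermediary is meant to support.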

Description of drawings

Fig. 1 is a flow chart of the news cataloging scheme of the present invention.

Fig. 2 shows the target region used for frame difference calculation in the present invention.

Fig. 3 shows three consecutive frames of the I frame sequence forming a window in the present invention.

Fig. 4 is a structure diagram of the news cataloging system of the present invention.

Fig. 5 shows the user interface of the news cataloging system of the present invention.

Embodiment

The details of the technical solution of the present invention are described below with reference to the accompanying drawings. It should be noted that the described embodiments are intended only to aid understanding of the invention and do not limit it in any way.

The present invention proposes a method for automatically cataloging news video, shown in Fig. 1. The method catalogs news video programs automatically and recognizes the caption text in the program, which serves as the semantic information of each news story. The cataloging method works mainly by detecting the appearance of caption bars, anchors and audio silence points in the news video and, through joint analysis of this information, determining the segmentation time points and the headline information. The automatic news cataloging system supports operations on video files such as cataloging, browsing, playing, correcting, annotating and exporting the catalog result data. The system uses XML files as an intermediary to exchange data with existing systems.

1. News video cataloging method

The cataloging process is divided into the following steps: audio/video data separation, opening theme music matching, silence point detection, key frame extraction, anchor frame detection, caption frame detection, caption text recognition, determination of segmentation time points from the combined audio and video information, and association of news items with text information.

(1) Audio/video data separation:

The cataloging scheme proposed by the present invention analyzes news content using both picture and sound information. Therefore, before the actual cataloging computation, the audio data and the video data in the video file are extracted separately for use by the subsequent audio and video processing.

(2) Opening theme music matching: The audio data is matched against opening theme music to determine the effective time range of the news program within the file. Silence detection is performed on the audio data within that range to obtain an audio silence point sequence. Key frame extraction, anchor frame detection and caption frame detection are performed on the video data within that range, obtaining the silence times, anchor appearance times and caption appearance times within the time range of the news program. The audio silence point sequence, the anchor appearance times and the caption appearance times are analyzed jointly to obtain the segmentation time points of the news items; at the same time, the text that appears in the video is recognized and extracted.

The processing of the audio data comprises the following steps. Step S2A1: Take the audio data produced by the audio/video separation. Step S2A2: Extract frequency-domain difference features from the audio data and match the resulting audio features against the features of the opening theme music templates, to find the start time of the news program within its file and to obtain the audio silence point sequence; the type of news program is identified at the same time. Step S2A3: Sample the audio stream discretely and divide it into multiple short-time audio frames, with a certain overlap between adjacent frames; perform silence detection on the audio data using short-time average energy to find the possible segmentation time points of the news items.

The video data processing steps comprise:

News video is usually obtained by recording news from television broadcasts. To ensure that the program is recorded in full, some extra content is generally recorded before the news program begins and after it ends. In this case, the actual news program occupies some undetermined position within the video file. Before the news video is cataloged, the time range of the news program within the file must first be determined; only then can the valid data in that range be cataloged.

The cataloging method proposed by the present invention uses some prior knowledge about news programs as program parameters. Using prior knowledge simplifies the cataloging scheme and sidesteps some problems that fully automatic algorithms do not solve well, such as locating text in video, so that the method is practical. Different types of news have different temporal and spatial structures, so the program parameters used for cataloging also differ between types. Therefore, before the news video is cataloged, the type of the news program being processed must first be determined.

Every news program begins with a piece of opening theme music, and different news programs have different theme music. Based on this property, matching against program theme music makes it possible to find the start time of the news program in the file and to identify the program type at the same time.

The opening theme music of the common news programs is saved in advance as templates. To determine the start time and type of the news program contained in a file, each of these templates is matched against the audio data in the file. The audio spectrum difference feature is used as the feature vector for theme music matching.

The similarity between two audio fragments can be computed from their feature vectors:

Sim(a_{1}, a_{2}) = 1 - \frac{HD(H_{1}, H_{2})}{N}

where a_1 and a_2 denote the two audio fragments; H_1 and H_2 denote the N-dimensional feature vectors extracted from a_1 and a_2 respectively; and HD() denotes the Hamming distance between two vectors.

Suppose there are P known types of news program, News_1, News_2, ..., News_P, with corresponding theme music templates HM_1, HM_2, ..., HM_P. Template HM_1 is slid along the audio stream to be matched, starting from its beginning and advancing one frame at a time, and a match is computed at each step. If the similarity between HM_1 and the audio fragment at the current position exceeds a predefined threshold, a possible theme music start point is considered found; the current sliding match stops, and the start time is recorded as ST_1 and the similarity as Sim_1. After sliding matches have been performed with all the theme music templates, a similarity sequence Sim_1, Sim_2, ..., Sim_P is obtained. If the maximum among them is Sim_k, then ST_k is selected as the start time of the news program in the file, and the news type is News_k.

From the news type obtained, the duration of the news program is known; combined with the start time obtained, this gives the time range of the news program within the video file.
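The sliding template match can be sketched as follows. The sketch assumes that the spectrum difference features have already been reduced to sequences of bits, one per audio frame, so that a template of N bits is compared with each N-bit window of the stream; that representation, the function names and the default threshold are assumptions of this example, not details given in the patent.

```python
def hamming_similarity(h1, h2):
    """Sim = 1 - HD(h1, h2) / N for two equal-length bit vectors."""
    if len(h1) != len(h2) or not h1:
        raise ValueError("feature vectors must be non-empty and equal in length")
    distance = sum(a != b for a, b in zip(h1, h2))
    return 1.0 - distance / len(h1)


def find_theme(stream_bits, template_bits, threshold):
    """Slide the template frame by frame; stop at the first position whose
    similarity exceeds the threshold. Returns (start, similarity) or None."""
    n = len(template_bits)
    for start in range(len(stream_bits) - n + 1):
        sim = hamming_similarity(stream_bits[start:start + n], template_bits)
        if sim > threshold:
            return start, sim
    return None


def identify_program(stream_bits, templates, threshold=0.9):
    """Try every template and keep the one with the highest similarity.
    Returns (program_type, start, similarity), or None if nothing matched."""
    best = None
    for name, template_bits in templates.items():
        hit = find_theme(stream_bits, template_bits, threshold)
        if hit is not None and (best is None or hit[1] > best[2]):
            best = (name, hit[0], hit[1])
    return best
```

The returned start index is in frames; multiplying by the frame hop gives the start time ST_k, and adding the known duration of program type News_k gives the end of the effective time range.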

(3) Silence point detection: The audio silence point sequence, the anchor appearance times and the caption appearance times are analyzed jointly to obtain the segmentation time points of the news items; at the same time, the text that appears in the video is recognized and extracted.

In a news video, the anchor's report or a background voice-over is present most of the time. Where one news item gives way to the next, the report or voice-over pauses, and the audio stream contains a clearly noticeable silent fragment. This silent fragment helps determine the segmentation time point between news items.

The short-time average energy method is used to detect silence points in the audio data and find the possible segmentation time points of the news items. Short-time average energy is the average energy of the sampled signal within one short-time audio frame. Let x denote a continuous audio signal stream; x is sampled discretely and divided into multiple short-time audio frames, with a certain overlap between adjacent frames. The short-time average energy of the m-th audio frame is then:

E_{m} = \frac{\sum_{n=0}^{N-1} [x(n)]^{2}}{N}

where E_m is the short-time average energy of the m-th audio frame, N is the number of sample points in the m-th frame, and x(n) is the value of the n-th sample point in the m-th frame.

If the average energy of a short-time audio frame is below a preset threshold, the frame is judged to be silent; otherwise it is non-silent. For a small audio fragment, if the proportion of its short-time audio frames judged silent exceeds a certain ratio, the fragment is judged to be a silent fragment.
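The energy computation and the two-level decision (frame, then fragment) can be sketched in plain Python. The frame length, hop size and both thresholds are parameters that the patent leaves open, so the names and the values used in the example below are illustrative assumptions.

```python
def short_time_energy(samples, frame_len, hop):
    """E_m = sum(x(n)^2) / N for each overlapping frame of frame_len samples.
    A hop smaller than frame_len makes adjacent frames overlap."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def is_silent_segment(energies, energy_threshold, min_silent_ratio):
    """A fragment is silent when the share of its frames whose energy is
    below energy_threshold exceeds min_silent_ratio."""
    if not energies:
        return False
    silent = sum(e < energy_threshold for e in energies)
    return silent / len(energies) > min_silent_ratio
```

With eight zero samples followed by eight samples of amplitude 1.0, a frame length of 4 and a hop of 2, the first three frames have energy 0 and the last three have energy 1, so only the leading part of the signal is classified as silent.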

(4) Video key frame extraction:

Before the video data is processed, key frames are extracted first; the key frames then stand in for the full video data in subsequent computation. Because key frames eliminate redundant data, they greatly reduce the subsequent amount of computation, so key frame extraction is a very important step.

The key frame extraction here mainly prepares for the subsequent anchor frame detection and caption frame detection. The goal is to extract the frames that may contain an anchor or a caption; it is not necessary to extract representative frames for every distinct picture content. The key frames extracted this way are far fewer than key frames in the ordinary sense, which further reduces the subsequent amount of computation.

Because this special kind of key frame extraction is applied to news programs, some prior knowledge about news programs can be used to improve on traditional video key frame extraction. Since only the frames that may contain an anchor or a caption need to be extracted, the frame difference calculation needs to consider only the change in a small region that reflects the appearance of an anchor or a caption, rather than the change in the whole video frame. This reduces the number of pixels that take part in the frame difference calculation and therefore the amount of computation. As shown in Fig. 2, the white rectangular region at the lower left of the picture is selected as the target region for calculating the frame difference. In Fig. 2, (a) shows on-location news footage, (b) shows a caption, (c) shows a male anchor, and (d) shows a female anchor.

As the figure shows, when the video changes from a picture without a caption to one with a caption, or from a non-anchor picture to an anchor picture, the visual content of the selected small rectangular region changes noticeably. The region therefore satisfies the principle that the frame difference target region should reflect the appearance of a caption or an anchor. When the video switches between the four types of picture above, the color features of the frame difference target region change significantly, so the color histogram of this region is chosen as the feature vector for calculating the frame difference.

Key frame extraction operates on the I frames (intra-coded pictures) of the video. As shown in Fig. 3, the difference in picture content between adjacent frames is used to decide whether a key frame is present. A window three frames wide slides along the I frame sequence. Within the window, the similarity of the frame-difference target region between the first and second frames, and between the second and third frames, is computed and denoted sim(n, n+1) and sim(n+1, n+2) respectively. Similarity is computed by histogram intersection. Let the color histograms of the frame-difference target region in the three I frames in the window be H_n(k), H_{n+1}(k) and H_{n+2}(k); the similarities are then:

sim(n, n+1) = \frac{\sum_{k=0}^{N-1} \min(H_{n}(k), H_{n+1}(k))}{\sum_{k=0}^{N-1} H_{n}(k)}

sim(n+1, n+2) = \frac{\sum_{k=0}^{N-1} \min(H_{n+1}(k), H_{n+2}(k))}{\sum_{k=0}^{N-1} H_{n+1}(k)}

where N is the number of color bins in the color histogram.

Against a preset I frame similarity threshold T, the following comparison is made: if sim(n, n+1) < T and sim(n+1, n+2) > T, that is, I frame n is dissimilar to I frame n+1 while I frame n+1 is similar to I frame n+2, then I frame n+1 is extracted as a key frame; otherwise I frame n+1 is not a key frame. The window is then moved forward by one frame and the similarity calculation and comparison are repeated. Once the window has traversed the entire I frame sequence (n = 1, 2, 3, 4, ...), the extracted frames form the key frame set that may contain anchor shots or captions.

This method cannot directly extract only the frames that contain an anchor or a caption, but it does yield a superset of them, and the number of frames in this superset is far smaller than the number of all video frames, or of all I frames, in the video file. This greatly reduces the number of frames that go on to anchor frame detection and caption frame detection, and therefore the amount of computation.
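A minimal sketch of the sliding-window test follows. It assumes the color histogram of the frame-difference target region has already been computed for each I frame and is passed in as a list of bin counts; decoding the I frames and cropping the region are outside the sketch, and the function names are illustrative. Indices here are 0-based, whereas the text counts I frames from 1.

```python
def histogram_intersection(h_a, h_b):
    """sim = sum_k min(h_a[k], h_b[k]) / sum_k h_a[k]."""
    total = sum(h_a)
    if total == 0:
        return 0.0  # empty reference histogram: treat as dissimilar
    return sum(min(a, b) for a, b in zip(h_a, h_b)) / total


def extract_key_frames(histograms, threshold):
    """Return the indices of I frames selected as key frames.

    The middle frame of each 3-frame window is a key frame when it differs
    from the frame before it and resembles the frame after it."""
    keys = []
    for n in range(len(histograms) - 2):
        sim_prev = histogram_intersection(histograms[n], histograms[n + 1])
        sim_next = histogram_intersection(histograms[n + 1], histograms[n + 2])
        if sim_prev < threshold and sim_next > threshold:
            keys.append(n + 1)
    return keys
```

The second condition is what keeps the selection small: a frame that differs from both neighbors, such as a single transitional frame, is not selected, while the first frame of a stable new picture is.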

(5) Anchor frame detection:

The appearance of an anchor frame usually marks the beginning of a news item, so the start time of a news item can be determined by detecting the time points at which anchor frames appear.

The present invention detects anchor frames using a method based on face detection and local feature point matching. The method rests on the following assumptions: (1) a news program has one or two anchors, an anchor appears several times in the same program, and there is a long interval between the first and last appearance; (2) the anchor appears in the picture from the waist up, facing the camera; (3) when the same anchor appears at different points in the program, upper-body posture and gestures change only slightly; (4) within the same program, the anchor's clothing does not change, but the background may change considerably.

Anchor frame detection is performed on the extracted key frame set. Face detection is used to filter the extracted key frames; only the key frames that contain a face are kept, and these form a new face key frame set. After visual features have been extracted from certain regions of the face key frames, a local feature point detection algorithm detects local feature points within a specific region of each face key frame. Taking one key frame as the reference, its local feature points are matched against those of the other key frames, and the groups of key frames that match a sufficient number of local feature points are kept as candidate anchor key frame groups. Note that before local feature point matching is applied to two face key frames, color histogram similarity can first be used to check whether the two frames could be similar. If histogram matching finds them dissimilar, local feature point matching is not needed for that pair, which reduces the work of local feature point detection and matching. This pre-check is worthwhile because color histogram matching is far cheaper than local feature point detection and matching. Based on the temporal distribution of anchor key frames across the whole program, a group of key frames whose time span in the video exceeds a certain threshold is regarded as a candidate group of anchor key frames; otherwise the group cannot be anchor key frames and is discarded. Finally, the candidate groups that contain a single anchor and the groups that contain two anchors are considered together to decide which frames are anchor key frames.
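The grouping and time-span filtering can be sketched as below. The patent does not name a particular face detector or local feature algorithm, so the sketch takes the two comparisons as caller-supplied functions (`hist_similar` and `features_match`, both hypothetical names) and uses a simple greedy grouping that is only one possible way to form the groups.

```python
def group_key_frames(frames, hist_similar, features_match):
    """Greedy grouping of face key frames.

    Each frame joins the first group whose reference frame passes the cheap
    histogram check and then the costlier local feature match; `and`
    short-circuits, so the feature match is skipped when histograms differ."""
    groups = []
    for frame in frames:
        for group in groups:
            ref = group[0]
            if hist_similar(ref, frame) and features_match(ref, frame):
                group.append(frame)
                break
        else:
            groups.append([frame])  # no group matched: start a new one
    return groups


def anchor_candidates(groups, min_span):
    """Keep groups whose first-to-last time span exceeds min_span."""
    kept = []
    for group in groups:
        times = [f["time"] for f in group]
        if max(times) - min(times) > min_span:
            kept.append(group)
    return kept
```

In this sketch each frame is a dictionary with at least a `time` entry in seconds. A group that recurs across most of the program (the anchor) survives the span filter, while a group confined to one short stretch (an interviewee, for instance) is dropped.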

(6) Text information frame detection:

Observation of a large number of news video programs shows that every news item in a program is accompanied by related text information describing its content. Because text information corresponds one-to-one with news items, the number of news items in a news program can be determined by detecting the text information.

Text information frame detection is carried out on the extracted key frame set. In a news program of a given type, the spatial position in the video frame of the text information describing the news content is fixed. This prior knowledge can be used to mark the text display region in the video frame and to use that region as the effective region for computing the similarity of two frames when detecting text information frames. In other words, the similarity between two frames depends only on the marked region, and content outside it does not take part in the similarity computation. This region is called the "text information target area".

Text information frame templates for common types of news program are stored in advance. When text information frames are detected, the relevant template is selected according to the news program type determined by opening theme music matching. The similarity between the text information target area of the template and that of each key frame is computed, and every key frame whose similarity exceeds a given threshold is selected as a text information frame.

Similarity is computed by color histogram intersection. Let H_model(k) be the color histogram of the template's text information target area and H_i(k) the color histogram of the text information target area of the i-th key frame. The similarity between the template and the i-th key frame is then:

sim(model, i) = \frac{\sum_{k=0}^{N-1} \min(H_{model}(k), H_{i}(k))}{\sum_{k=0}^{N-1} H_{model}(k)}

Here sim(model, i) is the similarity between the text information frame template and the i-th key frame, and N is the number of color bins in the color histogram.

Let T be a preset similarity threshold. If sim(model, i) > T, the i-th key frame is regarded as a text information frame; otherwise it is not a text information frame and is discarded.
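The thresholding step can be illustrated with a short sketch. It assumes the histograms of the target areas have already been computed and are given as plain lists of bin counts; the function names and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def histogram_intersection(h_model: Sequence[float], h_frame: Sequence[float]) -> float:
    """sim(model, i): sum of bin-wise minima divided by the template's total count."""
    total = sum(h_model)
    if total == 0:
        raise ValueError("template histogram is empty")
    return sum(min(m, f) for m, f in zip(h_model, h_frame)) / total


def detect_text_frames(
    h_model: Sequence[float],
    key_frame_hists: List[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return the indices of key frames whose similarity to the template exceeds T."""
    return [
        i for i, h in enumerate(key_frame_hists)
        if histogram_intersection(h_model, h) > threshold
    ]


# Made-up example: three key frames compared with one template, T = 0.8.
template = [4, 4, 2]
key_frames = [[4, 4, 2], [0, 0, 10], [3, 5, 2]]
text_frame_indices = detect_text_frames(template, key_frames, threshold=0.8)
# The similarities are 1.0, 0.2 and 0.9, so frames 0 and 2 are kept.
```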

(7) Determining segmentation time points from combined audio and video information:

Determining the news item segmentation time points comprises the following steps. Step 31: the host frames and text information frames are merged in chronological order into a mixed sequence M. Step 32: using the two classes of time points in the mixed sequence M, host and text information, together with the information in the mute point sequence V, the time points at which the news is segmented are determined.

The preceding processing yields the audio mute point sequence, the sequence of host appearance times and the sequence of text information appearance times in the news program. Combining the information in these three sequences determines the number of news items in the program and the start time of each news item within the file.

Every news item must be accompanied by text information describing its content; this is the basic principle on which the news program is segmented. A detected text information frame therefore establishes the existence of a news item. The host appearance times and audio mute points help determine the exact start time of each news item.

The host frames and text information frames are merged in chronological order into one sequence, called sequence M. The mute point sequence detected in (3) is called V. Using the host and text information time points in sequence M together with the information in the mute point sequence V, the segmentation time points are determined according to the following rules:

Rule 1: A text information frame represents one news item; the start time of that news item is at or before the point where the text information frame appears.

Rule 2: In sequence M, if the element immediately preceding the current text information frame is a host key frame, the current text information frame and that host frame belong to the same news item, and the host shot is the lead-in shot of that item. The mute point in sequence V that precedes this host frame and is nearest to it is taken as the start time of the current news item.

Rule 3: In sequence M, if the element immediately preceding the current text information frame is also a text information frame, the two text information frames belong to different news items. The mute point in sequence V that precedes the current text information frame and is nearest to it is taken as the start time of the current news item. The news segmentation time points are determined using Rule 1 and Rule 2, or Rule 1 and Rule 3.
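The three rules can be sketched as a single pass over sequence M. The sketch below is illustrative only: the data layout, the function name and two fallbacks that the rules leave open (a text information frame that is the first element of M, and an anchor with no earlier mute point) are assumptions.

```python
from bisect import bisect_left
from typing import List, Tuple


def segment_start_times(
    mixed: List[Tuple[float, str]],   # sequence M: (time, "host" or "text"), time-sorted
    mute_points: List[float],         # sequence V: mute point times, sorted
) -> List[float]:
    """Return one start time per text information frame (Rule 1: one frame, one item)."""
    starts = []
    prev = None
    for time, kind in mixed:
        if kind == "text":
            if prev is not None and prev[1] == "host":
                anchor = prev[0]      # Rule 2: search backwards from the host frame
            else:
                # Rule 3: search backwards from the text frame itself.
                # Assumption: also used when the text frame opens sequence M.
                anchor = time
            # bisect_left counts the mute points strictly before the anchor.
            idx = bisect_left(mute_points, anchor)
            # Assumption: fall back to the anchor if no mute point precedes it.
            starts.append(mute_points[idx - 1] if idx > 0 else anchor)
        prev = (time, kind)
    return starts


# Made-up example: times in seconds.
M = [(10, "host"), (15, "text"), (60, "text"), (100, "host"), (104, "text")]
V = [8, 55, 58, 98, 103]
item_starts = segment_start_times(M, V)   # [8, 58, 98]
```

In the example, the first and third items start at the mute point before their host frame (Rule 2), while the second item starts at the mute point nearest before its own text information frame (Rule 3).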

(8) Text information recognition:

The text information in news video carries rich semantic content and describes the content of the corresponding news item. It can be extracted from the video and used as part of the news cataloging result.

(9) Associating news items with text information:

The OCR result for a text information frame contains not only the recognized text but also the time at which the text information frame appears. This time stamp is used to associate the recognized text with the news item it describes, giving a news cataloging result with textual description.
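A time-stamp-based association of this kind might look like the sketch below. It assumes each news item runs from its start time to the start time of the next item, which the text does not state explicitly; the function name and data layout are likewise hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def associate_text(
    item_starts: List[float],               # sorted start times of the news items
    ocr_results: List[Tuple[float, str]],   # (appearance time, recognized text)
) -> Dict[float, List[str]]:
    """Attach each recognized text to the news item whose time range contains it."""
    catalog = {start: [] for start in item_starts}
    for time, text in ocr_results:
        # Index of the last item that starts at or before this time stamp.
        idx = bisect_right(item_starts, time) - 1
        if idx >= 0:                        # text before the first item is ignored
            catalog[item_starts[idx]].append(text)
    return catalog
```

For instance, with item start times `[8, 58, 98]`, a text recognized at 60 seconds is attached to the item starting at 58.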

2. Functional module design of the cataloging system

Hardware and software environment of the system of the present invention: the system is developed and run on a conventional microcomputer with an Intel Pentium 4 processor and the Windows XP operating system. The development languages are C++ and Java, the development tools are VC6.0 and Eclipse, and the database is SQL Server 2000.

The structure of the news cataloging system of the present invention is shown in Figure 4. The system is mainly divided into five modules, namely segmentation module 1, export module 3, browsing module 4, playing module 5 and correction module 6, together with the news video segmentation result database 2 and user 7.

The output of the segmentation module is connected to the input of the news video segmentation result database and is used to output the segmentation result obtained by fusing audio and video features;

The output of the news video segmentation result database is connected to the input of the export module, which receives the segmentation result obtained by fusing audio and video features. The export module outputs the news video cataloging result to XML files outside the system; these XML files can be loaded into other systems so that those systems obtain the news video cataloging result;

The browsing module, the playing module and the correction module are connected in parallel between the user side and the news video segmentation result database;

The browsing module receives the number of the news item that the user asks to browse and receives the cataloging information of that news item from the news video segmentation result database. It outputs to the user the cataloging information of the specified news item, including its segmentation time points, news headline and news content description;

The playing module receives the number of the news item that the user asks to play and receives the file path and time range of that news item from the news video segmentation result database. It plays the picture and sound of that news item to the user;

The correction module receives the number of the news item that the user asks to correct and receives the existing cataloging information of that news item from the news video segmentation result database. It shows the existing cataloging information to the user and outputs the corrected cataloging information of the news item to the news video segmentation result database.

(1) Segmentation module 1 is the core module of the system. It extracts audio data and video data from the news video stream; performs opening theme music matching and mute point detection on the audio data to obtain audio feature information; and performs host frame detection, title bar frame detection and caption text recognition on the video data to obtain visual feature information. The audio and video feature information is fused according to certain rules to determine the segmentation time points of the news items. The segmentation result mainly comprises the start and end time points of each news item and its headline information; these results are stored in the news video segmentation result database 2 to support later service functions.

Segmentation module 1 comprises an audio/video data separation unit 11 and an audio/video feature fusion unit 12, the output of the audio/video data separation unit 11 being connected in series to the input of the audio/video feature fusion unit 12. The audio/video data separation unit 11 receives the news video stream and separates it into audio data and video data, which it outputs; the audio/video feature fusion unit 12 receives the audio data and video data and generates and outputs the segmentation result from them. The audio/video data separation unit 11 further comprises: an audio data sub-unit 1a, having an opening theme music matching part and a mute point detection part, the two parts being connected in parallel; and a video data sub-unit 1b, having a host frame detection part, a title bar frame detection part and a caption text recognition part, the three parts being connected in parallel.

The audio/video data separation unit 11 separates the video stream into audio data and video data. The audio data is used for opening theme music matching and mute point detection; the video data is used for host frame detection, text information frame detection and text information recognition. The comprehensive analysis module fuses the audio and video information to obtain the news segmentation result.

(2) Browsing module 4 provides two browsing modes, text and picture. In text mode the user can quickly read the headline of each news item and grasp the general content of the news; in picture mode the user can browse the key frame pictures of a news item and gain a direct impression of its content, much like the illustrations accompanying news in a newspaper.

The module has two sub-modules, a text headline browsing sub-unit and a key frame image browsing sub-unit. The two are parallel to each other, and each presents the news cataloging result to user 7 in a different form.

(3) Playing module 5 uses the system's built-in video player to play back the news item specified by user 7, giving user 7 the detailed content of the news report.

(4) Correction module 6 provides two functions: editing item text information and editing item start and end time points. Text information editing lets user 7 revise the automatically recognized item titles and add other related text information to an item. Start and end time point editing lets user 7 modify the start time and end time of an item, and also delete and add items; when automatic segmentation yields inaccurate item time points, they can be corrected manually.

The module has three sub-modules: a news item splitting or merging sub-unit, a news item time point correction sub-unit and a news item text information correction sub-unit. The three are parallel to one another, and each corrects, from a different angle, problems that may arise during automatic news cataloging.

(5) Cataloging result export module 3 outputs the news video cataloging results in the news video segmentation result database 2 to XML files outside the system. Loading these XML files into another system allows that system to obtain the news video cataloging results.

The cataloging result export function saves the exported cataloging results in XML files outside the system.
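As an illustration of such an export, the sketch below serializes a list of cataloged news items to XML with Python's standard `xml.etree.ElementTree` module. The patent does not specify an XML schema, so the element and attribute names (`news_program`, `news_item`, `start_time`, `end_time`, `title`) and the function name are purely hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def catalog_to_xml(program_name: str, items: List[Dict]) -> str:
    """Serialize cataloged news items to an XML string (hypothetical schema)."""
    root = ET.Element("news_program", name=program_name)
    for number, item in enumerate(items, start=1):
        node = ET.SubElement(root, "news_item", number=str(number))
        # Start and end time points in seconds from the beginning of the file.
        ET.SubElement(node, "start_time").text = f"{item['start']:.2f}"
        ET.SubElement(node, "end_time").text = f"{item['end']:.2f}"
        # Headline text recognized from the text information frame.
        ET.SubElement(node, "title").text = item["title"]
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to a file outside the system and parsed by any XML-aware consumer; special characters in titles are escaped by the serializer.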

3. System interface layout

The system interface is shown in Figure 5. The left side of the interface holds the news video file list, organized by the TV station to which each news program belongs. At the top left is the TV station directory tree and below it the news program file list; when a TV station node is selected in the directory tree, the news video file list is updated accordingly to show the news program files belonging to that station. Each file node can be expanded to show the titles of the news programs the file contains. A news program node can be expanded further to show the headlines of the news items obtained by cataloging that program. The right side of the interface is the news item key frame display panel, which gives a clear visual synopsis of the news items as pictures. At the top of the middle section is the video player, which plays the news footage selected in the file list on the left or in the key frame panel on the right, so that the user can view the detailed content of the news. Below the player is a panel showing information on the news item currently playing, where the user can read or modify the time information and semantic information associated with the item. News cataloging and cataloging result export are invoked through menu items in the File menu.

Table 1. News cataloging experimental results

News program	Actual news items	Detected news items	Missed items	Falsely detected items
News 30 minutes-1	18	18	0	0
News 30 minutes-2	26	26	0	0
News hookup-1	32	29	3	0
News hookup-2	40	40	0	0
News when international	8	8	0	0
The Zhejiang news hookup	18	17	1	0
Summer is looked news	9	9	0	0
The Xinjiang news hookup	17	13	4	0
Zun Yi news hookup	8	8	0	0
Zhengzhou news	14	14	0	0
Total	190	182	8	0
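The totals in Table 1 can be checked, and an overall detection rate derived, with a few lines of arithmetic. The sketch below is an illustration only; the variable names are hypothetical, and the detection rate is computed as detected items over actual items, which is meaningful here because the table reports no falsely detected items.

```python
# Rows of Table 1 in table order: (actual, detected, missed, falsely detected).
rows = [
    (18, 18, 0, 0), (26, 26, 0, 0), (32, 29, 3, 0), (40, 40, 0, 0),
    (8, 8, 0, 0), (18, 17, 1, 0), (9, 9, 0, 0), (17, 13, 4, 0),
    (8, 8, 0, 0), (14, 14, 0, 0),
]

# Column-wise sums reproduce the "Total" row of the table.
totals = tuple(sum(column) for column in zip(*rows))
actual, detected, missed, false_detected = totals

# Share of actual news items that were detected.
detection_rate = detected / actual

print(totals)                     # (190, 182, 8, 0)
print(f"{detection_rate:.1%}")    # 95.8%
```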

Claims

1. A news video cataloging method, characterized in that the news video is automatically cataloged on the basis of the caption bars, host and audio mute point information appearing in the news program, with the following steps:

Step 1: perform audio/video data separation on the news video stream to obtain audio data and video data;

Step 2: perform opening theme music matching on the audio data to determine the effective time range of the news program in the file; perform mute point detection on the audio data within the time range of the news program to obtain the audio mute point sequence; perform key frame extraction, host frame detection and text information frame detection on the video data within the time range of the news program, thereby obtaining the mute times, host appearance times and text information appearance times within the time range of the news program;

Step 3: perform comprehensive analysis on the audio mute point sequence, the host appearance times, the text information appearance times and the rules: the host frames and text information frames are merged in chronological order into a mixed sequence M; step 32: using the two classes of time points in the mixed sequence M, host and text information, together with the information in the mute point sequence V, obtain the news item segmentation time points; at the same time, recognize the text information appearing in the video and extract it;

The rules are Rule 1, Rule 2 and Rule 3, and the news item segmentation time points are determined using Rule 1 and Rule 2, or Rule 1 and Rule 3. Rule 1: a text information frame represents one news item, and the start time of that news item is at or before the point where the text information frame appears. Rule 2: in the mixed sequence M, if the element immediately preceding the current text information frame is a host key frame, the current text information frame and that host frame belong to the same news item, and the host shot is the lead-in shot of that item; the mute point in the mute point sequence V that precedes this host frame and is nearest to it is taken as the start time of the current news item. Rule 3: in the mixed sequence M, if the element immediately preceding the current text information frame is also a text information frame, the two text information frames belong to different news items; the mute point in the mute point sequence V that precedes the current text information frame and is nearest to it is taken as the start time of the current news item;

Step 4: associate the segmentation result of the news program with the recognized text information to obtain a news cataloging result with semantic information.

2. The news video cataloging method according to claim 1, characterized in that the processing of the video data comprises the following steps:

Step S2B1: extract the video data obtained by the audio/video data separation;

Step S2B2: extract key frames from the video data, for use in detecting host frames and text information frames;

Step S2B3: detect the time points at which host frames appear, on the basis of local feature matching and the temporal distribution characteristics of the host, to generate information that helps determine the start time of each news item;

Step S2B4: perform detection on the key frame set to obtain the text information frames, which are used to determine the number of news items contained in the news program.

3. The news video cataloging method according to claim 1, characterized in that the video key frame extraction is as follows: key frames are extracted on the basis of the video I-frames; a window three frames in size slides along the I-frame sequence, and the similarity of the frame difference target areas of the first and second frames in the window and the similarity of the frame difference target areas of the second and third frames are computed, denoted sim(n, n+1) and sim(n+1, n+2) respectively; the similarity is computed by histogram intersection; letting the color histograms of the frame difference target areas of the three I-frames in the window be H_n(k), H_{n+1}(k) and H_{n+2}(k) respectively, the formulas for the similarity are:

sim(n, n+1) = \frac{\sum_{k=0}^{N-1} \min(H_{n}(k), H_{n+1}(k))}{\sum_{k=0}^{N-1} H_{n}(k)}

sim(n+1, n+2) = \frac{\sum_{k=0}^{N-1} \min(H_{n+1}(k), H_{n+2}(k))}{\sum_{k=0}^{N-1} H_{n+1}(k)}

4. The news video cataloging method according to claim 2, characterized in that:

The host frame detection is carried out on the extracted key frame set; face detection is used to filter the extracted key frames, and the key frames containing a face are selected to form a new face key frame set; after visual features are extracted from certain regions of each face key frame, a local feature point detection algorithm detects local feature points in a specified region of the face key frame; taking one key frame as the reference, the local feature points in the other key frames are matched against it, and the groups of key frames that match a sufficient number of local feature points are taken as candidate host key frame groups; before local feature point matching is applied to two face key frames, color histogram similarity is used to judge whether the two face key frames could be similar, and if histogram matching finds the two key frames dissimilar, local feature point matching is not performed on them; based on the temporal distribution of host key frames over the whole program, if the time span of a group of key frames in the video exceeds a certain threshold, the group is regarded as a candidate group of host key frames, and otherwise it is regarded as not being host key frames and is discarded; finally, the candidate key frame groups containing only one host and the key frame groups containing two hosts are considered together to decide which key frames are host key frames.

5. A news video cataloging system, characterized in that it comprises:

A segmentation module, comprising an audio/video data separation unit whose output is connected to the input of an audio/video feature fusion unit; the audio/video data separation unit receives the news video stream and separates it into audio data and video data, which it outputs; the audio/video data separation unit further comprises: an audio data sub-unit, having an opening theme music matching part and a mute point detection part, the two parts being connected in parallel; and a video data sub-unit, having a host frame detection part, a title bar frame detection part and a caption text recognition part, the three parts being connected in parallel; the audio/video feature fusion unit receives the audio data and video data and generates and outputs the segmentation result from them; the audio/video feature fusion unit determines the news item segmentation time points by the following steps: step 31: the host frames and text information frames are merged in chronological order into a mixed sequence M; step 32: using the two classes of time points in the mixed sequence M, host and text information, together with the information in the mute point sequence V, the time points at which the news is segmented are determined; the news segmentation time points are determined using Rule 1 and Rule 2, or Rule 1 and Rule 3, the rules being: Rule 1: a text information frame represents one news item, and the start time of that news item is at or before the point where the text information frame appears; Rule 2: in the mixed sequence M, if the element immediately preceding the current text information frame is a host key frame, the current text information frame and that host frame belong to the same news item, and the host shot is the lead-in shot of that item; the mute point in the mute point sequence V that precedes this host frame and is nearest to it is taken as the start time of the current news item; Rule 3: in the mixed sequence M, if the element immediately preceding the current text information frame is also a text information frame, the two text information frames belong to different news items; the mute point in the mute point sequence V that precedes the current text information frame and is nearest to it is taken as the start time of the current news item;

6. The news video cataloging system according to claim 5, characterized in that the browsing module comprises a text headline browsing sub-unit and a key frame image browsing sub-unit connected in parallel, which present the news cataloging result to the user in different forms.

7. The news video cataloging system according to claim 5, characterized in that the correction module comprises a news item splitting or merging sub-unit, a news item time point correction sub-unit and a news item text information correction sub-unit connected in parallel, each of which corrects, from a different angle, problems that may arise during automatic news cataloging.