CN103730111B - Method for cutting audio and video signal segments by speaker identification - Google Patents
- Publication number
- CN103730111B CN103730111B CN201410001020.0A CN201410001020A CN103730111B CN 103730111 B CN103730111 B CN 103730111B CN 201410001020 A CN201410001020 A CN 201410001020A CN 103730111 B CN103730111 B CN 103730111B
- Authority
- CN
- China
- Prior art keywords
- segment
- video signal
- speaker
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention relates to a method for cutting audio and video segments using speaker recognition. A non-specific speaker model is trained in real time from the incrementally growing source audio of an unknown speaker, and the speaker-recognition result is used to determine audio-video segments. The method comprises the following steps: (1) training a non-specific speaker model in real time; (2) determining non-specific-speaker segments of the source audio according to the speaker model; (3) updating the speaker model with the source-audio non-specific-speaker segments. The invention provides a real-time progressive training method for the speaker model: it immediately acquires the characteristic audio of an unspecified speaker and quickly learns a robust speaker audio model, solving the problem that speaker audio cannot be obtained for real-time training and overcoming the lack of sufficient training samples. Exploiting this real-time training, the trained speaker model can detect an unspecified speaker and the corresponding audio-video segments, improving the practicality of speaker detection technology.
Description
Technical field
The present invention relates to audio-video segmentation technology, and in particular to a method for cutting audio-video signal segments using speaker recognition.
Background art
Video content now comes from abundant and increasingly diverse sources, and quickly extracting the important content from this large and varied supply has long been a topic of growing interest to viewers. In general, most video content on computer networks consists of manually cut clips, which more easily satisfy users' demand for specific content. For processing large volumes of video, automatic audio-video segmentation technology is therefore increasingly important.
Most existing automatic audio-video segmentation techniques work on the video signal, detecting specific image frames for analysis and classification and then segmenting the audio-video stream. A method that detects a news anchor's image and audio features to extract television news segments is disclosed in Taiwan invention patent publication No. I283375. As shown in Fig. 1, it comprises the following steps: scan the pixels of an image frame with a first horizontal scan line; judge whether the pixel colors fall within a predetermined color range; use the pixels lying on this scan line across a plurality of consecutive frames to produce a color map; if the color map indicates that a stable pixel region appears in a predetermined number of consecutive frames and its pixels all fall within the predetermined color range, mark the current image section as a candidate; and compare the stable pixel region against a chromatographic color curve to detect shot transitions. The audio signal of the candidate section can then be analyzed to verify it. Because this method analyzes the color distribution of image frames along scan lines and relies on pixel regions as the basis for segmentation, its precision degrades below expectation when the frames of the film vary.
Another approach to automatic segmentation cuts the film from the audio signal. United States Patent No. US7181393B2 discloses a method for real-time speaker change detection and speaker tracking. As shown in Fig. 2, the method has two stages. In the pre-segmentation process, the distance between two adjacent segments is computed to roughly decide whether a speaker change point is possible; if not, the segment's data is added to the existing speaker model to update it. If a change is possible, a refinement process is performed: additional audio features are added, mixture probabilities are computed, and a probabilistic decision mechanism reconfirms whether the point is a speaker change point. Because this method computes the distances of multiple audio features between adjacent segments, the required computation is huge, increasing the difficulty of implementation.
Summary of the invention
The present invention provides a method for cutting audio-video segments using speaker recognition: audio segments are cut according to the speaker's audio, and each audio segment is mapped onto the audio-video signal to produce an audio-video segment. By training the speaker model in real time, the invention avoids the inconvenience of conventional speaker recognition, which must collect a speaker's audio in advance to train the speaker's acoustic model; training the speaker model from the same audio source as the audio-video signal greatly simplifies the cumbersome model-training process. The invention proposes a real-time progressive training method for the speaker model, which immediately acquires the characteristic audio of an unspecified speaker and quickly learns a robust speaker audio model, solving the problem that speaker audio cannot be obtained for real-time training while overcoming the lack of sufficient training samples. The proposed progressive training does not wait for the speaker's characteristic audio to be completely collected; it cuts audio segments with the speaker model in real time, eliminating the system delay incurred by first collecting the complete speaker audio. Whereas earlier methods had to train a specific speaker in advance and could detect audio-video segments only with that specific speaker model, the invention trains the speaker model on the fly and can exploit this real-time training to detect an unspecified speaker and the corresponding audio-video segments, improving the practicality of speaker detection technology. Real-time training also removes the acoustic background mismatch introduced by conventionally pre-trained speaker models, improving recognition accuracy. Moreover, by cutting audio-video segments according to the recognized speaker audio, the invention overcomes the drawbacks of conventional segmentation, which must cut segments offline and applies only to on-demand films, and can cut live audio-video segments of television channels.
The method of the invention trains a non-specific speaker model in real time from the incrementally growing source audio of an unknown speaker and uses the speaker-recognition result to determine audio-video segments. An audio-video segment may be the segment corresponding to a recurring speaker, or the audio-video range spanned between the start time points of the segments corresponding to the recurring speaker. The method includes, but is not limited to, cutting news-type films. The speaker model used to determine the segments may be a model trained in real time for a speaker who recurs in the segments, such as a news-anchor model. The method comprises the following steps: (1) train a non-specific speaker model in real time; (2) determine non-specific-speaker segments of the source audio according to the speaker model; (3) update the speaker model with the source-audio non-specific-speaker segments. In step (1), the model is trained in real time by capturing a fixed-length span of speaker audio from the source audio. In step (2), the source-audio non-specific-speaker segment is longer than the audio used to train the speaker model, and determining the segments comprises computing the similarity between the source audio and the speaker model, then selecting the segments whose similarity exceeds a threshold.
A method of the invention for cutting audio-video segments using speaker recognition trains a non-specific speaker model in real time from the incrementally growing source audio of an unknown speaker, and uses the speaker-recognition result to determine audio-video segments.
An audio-video segment is the segment corresponding to a recurring speaker, or the audio-video range spanned between the start time points of the segments corresponding to the recurring speaker.
The audio-video segment content includes news-type films.
The speaker model may be a news-anchor model.
A method for cutting audio-video segments proceeds as follows:
A. train a non-specific speaker model in real time;
B. determine non-specific-speaker segments of the source audio according to the speaker model; and
C. update the speaker model with the source-audio non-specific-speaker segments.
In step A, the non-specific speaker model is trained in real time from a fixed-length span of speaker audio captured from the source audio.
In step B, the source-audio non-specific-speaker segment is longer than the audio used to train the speaker model.
Step B comprises the following steps:
D. compute the similarity between the source audio and the speaker model; and
E. select the segments whose similarity exceeds a threshold.
In step D, computing the similarity comprises computing, according to the speaker model, the probability of the source audio under the speaker model.
In step E, the threshold is raised as the amount of speaker audio increases.
A method of the invention may further comprise the following step:
pre-train a mixture model;
in which case determining the source-audio non-specific-speaker segments according to the speaker model comprises the following steps:
F. compute the similarity of the source audio to the speaker model relative to the mixture model; and
G. select the segments whose similarity exceeds a threshold.
The mixture model is pre-trained by capturing mixed audio of arbitrary length from non-source audio and training the mixture model on that mixed audio.
The mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films.
In step F, the relative similarity is computed, according to the speaker model and the mixture model, as the similarity of the source audio to the speaker model and to the mixture model respectively, subtracting the latter from the former.
A method of the invention may further comprise the following steps:
pre-train a mixture model;
update the mixture model;
in which case determining the source-audio non-specific-speaker segments according to the speaker model comprises the following steps:
H. compute the similarity of the source audio to the speaker model relative to the mixture model; and
I. select the segments whose similarity exceeds a threshold.
Updating the mixture model combines the mixed audio between the start time points of the two most recently cut segments with the mixed audio captured from non-source audio, and trains the mixture model on the combined mixed audio.
A method of the invention may further comprise the following steps:
decompose the audio-video signal;
locate speaker audio by audio features;
map the audio segments onto the audio-video signal; and
play the audio-video segments.
Decomposing the audio-video signal divides it into source audio and source video.
The audio features used to locate speaker audio include recurring cue tones, keywords, and music.
Mapping an audio segment onto the audio-video signal maps the segment's start time code and end time code onto the audio-video signal to produce the audio-video segment.
Playing an audio-video segment references the audio segment's start time code and end time code.
Brief description of the drawings
Fig. 1 is a block diagram of the prior art;
Fig. 2 is a flow chart of the prior art;
Fig. 3 is a schematic diagram of the incrementally growing unknown-speaker source audio of the present invention;
Fig. 4 is a flow chart of an embodiment of the steps of the segment-cutting method of the present invention;
Fig. 5 is a flow chart of an embodiment of the further steps of the method of the present invention;
Fig. 6 is a schematic diagram of how the non-specific-speaker segments of the present invention are determined;
Fig. 7 is a block diagram of the device of the first embodiment of the present invention;
Fig. 8 is a flow chart of the second embodiment of the present invention;
Fig. 9 is a flow chart of the third embodiment of the present invention;
Fig. 10 is a flow chart of the fourth embodiment of the present invention;
Fig. 11 is a flow chart of the fifth embodiment of the present invention;
Fig. 12 is an architecture diagram of the sixth embodiment of the present invention.
Description of reference numerals
301~303 audio schematic diagrams
401~403 step flows
4021~4022 step flows
601~603 audio schematic diagrams
701 speaker audio model training unit
702~704 speaker audio segment recognition units
705~706 speaker audio model update units
707~709 time-delay units
801~804 step flows
8031~8032 step flows
901~905 step flows
9031~9032 step flows
1001~1007 step flows
1101~1106 step flows
11041~11043 step flows
1201 segment editing server
1202 time-code provisioning server
1203 segment information storage device
1204 streaming server
1205 audio-video storage device
1206 multimedia set-top box
Detailed description of the invention
To make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments.
The method of the present invention trains a non-specific speaker model in real time from the incrementally growing source audio of an unknown speaker and uses the speaker-recognition result to determine audio-video segments. As shown in Fig. 3, the unknown-speaker source audio grows gradually over time: the audio length of schematic 302 exceeds that of schematic 301, and the audio length of schematic 303 exceeds that of schematic 302. The checked block in schematic 301 represents the non-specific-speaker segment determined by the first round of speaker recognition, and the non-specific speaker model is trained in real time on this first segment. The checked blocks in schematic 302 represent the two non-specific-speaker segments determined by the second round of recognition, which uses the model trained in the first round; the model is then retrained in real time on these two segments. The checked blocks in schematic 303 represent the three non-specific-speaker segments determined by the third round, which uses the model trained in the second round; the model is retrained on these three segments. The non-specific-speaker segments thus accumulate gradually as the unknown-speaker source audio grows and recognition is repeated. An audio-video segment may be the segment corresponding to the recurring non-specific speaker, or the audio-video range spanned between the start time points of the segments corresponding to that speaker. The method includes, but is not limited to, cutting news-type films. The speaker model used to determine the segments may be a model trained in real time for a speaker who recurs in the segments, such as a news-anchor model.
The implementation steps of the segment-cutting method of the present invention are shown in Fig. 4: train a non-specific speaker model in real time (401); determine non-specific-speaker segments of the source audio according to the speaker model (402); and update the speaker model with the source-audio non-specific-speaker segments (403). In step 401, a fixed-length span of speaker audio is captured from the source audio, and this speaker audio is read and trained into the speaker audio model. The speaker model may be a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM); the fixed-length audio guarantees that enough speaker-relevant information is provided.
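As an illustration of step 401, the following is a minimal sketch, assuming `librosa` and `scikit-learn` are available, MFCC features, and a GMM speaker model (one of the two model types named above); the window length and mixture order are illustrative choices, not values from the patent.

```python
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(source_audio, sr, start_s=0.0, window_s=10.0, n_components=16):
    """Step 401 (sketch): train a non-specific speaker GMM from a
    fixed-length span of source audio. The 10-second window and the
    16 mixture components are placeholder values."""
    span = source_audio[int(start_s * sr):int((start_s + window_s) * sr)]
    # 13-dimensional MFCC frames serve as the GMM's observations.
    feats = librosa.feature.mfcc(y=span, sr=sr, n_mfcc=13).T  # (n_frames, 13)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(feats)
    return gmm
```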
In step 402, the length of the source-audio non-specific-speaker segment exceeds the audio length used to train the speaker model. As shown in Fig. 5, step 402 further comprises computing the similarity between the source audio and the speaker model (4021) and selecting the segments whose similarity exceeds a threshold (4022). Computing the similarity includes, but is not limited to, computing, according to the speaker model, the probability of the source audio under that model. The threshold of step 4022 can be a manually chosen value; its magnitude affects both the time range and the accuracy of the selected audio-video segments, and the larger the threshold, the smaller the selected time range.
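Steps 4021 and 4022 might look as follows: a hedged sketch that scores fixed one-second windows under the model from the previous sketch and keeps those above a manually chosen threshold (the threshold value shown is a placeholder, not taken from the patent).

```python
import librosa

def select_speaker_segments(source_audio, sr, gmm, hop_s=1.0, threshold=-45.0):
    """Steps 4021/4022 (sketch): score windows of source audio against
    the speaker model and keep those above the threshold. A larger
    threshold selects a smaller time range, as the description notes."""
    hop = int(hop_s * sr)
    segments = []
    for start in range(0, len(source_audio) - hop + 1, hop):
        window = source_audio[start:start + hop]
        feats = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=13).T
        score = gmm.score(feats)  # mean per-frame log-likelihood (step 4021)
        if score > threshold:     # step 4022
            segments.append((start / sr, (start + hop) / sr, score))
    return segments  # [(start_s, end_s, similarity), ...]
```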
In step 403, the speaker audio of the non-specific-speaker segments is read and trained into the speaker model. Steps 402 and 403 can be repeated in sequence: the more repetitions, the more speaker audio accumulates, and the threshold of step 4022 can be raised as the amount of speaker audio increases. At the same time, the more speaker audio is available, the better the trained model captures the speaker's vocal characteristics, so the accuracy of the determined audio-video segments improves accordingly.
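The repetition of steps 402 and 403 can be pictured as the loop below, reusing the two sketches above; the threshold schedule is an assumption, since the description states only that the threshold rises as speaker audio accumulates.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def progressive_training(stream_chunks, sr):
    """Alternate step 402 (select segments) and step 403 (retrain the
    speaker model) as unknown-speaker source audio arrives incrementally.
    The +1.0 threshold increment per round is illustrative."""
    audio = np.zeros(0, dtype=np.float32)
    gmm, threshold = None, -50.0
    for chunk in stream_chunks:
        audio = np.concatenate([audio, chunk])
        if gmm is None:                        # step 401: first fixed span
            gmm = train_speaker_gmm(audio, sr)
            continue
        segments = select_speaker_segments(audio, sr, gmm, threshold=threshold)
        if segments:                           # step 403: retrain on all hits
            feats = np.vstack([
                librosa.feature.mfcc(
                    y=audio[int(s * sr):int(e * sr)], sr=sr, n_mfcc=13).T
                for s, e, _ in segments])
            gmm = GaussianMixture(n_components=16,
                                  covariance_type="diag").fit(feats)
            threshold += 1.0                   # stricter as audio accumulates
    return gmm
```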
How the non-specific-speaker segments are determined is shown in Fig. 6. The source audio grows gradually over time: the audio length of schematic 602 exceeds that of schematic 601, and the audio length of schematic 603 exceeds that of schematic 602. Schematic 601 shows the non-specific-speaker segment determined by the first execution of step 402: the diagonally hatched block is the audio range whose similarity exceeds the threshold, this range is selected as the non-specific-speaker segment, and step 403 is executed to read the segment's audio and train the non-specific speaker model. Schematic 602 shows the two non-specific-speaker segments determined by the second execution of step 402: the hatched blocks are the audio ranges whose similarity exceeds the threshold, both ranges are selected, and step 403 reads the audio of both segments and retrains the non-specific speaker model; the threshold here may differ from the one chosen previously. Schematic 603 shows the three non-specific-speaker segments determined by the third execution of step 402: the hatched blocks are selected, and step 403 reads the audio of all three segments and retrains the model; the threshold may differ from the two chosen before. As the unknown-speaker source audio grows, steps 402 and 403 can be executed repeatedly, gradually accumulating non-specific-speaker segments, training the speaker model in real time, and using the speaker-recognition result to determine the audio-video segments.
The device of the first embodiment of the present invention is shown in Fig. 7. A speaker audio model training unit 701 performs the real-time training of the non-specific speaker model (401); speaker audio segment recognition units 702~704 perform the determination of source-audio non-specific-speaker segments according to the speaker model (402); speaker audio model update units 705~706 perform the update of the speaker model with those segments (403); and 707~709 are time-delay units. The training unit 701 captures a fixed-length span of speaker audio from the source audio and reads it to train the speaker audio model. Recognition unit 702 performs step 402, in which the source-audio segment is longer than the audio used to train the speaker model: it receives the source audio, delayed through a time-delay unit, computes the similarity between the source audio and the speaker model, and selects the portions whose similarity exceeds the threshold as the source-audio non-specific-speaker segments; the similarity computation includes, but is not limited to, computing the probability of the source audio under the speaker model. The selected segments are fed to update unit 705 and can simultaneously serve as output segments; recognition unit 703 and update unit 706 behave likewise. Update unit 705 reads the speaker audio of the segments output by recognition unit 702 and trains a new speaker model, which is fed to recognition unit 703 as the reference for the next determination of non-specific-speaker segments; update unit 706 and recognition unit 704 behave likewise. The more audio is used to train the speaker model, the better the trained model captures the speaker's vocal characteristics, and the accuracy of the determined audio-video segments improves accordingly.
The implementation steps of the second embodiment of the present invention are shown in Fig. 8: pre-train a mixture model (801); train a non-specific speaker model in real time (802); determine non-specific-speaker segments of the source audio according to the speaker model (803); and update the speaker model with the source-audio non-specific-speaker segments (804). In step 801, mixed audio of arbitrary length is captured from non-source audio and read to train the mixture model; the mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films. In step 802, a fixed-length span of speaker audio is captured from the source audio and read to train the speaker audio model, which may be a Gaussian Mixture Model or a Hidden Markov Model; the fixed-length audio guarantees that enough speaker-relevant information is provided. Step 803 comprises computing the similarity of the source audio to the speaker model relative to the mixture model (8031) and selecting the segments whose similarity exceeds a threshold (8032). Computing the relative similarity includes, but is not limited to, computing, according to the speaker model and the mixture model, the similarity of the source audio to each model respectively and subtracting the latter from the former, as in equation (1):

S(i) = S_a(i) - S_m(i) ...... (1)

where S(i) is the similarity of the source audio at time point i to the speaker model relative to the mixture model, S_a(i) is the similarity of the source audio at time point i to the speaker model, and S_m(i) is the similarity of the source audio at time point i to the mixture model. The similarity to the speaker model comprises the log-probability of the source audio under the speaker model, and the similarity to the mixture model comprises the log-probability of the source audio under the mixture model; expressed with probability values, the relative similarity can therefore also be written as equation (2):

S(i) = exp(log P_a(i) - log P_m(i)) ...... (2)

where P_a(i) is the probability of the source audio at time point i under the speaker model and P_m(i) is the probability of the source audio at time point i under the mixture model. The threshold of step 8032 can be a manually chosen value; its magnitude affects both the time range and the accuracy of the selected audio-video segments, and the larger the threshold, the smaller the selected time range. In step 804, the speaker audio of the non-specific-speaker segments is read and trained into the speaker model. Steps 803 and 804 can be repeated in sequence: the more repetitions, the more speaker audio accumulates, and the threshold of step 8032 can be raised as the amount of speaker audio increases; at the same time, the more speaker audio is available, the better the trained model captures the speaker's vocal characteristics, and the accuracy of the determined audio-video segments improves accordingly.
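To make equations (1) and (2) concrete, the sketch below scores a window of source audio under both models, assuming the speaker and mixture models are scikit-learn GMMs as in the earlier sketches; `score` returns a mean log-likelihood, so the difference is equation (1) in log form, and `exp()` of the result gives equation (2).

```python
import librosa

def relative_similarity(window, sr, speaker_gmm, mixture_gmm):
    """Equation (1): S(i) = S_a(i) - S_m(i), with both similarities
    taken as mean log-probabilities of the window's MFCC frames."""
    feats = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=13).T
    s_a = speaker_gmm.score(feats)   # log P_a(i): speaker-model likelihood
    s_m = mixture_gmm.score(feats)   # log P_m(i): mixture-model likelihood
    return s_a - s_m                 # above threshold => speaker segment
```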
The implementation steps of the third embodiment of the present invention are shown in Fig. 9: pre-train a mixture model (901); train a non-specific speaker model in real time (902); determine non-specific-speaker segments of the source audio according to the speaker model (903); update the mixture model (904); and update the speaker model with the source-audio non-specific-speaker segments (905). For steps 901, 902, and 903, refer to steps 801, 802, and 803 of Fig. 8. In step 904, the mixed audio lying between the start time points of the two most recently cut segments is combined with the mixed audio captured in step 901, and the mixture model is trained on the combined mixed audio; the mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films. For step 905, refer to step 804 of Fig. 8.
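A sketch of step 904 under the same assumptions as the earlier code; `nonsource_feats` stands for the MFCC features of the pre-collected non-source mixed audio, and all names and parameter values are illustrative.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def update_mixture_model(audio, sr, last_two_starts, nonsource_feats):
    """Step 904 (sketch): retrain the mixture model on the non-source
    mixed audio plus the audio lying between the start time points of
    the two most recently cut segments."""
    s0, s1 = sorted(last_two_starts)             # segment start times, seconds
    between = audio[int(s0 * sr):int(s1 * sr)]
    feats = np.vstack([nonsource_feats,
                       librosa.feature.mfcc(y=between, sr=sr, n_mfcc=13).T])
    return GaussianMixture(n_components=32, covariance_type="diag").fit(feats)
```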
The implementation steps of the fourth embodiment of the present invention are shown in Fig. 10: decompose the audio-video signal (1001); locate speaker audio by audio features (1002); train a non-specific speaker model in real time (1003); determine non-specific-speaker segments of the source audio according to the speaker model (1004); update the speaker model with those segments (1005); map the audio segments onto the audio-video signal (1006); and play the audio-video segments (1007). Step 1001 divides the audio-video signal into source audio and source video: the source audio contains only the sound and speech signals, while the source video contains only the image signal. Step 1002 locates the time points of speaker audio through audio features that appear at fixed positions in most audio-video signals; such features include recurring cue tones, keywords, and music. For steps 1003, 1004, and 1005, refer to steps 401, 402, and 403 of Fig. 4. Step 1006 maps the start time code and end time code of each audio segment onto the audio-video signal to produce the audio-video segment; the mapping may use the absolute time recorded in the audio-video signal, or the relative time measured from the start of the signal. Step 1007 plays the audio-video segment content that step 1006 associated with the audio-video signal.
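Steps 1006 and 1007 reduce to simple time-code bookkeeping. In the sketch below the mapping uses relative time from the signal's start (one of the two options mentioned above), and `player` with its `play_range` method is a hypothetical interface, not a real API.

```python
def map_segment_to_av(segment, av_start_s=0.0):
    """Step 1006 (sketch): map an audio segment's start/end time codes
    onto the audio-video signal as time relative to the signal's start."""
    start_s, end_s, _ = segment
    return {"start_tc": av_start_s + start_s, "end_tc": av_start_s + end_s}

def play_av_segment(player, clip):
    """Step 1007 (sketch): play the audio-video range named by the
    segment's start and end time codes."""
    player.play_range(clip["start_tc"], clip["end_tc"])
```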
The implementation steps of the fifth embodiment of the present invention are shown in Fig. 11: decompose the audio-video signal (1101); pre-train a mixture model (1102); locate speaker audio by audio features (1103); determine and obtain all source-audio non-specific-speaker segments (1104); map the audio segments onto the audio-video signal (1105); and play the audio-video segments (1106). Step 1101 divides the audio-video signal into source audio and source video: the source audio contains only the sound and speech signals, while the source video contains only the image signal. Step 1102 captures mixed audio of arbitrary length from non-source audio and reads it to train the mixture model; the mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films. Step 1103 locates the time points of speaker audio through audio features that appear at fixed positions in most audio-video signals, such as recurring cue tones, keywords, and music. Step 1104 comprises training a non-specific speaker model in real time (11041), determining non-specific-speaker segments of the source audio according to the speaker model (11042), and updating the speaker model with those segments (11043); for these, refer to steps 802, 803, and 804 of Fig. 8. For steps 1105 and 1106, refer to steps 1006 and 1007 of Fig. 10.
The system architecture of the sixth embodiment of the present invention is shown in Fig. 12. The system comprises a segment editing server 1201, a time-code provisioning server 1202, a segment information storage device 1203, a streaming server 1204, and an audio-video storage device 1205. The segment editing server 1201 decomposes the audio-video signal to capture the source audio, determines and obtains all source-audio non-specific-speaker segments, and stores the start and end time codes of every segment in the segment information storage device 1203; determining and obtaining the segments executes the real-time training of the non-specific speaker model (401), the determination of non-specific-speaker segments according to the speaker model (402), and the update of the speaker model (403). Given a selected audio-video segment, the time-code provisioning server 1202 looks up the segment in the segment information storage device 1203 and obtains its start and end time codes. A multimedia set-top box 1206 establishes a connection to the time-code provisioning server 1202 over a computer network and sends it a request to play an audio-video segment; after the server 1202 has obtained the segment's start and end time codes, the segment is delivered. In one delivery mode, the time-code provisioning server 1202 notifies the streaming server 1204 of the segment's start and end time codes, the streaming server transmits the segment stored in the audio-video storage device 1205 to the multimedia set-top box 1206, and the set-top box plays the segment on receipt. In another delivery mode, the time-code provisioning server 1202 transmits the start and end time codes to the multimedia set-top box 1206, which asks the streaming server 1204 to transmit the segment stored in the audio-video storage device 1205 and plays it on receipt.
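The two delivery modes can be pictured with the sketch below, in which `segment_store`, `streaming_server`, and `set_top_box` are hypothetical stand-ins for the Fig. 12 components 1203, 1204, and 1206; none of these interfaces come from the patent.

```python
def handle_play_request(clip_id, segment_store, streaming_server, set_top_box):
    """Time-code provisioning server 1202 (sketch): look up the clip's
    start/end time codes in the segment information store 1203, then
    start delivery by one of the two modes described above."""
    start_tc, end_tc = segment_store.lookup(clip_id)   # written by server 1201
    # Delivery mode 1: tell the streaming server 1204 to push the clip
    # from the audio-video storage 1205 to the set-top box 1206.
    streaming_server.stream_range(set_top_box.address, start_tc, end_tc)
    # Delivery mode 2 (alternative): return the codes so the set-top box
    # can request the range from the streaming server itself.
    return {"start": start_tc, "end": end_tc}
```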
The above are only preferred embodiments of the present invention and are not intended to limit its scope of practice; any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the protection scope of the present patent.
Claims (18)
1. the method for the cutting sound video signal fragment that the person that utilizes language identifies, it is characterised in that be to come with incremental unknown language person
Source message trains nonspecific speaker model immediately, and the result that the person that utilizes language identifies determines sound video signal fragment, and step is as follows:
A. the nonspecific speaker model of instant training;The instant nonspecific speaker model of training is fixing by capturing one section in the message of source
Language person's audible signals of time span;
B. determine, according to this speaker model, message nonspecific language person's fragment of originating;And
C. according to source message nonspecific language person fragment update speaker model.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that sound video signal
Fragment is the sound video signal fragment corresponding to language person repeated, also sound video signal fragment corresponding to the language person that repeats
The sound video signal scope contained between start time point.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that sound video signal
Segment contents comprises news type film.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that language person's mould
Type is news main broadcaster's model.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that step B
Source message nonspecific language person's fragment length more than train this speaker model message length.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that step B
Comprise the steps of
D. the similarity of source message and speaker model is calculated;And
E. the similarity fragment more than marginal value is chosen;Marginal value improves numerical value with the increase of language person's audible signals quantity.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 6 identifies, it is characterised in that step D
The similarity calculating source message and speaker model, comprise according to speaker model, calculate source message similar in appearance to speaker model
Probit value.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that also comprise
The following step:
Precondition mixing model;
Wherein step determines, according to this speaker model, message nonspecific language person's fragment of originating, and comprises the steps of
F. source message and the speaker model similarity compared to mixed model is calculated;And
G. the similarity fragment more than marginal value is chosen.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 8 identifies, it is characterised in that instruct in advance
Practicing mixing model is by the mixing audible signals capturing random time length in non-sourcing message, and reads mixing audible signals instruction
Practice for mixed model.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 9 identifies, it is characterised in that mixing
The content of audible signals comprises in plural number name language person's audible signals, musical sound, advertisement audible signals and news type film interviews
The audible signals of picture.
The method of the cutting sound video signal fragment that 11. persons that utilize language according to claim 8 identify, it is characterised in that step F
Calculate source message and the speaker model similarity compared to mixed model, comprise according to speaker model and mixing model, point
Message of Ji Suan not originating and the similarity of speaker model and the message similarity with mixed model of originating, and subtract with the former similarity
Go the latter's similarity.
The method of the cutting sound video signal fragment that 12. persons that utilize language according to claim 1 identify, it is characterised in that also wrap
Containing the following step:
Precondition mixing model;
Update mixed model;
Wherein step determines, according to this speaker model, message nonspecific language person's fragment of originating, and comprises the steps of
H. source message and the speaker model similarity compared to mixed model is calculated;And
I. the similarity fragment more than marginal value is chosen.
The method of the cutting sound video signal fragment that 13. persons that utilize language according to claim 12 identify, it is characterised in that update
Mixed model is that the mixing audible signals between the start time point combining two cutting fragments the most captures with by non-sourcing message
Mixing audible signals, mixing audible signals is trained for mixed model.
The method of the cutting sound video signal fragment that 14. persons that utilize language according to claim 1 identify, it is characterised in that also wrap
Containing the following step:
Decompose sound video signal;
Language person's audible signals is found by audio characteristic;
By corresponding for message fragment to sound video signal;And
Play sound video signal fragment.
The method of the cutting sound video signal fragment that 15. persons that utilize language according to claim 14 identify, it is characterised in that step
Decompose sound video signal for sound video signal being divided into source message and source video signal.
The method of the cutting sound video signal fragment that 16. persons that utilize language according to claim 14 identify, it is characterised in that step
Prompt tone, key word and the musical sound that the audio characteristic of language person's audible signals comprises fixing appearance is found by audio characteristic.
The method of the cutting sound video signal fragment that 17. persons that utilize language according to claim 14 identify, it is characterised in that step
It is by the most corresponding with end time code division for the initial time code of message fragment by corresponding for the message fragment mode to sound video signal
To sound video signal, produce sound video signal fragment.
The method of the cutting sound video signal fragment that 18. persons that utilize language according to claim 14 identify, it is characterised in that step
The mode playing sound video signal fragment is to play sound video signal fragment with reference to message fragment initial time code with end time code.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
TW102129235A (TWI518675B) | 2013-08-15 | 2013-08-15 | A method for segmenting videos and audios into clips using speaker recognition
TW102129235 | | |
Publications (2)
Publication Number | Publication Date |
---|---
CN103730111A (en) | 2014-04-16
CN103730111B (en) | 2016-11-30
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---
CN101539929A (en) * | 2009-04-17 | 2009-09-23 | 无锡天脉聚源传媒科技有限公司 | Method for indexing TV news by utilizing computer system |
CN102024455A (en) * | 2009-09-10 | 2011-04-20 | 索尼株式会社 | Speaker recognition system and method |
CN102194455A (en) * | 2010-03-17 | 2011-09-21 | 博石金(北京)信息技术有限公司 | Voiceprint identification method irrelevant to speak content |
CN102760434A (en) * | 2012-07-09 | 2012-10-31 | 华为终端有限公司 | Method for updating voiceprint feature model and terminal |
CN103226951A (en) * | 2013-04-19 | 2013-07-31 | 清华大学 | Speaker verification system creation method based on model sequence adaptive technique |
Non-Patent Citations (2)
* M. Nishida and Y. Ariki, "Speaker indexing for news articles, debates and drama in broadcasted TV programs", IEEE Conference Publications, Jul. 1999, vol. 2, pp. 466-471.
* TingYao Wu et al., "UBM-based real-time speaker segmentation for broadcasting news", IEEE Conference Publications, Aug. 2003, vol. 2, p. II-193.
Similar Documents
Publication | Title
---|---
CN108769723B (en) | Method, device, equipment and storage medium for pushing high-quality content in live video
WO2019228267A1 (en) | Short video synthesis method and apparatus, and device and storage medium
CN110198432B (en) | Video data processing method and device, computer readable medium and electronic equipment
US8566880B2 (en) | Device and method for providing a television sequence using database and user inputs
US11388480B2 (en) | Information processing apparatus, information processing method, and program
CN109218746A (en) | Method, apparatus and storage medium for obtaining video clips
CN103488764A (en) | Personalized video content recommendation method and system
CN103797482A (en) | Methods and systems for performing comparisons of received data and providing follow-on service based on the comparisons
US10893321B2 (en) | System and method for detecting and classifying direct response advertisements using fingerprints
CN104598541A (en) | Identification method and device for multimedia file
CN1820511A (en) | Method and device for generating and detecting a fingerprint functioning as a trigger marker in a multimedia signal
CN105788610A (en) | Audio processing method and device
CN104135671A (en) | Television video content interactive question and answer method
CN105530523B (en) | Service implementation method and equipment
TWI518675B (en) | A method for segmenting videos and audios into clips using speaker recognition
CN109688430A (en) | Court trial file playback method, system and storage medium
CN104065978A (en) | Method for positioning media content and system thereof
CN117319765A (en) | Video processing method, device, computing equipment and computer storage medium
CN103730111B (en) | Method for cutting audio and video signal segments by speaker identification
CN116781856A (en) | Audio-visual conversion control method, system and storage medium based on deep learning
CN107968942B (en) | Method and system for measuring audio and video time difference of live broadcast platform
US9741345B2 (en) | Method for segmenting videos and audios into clips using speaker recognition
KR101693381B1 (en) | Advertisement apparatus for recognizing video and method for providing advertisement contents in advertisement apparatus
JP4507351B2 (en) | Signal processing apparatus and method
CN112055260A (en) | Short video-based commodity display system and method
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant