CN103730111B - Method for cutting audio and video signal segments by speaker identification - Google Patents
- Publication number
- CN103730111B CN103730111B CN201410001020.0A CN201410001020A CN103730111B CN 103730111 B CN103730111 B CN 103730111B CN 201410001020 A CN201410001020 A CN 201410001020A CN 103730111 B CN103730111 B CN 103730111B
- Authority
- CN
- China
- Prior art keywords
- segment
- video signal
- speaker
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention relates to a method for cutting audio and video segments using speaker recognition. A non-specific speaker model is trained in real time from the incrementally growing source audio of an unknown speaker, and the speaker-recognition result is used to determine audio-video segments. The method comprises the following steps: (1) training a non-specific speaker model in real time; (2) determining non-specific-speaker segments of the source audio according to the speaker model; (3) updating the speaker model with the source-audio non-specific-speaker segments. The invention provides a real-time progressive training method for the speaker model: it immediately acquires the characteristic audio of an unspecified speaker and quickly learns a robust speaker audio model, solving the problem that speaker audio cannot be obtained for real-time training and overcoming the lack of sufficient training samples. Exploiting this real-time training, the trained speaker model can detect an unspecified speaker and the corresponding audio-video segments, improving the practicality of speaker detection technology.
Description
Technical field
The present invention relates to audio-video segmentation technology, and in particular to a method for cutting audio-video signal segments using speaker recognition.
Background art
Video content now comes from abundant and increasingly diverse sources, and quickly extracting the important content from this large and varied supply has long been a topic of growing interest to viewers. In general, most video content on computer networks consists of manually cut clips, which more easily satisfy users' demand for specific content. For processing large volumes of video, automatic audio-video segmentation technology is therefore increasingly important.
Most existing automatic audio-video segmentation techniques work on the video signal, detecting specific image frames for analysis and classification and then segmenting the audio-video stream. A method that detects a news anchor's image and audio features to extract television news segments is disclosed in Taiwan invention patent publication No. I283375. As shown in Fig. 1, it comprises the following steps: scan the pixels of an image frame with a first horizontal scan line; judge whether the pixel colors fall within a predetermined color range; use the pixels lying on this scan line across a plurality of consecutive frames to produce a color map; if the color map indicates that a stable pixel region appears in a predetermined number of consecutive frames and its pixels all fall within the predetermined color range, mark the current image section as a candidate; and compare the stable pixel region against a chromatographic color curve to detect shot transitions. The audio signal of the candidate section can then be analyzed to verify it. Because this method analyzes the color distribution of image frames along scan lines and relies on pixel regions as the basis for segmentation, its precision degrades below expectation when the frames of the film vary.
Another approach to automatic segmentation cuts the film from the audio signal. United States Patent No. US7181393B2 discloses a method for real-time speaker change detection and speaker tracking. As shown in Fig. 2, the method has two stages. In the pre-segmentation process, the distance between two adjacent segments is computed to roughly decide whether a speaker change point is possible; if not, the segment's data is added to the existing speaker model to update it. If a change is possible, a refinement process is performed: additional audio features are added, mixture probabilities are computed, and a probabilistic decision mechanism reconfirms whether the point is a speaker change point. Because this method computes the distances of multiple audio features between adjacent segments, the required computation is huge, increasing the difficulty of implementation.
Summary of the invention
The present invention provides a method for cutting audio-video segments using speaker recognition: audio segments are cut according to the speaker's audio, and each audio segment is mapped onto the audio-video signal to produce an audio-video segment. By training the speaker model in real time, the invention avoids the inconvenience of conventional speaker recognition, which must collect a speaker's audio in advance to train the speaker's acoustic model; training the speaker model from the same audio source as the audio-video signal greatly simplifies the cumbersome model-training process. The invention proposes a real-time progressive training method for the speaker model, which immediately acquires the characteristic audio of an unspecified speaker and quickly learns a robust speaker audio model, solving the problem that speaker audio cannot be obtained for real-time training while overcoming the lack of sufficient training samples. The proposed progressive training does not wait for the speaker's characteristic audio to be completely collected; it cuts audio segments with the speaker model in real time, eliminating the system delay incurred by first collecting the complete speaker audio. Whereas earlier methods had to train a specific speaker in advance and could detect audio-video segments only with that specific speaker model, the invention trains the speaker model on the fly and can exploit this real-time training to detect an unspecified speaker and the corresponding audio-video segments, improving the practicality of speaker detection technology. Real-time training also removes the acoustic background mismatch introduced by conventionally pre-trained speaker models, improving recognition accuracy. Moreover, by cutting audio-video segments according to the recognized speaker audio, the invention overcomes the drawbacks of conventional segmentation, which must cut segments offline and applies only to on-demand films, and can cut live audio-video segments of television channels.
The method of the invention trains a non-specific speaker model in real time from the incrementally growing source audio of an unknown speaker and uses the speaker-recognition result to determine audio-video segments. An audio-video segment may be the segment corresponding to a recurring speaker, or the audio-video range spanned between the start time points of the segments corresponding to the recurring speaker. The method includes, but is not limited to, cutting news-type films. The speaker model used to determine the segments may be a model trained in real time for a speaker who recurs in the segments, such as a news-anchor model. The method comprises the following steps: (1) train a non-specific speaker model in real time; (2) determine non-specific-speaker segments of the source audio according to the speaker model; (3) update the speaker model with the source-audio non-specific-speaker segments. In step (1), the model is trained in real time by capturing a fixed-length span of speaker audio from the source audio. In step (2), the source-audio non-specific-speaker segment is longer than the audio used to train the speaker model, and determining the segments comprises computing the similarity between the source audio and the speaker model, then selecting the segments whose similarity exceeds a threshold.
A method of the invention for cutting audio-video segments using speaker recognition trains a non-specific speaker model in real time from the incrementally growing source audio of an unknown speaker, and uses the speaker-recognition result to determine audio-video segments.
An audio-video segment is the segment corresponding to a recurring speaker, or the audio-video range spanned between the start time points of the segments corresponding to the recurring speaker.
The audio-video segment content includes news-type films.
The speaker model may be a news-anchor model.
A method for cutting audio-video segments proceeds as follows:
A. train a non-specific speaker model in real time;
B. determine non-specific-speaker segments of the source audio according to the speaker model; and
C. update the speaker model with the source-audio non-specific-speaker segments.
In step A, the non-specific speaker model is trained in real time from a fixed-length span of speaker audio captured from the source audio.
In step B, the source-audio non-specific-speaker segment is longer than the audio used to train the speaker model.
Step B comprises the following steps:
D. compute the similarity between the source audio and the speaker model; and
E. select the segments whose similarity exceeds a threshold.
In step D, computing the similarity comprises computing, according to the speaker model, the probability of the source audio under the speaker model.
In step E, the threshold is raised as the amount of speaker audio increases.
A method of the invention may further comprise the following step:
pre-train a mixture model;
in which case determining the source-audio non-specific-speaker segments according to the speaker model comprises the following steps:
F. compute the similarity of the source audio to the speaker model relative to the mixture model; and
G. select the segments whose similarity exceeds a threshold.
The mixture model is pre-trained by capturing mixed audio of arbitrary length from non-source audio and training the mixture model on that mixed audio.
The mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films.
In step F, the relative similarity is computed, according to the speaker model and the mixture model, as the similarity of the source audio to the speaker model and to the mixture model respectively, subtracting the latter from the former.
A method of the invention may further comprise the following steps:
pre-train a mixture model;
update the mixture model;
in which case determining the source-audio non-specific-speaker segments according to the speaker model comprises the following steps:
H. compute the similarity of the source audio to the speaker model relative to the mixture model; and
I. select the segments whose similarity exceeds a threshold.
Updating the mixture model combines the mixed audio between the start time points of the two most recently cut segments with the mixed audio captured from non-source audio, and trains the mixture model on the combined mixed audio.
A method of the invention may further comprise the following steps:
decompose the audio-video signal;
locate speaker audio by audio features;
map the audio segments onto the audio-video signal; and
play the audio-video segments.
Decomposing the audio-video signal divides it into source audio and source video.
The audio features used to locate speaker audio include recurring cue tones, keywords, and music.
Mapping an audio segment onto the audio-video signal maps the segment's start time code and end time code onto the audio-video signal to produce the audio-video segment.
Playing an audio-video segment references the audio segment's start time code and end time code.
Brief description of the drawings
Fig. 1 is a block diagram of the prior art;
Fig. 2 is a flow chart of the prior art;
Fig. 3 is a schematic diagram of the incrementally growing unknown-speaker source audio of the present invention;
Fig. 4 is a flow chart of an embodiment of the steps of the segment-cutting method of the present invention;
Fig. 5 is a flow chart of an embodiment of the further steps of the method of the present invention;
Fig. 6 is a schematic diagram of how the non-specific-speaker segments of the present invention are determined;
Fig. 7 is a block diagram of the device of the first embodiment of the present invention;
Fig. 8 is a flow chart of the second embodiment of the present invention;
Fig. 9 is a flow chart of the third embodiment of the present invention;
Fig. 10 is a flow chart of the fourth embodiment of the present invention;
Fig. 11 is a flow chart of the fifth embodiment of the present invention;
Fig. 12 is an architecture diagram of the sixth embodiment of the present invention.
Description of reference numerals
301~303 audio schematic diagrams
401~403 step flows
4021~4022 step flows
601~603 audio schematic diagrams
701 speaker audio model training unit
702~704 speaker audio segment recognition units
705~706 speaker audio model update units
707~709 time-delay units
801~804 step flows
8031~8032 step flows
901~905 step flows
9031~9032 step flows
1001~1007 step flows
1101~1106 step flows
11041~11043 step flows
1201 segment editing server
1202 time-code provisioning server
1203 segment information storage device
1204 streaming server
1205 audio-video storage device
1206 multimedia set-top box
Detailed description of the invention
To make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments.
The method of the present invention trains a non-specific speaker model in real time from the incrementally growing source audio of an unknown speaker and uses the speaker-recognition result to determine audio-video segments. As shown in Fig. 3, the unknown-speaker source audio grows gradually over time: the audio length of schematic 302 exceeds that of schematic 301, and the audio length of schematic 303 exceeds that of schematic 302. The checked block in schematic 301 represents the non-specific-speaker segment determined by the first round of speaker recognition, and the non-specific speaker model is trained in real time on this first segment. The checked blocks in schematic 302 represent the two non-specific-speaker segments determined by the second round of recognition, which uses the model trained in the first round; the model is then retrained in real time on these two segments. The checked blocks in schematic 303 represent the three non-specific-speaker segments determined by the third round, which uses the model trained in the second round; the model is retrained on these three segments. The non-specific-speaker segments thus accumulate gradually as the unknown-speaker source audio grows and recognition is repeated. An audio-video segment may be the segment corresponding to the recurring non-specific speaker, or the audio-video range spanned between the start time points of the segments corresponding to that speaker. The method includes, but is not limited to, cutting news-type films. The speaker model used to determine the segments may be a model trained in real time for a speaker who recurs in the segments, such as a news-anchor model.
The implementation steps of the segment-cutting method of the present invention are shown in Fig. 4: train a non-specific speaker model in real time (401); determine non-specific-speaker segments of the source audio according to the speaker model (402); and update the speaker model with the source-audio non-specific-speaker segments (403). In step 401, a fixed-length span of speaker audio is captured from the source audio, and this speaker audio is read and trained into the speaker audio model. The speaker model may be a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM); the fixed-length audio guarantees that enough speaker-relevant information is provided.
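As an illustration of step 401, the following is a minimal sketch, assuming `librosa` and `scikit-learn` are available, MFCC features, and a GMM speaker model (one of the two model types named above); the window length and mixture order are illustrative choices, not values from the patent.

```python
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(source_audio, sr, start_s=0.0, window_s=10.0, n_components=16):
    """Step 401 (sketch): train a non-specific speaker GMM from a
    fixed-length span of source audio. The 10-second window and the
    16 mixture components are placeholder values."""
    span = source_audio[int(start_s * sr):int((start_s + window_s) * sr)]
    # 13-dimensional MFCC frames serve as the GMM's observations.
    feats = librosa.feature.mfcc(y=span, sr=sr, n_mfcc=13).T  # (n_frames, 13)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(feats)
    return gmm
```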
In step 402, the length of the source-audio non-specific-speaker segment exceeds the audio length used to train the speaker model. As shown in Fig. 5, step 402 further comprises computing the similarity between the source audio and the speaker model (4021) and selecting the segments whose similarity exceeds a threshold (4022). Computing the similarity includes, but is not limited to, computing, according to the speaker model, the probability of the source audio under that model. The threshold of step 4022 can be a manually chosen value; its magnitude affects both the time range and the accuracy of the selected audio-video segments, and the larger the threshold, the smaller the selected time range.
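Steps 4021 and 4022 might look as follows: a hedged sketch that scores fixed one-second windows under the model from the previous sketch and keeps those above a manually chosen threshold (the threshold value shown is a placeholder, not taken from the patent).

```python
import librosa

def select_speaker_segments(source_audio, sr, gmm, hop_s=1.0, threshold=-45.0):
    """Steps 4021/4022 (sketch): score windows of source audio against
    the speaker model and keep those above the threshold. A larger
    threshold selects a smaller time range, as the description notes."""
    hop = int(hop_s * sr)
    segments = []
    for start in range(0, len(source_audio) - hop + 1, hop):
        window = source_audio[start:start + hop]
        feats = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=13).T
        score = gmm.score(feats)  # mean per-frame log-likelihood (step 4021)
        if score > threshold:     # step 4022
            segments.append((start / sr, (start + hop) / sr, score))
    return segments  # [(start_s, end_s, similarity), ...]
```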
In step 403, the speaker audio of the non-specific-speaker segments is read and trained into the speaker model. Steps 402 and 403 can be repeated in sequence: the more repetitions, the more speaker audio accumulates, and the threshold of step 4022 can be raised as the amount of speaker audio increases. At the same time, the more speaker audio is available, the better the trained model captures the speaker's vocal characteristics, so the accuracy of the determined audio-video segments improves accordingly.
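The repetition of steps 402 and 403 can be pictured as the loop below, reusing the two sketches above; the threshold schedule is an assumption, since the description states only that the threshold rises as speaker audio accumulates.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def progressive_training(stream_chunks, sr):
    """Alternate step 402 (select segments) and step 403 (retrain the
    speaker model) as unknown-speaker source audio arrives incrementally.
    The +1.0 threshold increment per round is illustrative."""
    audio = np.zeros(0, dtype=np.float32)
    gmm, threshold = None, -50.0
    for chunk in stream_chunks:
        audio = np.concatenate([audio, chunk])
        if gmm is None:                        # step 401: first fixed span
            gmm = train_speaker_gmm(audio, sr)
            continue
        segments = select_speaker_segments(audio, sr, gmm, threshold=threshold)
        if segments:                           # step 403: retrain on all hits
            feats = np.vstack([
                librosa.feature.mfcc(
                    y=audio[int(s * sr):int(e * sr)], sr=sr, n_mfcc=13).T
                for s, e, _ in segments])
            gmm = GaussianMixture(n_components=16,
                                  covariance_type="diag").fit(feats)
            threshold += 1.0                   # stricter as audio accumulates
    return gmm
```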
How the non-specific-speaker segments are determined is shown in Fig. 6. The source audio grows gradually over time: the audio length of schematic 602 exceeds that of schematic 601, and the audio length of schematic 603 exceeds that of schematic 602. Schematic 601 shows the non-specific-speaker segment determined by the first execution of step 402: the diagonally hatched block is the audio range whose similarity exceeds the threshold, this range is selected as the non-specific-speaker segment, and step 403 is executed to read the segment's audio and train the non-specific speaker model. Schematic 602 shows the two non-specific-speaker segments determined by the second execution of step 402: the hatched blocks are the audio ranges whose similarity exceeds the threshold, both ranges are selected, and step 403 reads the audio of both segments and retrains the non-specific speaker model; the threshold here may differ from the one chosen previously. Schematic 603 shows the three non-specific-speaker segments determined by the third execution of step 402: the hatched blocks are selected, and step 403 reads the audio of all three segments and retrains the model; the threshold may differ from the two chosen before. As the unknown-speaker source audio grows, steps 402 and 403 can be executed repeatedly, gradually accumulating non-specific-speaker segments, training the speaker model in real time, and using the speaker-recognition result to determine the audio-video segments.
The device of the first embodiment of the present invention is shown in Fig. 7. A speaker audio model training unit 701 performs the real-time training of the non-specific speaker model (401); speaker audio segment recognition units 702~704 perform the determination of source-audio non-specific-speaker segments according to the speaker model (402); speaker audio model update units 705~706 perform the update of the speaker model with those segments (403); and 707~709 are time-delay units. The training unit 701 captures a fixed-length span of speaker audio from the source audio and reads it to train the speaker audio model. Recognition unit 702 performs step 402, in which the source-audio segment is longer than the audio used to train the speaker model: it receives the source audio, delayed through a time-delay unit, computes the similarity between the source audio and the speaker model, and selects the portions whose similarity exceeds the threshold as the source-audio non-specific-speaker segments; the similarity computation includes, but is not limited to, computing the probability of the source audio under the speaker model. The selected segments are fed to update unit 705 and can simultaneously serve as output segments; recognition unit 703 and update unit 706 behave likewise. Update unit 705 reads the speaker audio of the segments output by recognition unit 702 and trains a new speaker model, which is fed to recognition unit 703 as the reference for the next determination of non-specific-speaker segments; update unit 706 and recognition unit 704 behave likewise. The more audio is used to train the speaker model, the better the trained model captures the speaker's vocal characteristics, and the accuracy of the determined audio-video segments improves accordingly.
The implementation steps of the second embodiment of the present invention are shown in Fig. 8: pre-train a mixture model (801); train a non-specific speaker model in real time (802); determine non-specific-speaker segments of the source audio according to the speaker model (803); and update the speaker model with the source-audio non-specific-speaker segments (804). In step 801, mixed audio of arbitrary length is captured from non-source audio and read to train the mixture model; the mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films. In step 802, a fixed-length span of speaker audio is captured from the source audio and read to train the speaker audio model, which may be a Gaussian Mixture Model or a Hidden Markov Model; the fixed-length audio guarantees that enough speaker-relevant information is provided. Step 803 comprises computing the similarity of the source audio to the speaker model relative to the mixture model (8031) and selecting the segments whose similarity exceeds a threshold (8032). Computing the relative similarity includes, but is not limited to, computing, according to the speaker model and the mixture model, the similarity of the source audio to each model respectively and subtracting the latter from the former, as in equation (1):

S(i) = S_a(i) - S_m(i) ...... (1)

where S(i) is the similarity of the source audio at time point i to the speaker model relative to the mixture model, S_a(i) is the similarity of the source audio at time point i to the speaker model, and S_m(i) is the similarity of the source audio at time point i to the mixture model. The similarity to the speaker model comprises the log-probability of the source audio under the speaker model, and the similarity to the mixture model comprises the log-probability of the source audio under the mixture model; expressed with probability values, the relative similarity can therefore also be written as equation (2):

S(i) = exp(log P_a(i) - log P_m(i)) ...... (2)

where P_a(i) is the probability of the source audio at time point i under the speaker model and P_m(i) is the probability of the source audio at time point i under the mixture model. The threshold of step 8032 can be a manually chosen value; its magnitude affects both the time range and the accuracy of the selected audio-video segments, and the larger the threshold, the smaller the selected time range. In step 804, the speaker audio of the non-specific-speaker segments is read and trained into the speaker model. Steps 803 and 804 can be repeated in sequence: the more repetitions, the more speaker audio accumulates, and the threshold of step 8032 can be raised as the amount of speaker audio increases; at the same time, the more speaker audio is available, the better the trained model captures the speaker's vocal characteristics, and the accuracy of the determined audio-video segments improves accordingly.
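To make equations (1) and (2) concrete, the sketch below scores a window of source audio under both models, assuming the speaker and mixture models are scikit-learn GMMs as in the earlier sketches; `score` returns a mean log-likelihood, so the difference is equation (1) in log form, and `exp()` of the result gives equation (2).

```python
import librosa

def relative_similarity(window, sr, speaker_gmm, mixture_gmm):
    """Equation (1): S(i) = S_a(i) - S_m(i), with both similarities
    taken as mean log-probabilities of the window's MFCC frames."""
    feats = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=13).T
    s_a = speaker_gmm.score(feats)   # log P_a(i): speaker-model likelihood
    s_m = mixture_gmm.score(feats)   # log P_m(i): mixture-model likelihood
    return s_a - s_m                 # above threshold => speaker segment
```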
The implementation steps of the third embodiment of the present invention are shown in Fig. 9: pre-train a mixture model (901); train a non-specific speaker model in real time (902); determine non-specific-speaker segments of the source audio according to the speaker model (903); update the mixture model (904); and update the speaker model with the source-audio non-specific-speaker segments (905). For steps 901, 902, and 903, refer to steps 801, 802, and 803 of Fig. 8. In step 904, the mixed audio lying between the start time points of the two most recently cut segments is combined with the mixed audio captured in step 901, and the mixture model is trained on the combined mixed audio; the mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films. For step 905, refer to step 804 of Fig. 8.
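A sketch of step 904 under the same assumptions as the earlier code; `nonsource_feats` stands for the MFCC features of the pre-collected non-source mixed audio, and all names and parameter values are illustrative.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def update_mixture_model(audio, sr, last_two_starts, nonsource_feats):
    """Step 904 (sketch): retrain the mixture model on the non-source
    mixed audio plus the audio lying between the start time points of
    the two most recently cut segments."""
    s0, s1 = sorted(last_two_starts)             # segment start times, seconds
    between = audio[int(s0 * sr):int(s1 * sr)]
    feats = np.vstack([nonsource_feats,
                       librosa.feature.mfcc(y=between, sr=sr, n_mfcc=13).T])
    return GaussianMixture(n_components=32, covariance_type="diag").fit(feats)
```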
The implementation steps of the fourth embodiment of the present invention are shown in Fig. 10: decompose the audio-video signal (1001); locate speaker audio by audio features (1002); train a non-specific speaker model in real time (1003); determine non-specific-speaker segments of the source audio according to the speaker model (1004); update the speaker model with those segments (1005); map the audio segments onto the audio-video signal (1006); and play the audio-video segments (1007). Step 1001 divides the audio-video signal into source audio and source video: the source audio contains only the sound and speech signals, while the source video contains only the image signal. Step 1002 locates the time points of speaker audio through audio features that appear at fixed positions in most audio-video signals; such features include recurring cue tones, keywords, and music. For steps 1003, 1004, and 1005, refer to steps 401, 402, and 403 of Fig. 4. Step 1006 maps the start time code and end time code of each audio segment onto the audio-video signal to produce the audio-video segment; the mapping may use the absolute time recorded in the audio-video signal, or the relative time measured from the start of the signal. Step 1007 plays the audio-video segment content that step 1006 associated with the audio-video signal.
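Steps 1006 and 1007 reduce to simple time-code bookkeeping. In the sketch below the mapping uses relative time from the signal's start (one of the two options mentioned above), and `player` with its `play_range` method is a hypothetical interface, not a real API.

```python
def map_segment_to_av(segment, av_start_s=0.0):
    """Step 1006 (sketch): map an audio segment's start/end time codes
    onto the audio-video signal as time relative to the signal's start."""
    start_s, end_s, _ = segment
    return {"start_tc": av_start_s + start_s, "end_tc": av_start_s + end_s}

def play_av_segment(player, clip):
    """Step 1007 (sketch): play the audio-video range named by the
    segment's start and end time codes."""
    player.play_range(clip["start_tc"], clip["end_tc"])
```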
The implementation steps of the fifth embodiment of the present invention are shown in Fig. 11: decompose the audio-video signal (1101); pre-train a mixture model (1102); locate speaker audio by audio features (1103); determine and obtain all source-audio non-specific-speaker segments (1104); map the audio segments onto the audio-video signal (1105); and play the audio-video segments (1106). Step 1101 divides the audio-video signal into source audio and source video: the source audio contains only the sound and speech signals, while the source video contains only the image signal. Step 1102 captures mixed audio of arbitrary length from non-source audio and reads it to train the mixture model; the mixed audio may contain the audio of several speakers, music, advertisement audio, and the audio of interview scenes in news-type films. Step 1103 locates the time points of speaker audio through audio features that appear at fixed positions in most audio-video signals, such as recurring cue tones, keywords, and music. Step 1104 comprises training a non-specific speaker model in real time (11041), determining non-specific-speaker segments of the source audio according to the speaker model (11042), and updating the speaker model with those segments (11043); for these, refer to steps 802, 803, and 804 of Fig. 8. For steps 1105 and 1106, refer to steps 1006 and 1007 of Fig. 10.
The system architecture of the sixth embodiment of the present invention is shown in Fig. 12. The system comprises a segment editing server 1201, a time-code provisioning server 1202, a segment information storage device 1203, a streaming server 1204, and an audio-video storage device 1205. The segment editing server 1201 decomposes the audio-video signal to capture the source audio, determines and obtains all source-audio non-specific-speaker segments, and stores the start and end time codes of every segment in the segment information storage device 1203; determining and obtaining the segments executes the real-time training of the non-specific speaker model (401), the determination of non-specific-speaker segments according to the speaker model (402), and the update of the speaker model (403). Given a selected audio-video segment, the time-code provisioning server 1202 looks up the segment in the segment information storage device 1203 and obtains its start and end time codes. A multimedia set-top box 1206 establishes a connection to the time-code provisioning server 1202 over a computer network and sends it a request to play an audio-video segment; after the server 1202 has obtained the segment's start and end time codes, the segment is delivered. In one delivery mode, the time-code provisioning server 1202 notifies the streaming server 1204 of the segment's start and end time codes, the streaming server transmits the segment stored in the audio-video storage device 1205 to the multimedia set-top box 1206, and the set-top box plays the segment on receipt. In another delivery mode, the time-code provisioning server 1202 transmits the start and end time codes to the multimedia set-top box 1206, which asks the streaming server 1204 to transmit the segment stored in the audio-video storage device 1205 and plays it on receipt.
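The two delivery modes can be pictured with the sketch below, in which `segment_store`, `streaming_server`, and `set_top_box` are hypothetical stand-ins for the Fig. 12 components 1203, 1204, and 1206; none of these interfaces come from the patent.

```python
def handle_play_request(clip_id, segment_store, streaming_server, set_top_box):
    """Time-code provisioning server 1202 (sketch): look up the clip's
    start/end time codes in the segment information store 1203, then
    start delivery by one of the two modes described above."""
    start_tc, end_tc = segment_store.lookup(clip_id)   # written by server 1201
    # Delivery mode 1: tell the streaming server 1204 to push the clip
    # from the audio-video storage 1205 to the set-top box 1206.
    streaming_server.stream_range(set_top_box.address, start_tc, end_tc)
    # Delivery mode 2 (alternative): return the codes so the set-top box
    # can request the range from the streaming server itself.
    return {"start": start_tc, "end": end_tc}
```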
The above are only preferred embodiments of the present invention and are not intended to limit its scope of practice; any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the protection scope of the present patent.
Claims (18)
1. the method for the cutting sound video signal fragment that the person that utilizes language identifies, it is characterised in that be to come with incremental unknown language person
Source message trains nonspecific speaker model immediately, and the result that the person that utilizes language identifies determines sound video signal fragment, and step is as follows:
A. the nonspecific speaker model of instant training;The instant nonspecific speaker model of training is fixing by capturing one section in the message of source
Language person's audible signals of time span;
B. determine, according to this speaker model, message nonspecific language person's fragment of originating;And
C. according to source message nonspecific language person fragment update speaker model.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that sound video signal
Fragment is the sound video signal fragment corresponding to language person repeated, also sound video signal fragment corresponding to the language person that repeats
The sound video signal scope contained between start time point.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that sound video signal
Segment contents comprises news type film.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that language person's mould
Type is news main broadcaster's model.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that step B
Source message nonspecific language person's fragment length more than train this speaker model message length.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that step B
Comprise the steps of
D. the similarity of source message and speaker model is calculated;And
E. the similarity fragment more than marginal value is chosen;Marginal value improves numerical value with the increase of language person's audible signals quantity.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 6 identifies, it is characterised in that step D
The similarity calculating source message and speaker model, comprise according to speaker model, calculate source message similar in appearance to speaker model
Probit value.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 1 identifies, it is characterised in that also comprise
The following step:
Precondition mixing model;
Wherein step determines, according to this speaker model, message nonspecific language person's fragment of originating, and comprises the steps of
F. source message and the speaker model similarity compared to mixed model is calculated;And
G. the similarity fragment more than marginal value is chosen.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 8 identifies, it is characterised in that instruct in advance
Practicing mixing model is by the mixing audible signals capturing random time length in non-sourcing message, and reads mixing audible signals instruction
Practice for mixed model.
The method of the cutting sound video signal fragment that the person that utilizes language the most according to claim 9 identifies, it is characterised in that mixing
The content of audible signals comprises in plural number name language person's audible signals, musical sound, advertisement audible signals and news type film interviews
The audible signals of picture.
The method of the cutting sound video signal fragment that 11. persons that utilize language according to claim 8 identify, it is characterised in that step F
Calculate source message and the speaker model similarity compared to mixed model, comprise according to speaker model and mixing model, point
Message of Ji Suan not originating and the similarity of speaker model and the message similarity with mixed model of originating, and subtract with the former similarity
Go the latter's similarity.
The method of the cutting sound video signal fragment that 12. persons that utilize language according to claim 1 identify, it is characterised in that also wrap
Containing the following step:
Precondition mixing model;
Update mixed model;
Wherein step determines, according to this speaker model, message nonspecific language person's fragment of originating, and comprises the steps of
H. source message and the speaker model similarity compared to mixed model is calculated;And
I. the similarity fragment more than marginal value is chosen.
The method of the cutting sound video signal fragment that 13. persons that utilize language according to claim 12 identify, it is characterised in that update
Mixed model is that the mixing audible signals between the start time point combining two cutting fragments the most captures with by non-sourcing message
Mixing audible signals, mixing audible signals is trained for mixed model.
The method of the cutting sound video signal fragment that 14. persons that utilize language according to claim 1 identify, it is characterised in that also wrap
Containing the following step:
Decompose sound video signal;
Language person's audible signals is found by audio characteristic;
By corresponding for message fragment to sound video signal;And
Play sound video signal fragment.
The method of the cutting sound video signal fragment that 15. persons that utilize language according to claim 14 identify, it is characterised in that step
Decompose sound video signal for sound video signal being divided into source message and source video signal.
The method of the cutting sound video signal fragment that 16. persons that utilize language according to claim 14 identify, it is characterised in that step
Prompt tone, key word and the musical sound that the audio characteristic of language person's audible signals comprises fixing appearance is found by audio characteristic.
The method of the cutting sound video signal fragment that 17. persons that utilize language according to claim 14 identify, it is characterised in that step
It is by the most corresponding with end time code division for the initial time code of message fragment by corresponding for the message fragment mode to sound video signal
To sound video signal, produce sound video signal fragment.
The method of the cutting sound video signal fragment that 18. persons that utilize language according to claim 14 identify, it is characterised in that step
The mode playing sound video signal fragment is to play sound video signal fragment with reference to message fragment initial time code with end time code.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
TW102129235A (TWI518675B) | 2013-08-15 | 2013-08-15 | A method for segmenting videos and audios into clips using speaker recognition
TW102129235 | | |
Publications (2)
Publication Number | Publication Date |
---|---
CN103730111A (en) | 2014-04-16
CN103730111B (en) | 2016-11-30
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---
CN101539929A (en) * | 2009-04-17 | 2009-09-23 | 无锡天脉聚源传媒科技有限公司 | Method for indexing TV news by utilizing computer system |
CN102024455A (en) * | 2009-09-10 | 2011-04-20 | 索尼株式会社 | Speaker recognition system and method |
CN102194455A (en) * | 2010-03-17 | 2011-09-21 | 博石金(北京)信息技术有限公司 | Voiceprint identification method irrelevant to speak content |
CN102760434A (en) * | 2012-07-09 | 2012-10-31 | 华为终端有限公司 | Method for updating voiceprint feature model and terminal |
CN103226951A (en) * | 2013-04-19 | 2013-07-31 | 清华大学 | Speaker verification system creation method based on model sequence adaptive technique |
Non-Patent Citations (2)
* M. Nishida and Y. Ariki, "Speaker indexing for news articles, debates and drama in broadcasted TV programs", IEEE Conference Publications, Jul. 1999, vol. 2, pp. 466-471.
* TingYao Wu et al., "UBM-based real-time speaker segmentation for broadcasting news", IEEE Conference Publications, Aug. 2003, vol. 2, p. II-193.
Similar Documents
Publication | Title
---|---
CN108769723B (en) | Method, device, equipment and storage medium for pushing high-quality content in live video
WO2019228267A1 (en) | Short video synthesis method and apparatus, and device and storage medium
CN110198432B (en) | Video data processing method and device, computer readable medium and electronic equipment
US8566880B2 (en) | Device and method for providing a television sequence using database and user inputs
US11388480B2 (en) | Information processing apparatus, information processing method, and program
CN109218746A (en) | Method, apparatus and storage medium for obtaining video clips
CN103488764A (en) | Personalized video content recommendation method and system
CN103797482A (en) | Methods and systems for performing comparisons of received data and providing follow-on service based on the comparisons
US10893321B2 (en) | System and method for detecting and classifying direct response advertisements using fingerprints
CN104598541A (en) | Identification method and device for multimedia file
CN1820511A (en) | Method and device for generating and detecting a fingerprint functioning as a trigger marker in a multimedia signal
CN105788610A (en) | Audio processing method and device
CN104135671A (en) | Television video content interactive question and answer method
CN105530523B (en) | Service implementation method and equipment
TWI518675B (en) | A method for segmenting videos and audios into clips using speaker recognition
CN109688430A (en) | Court trial file playback method, system and storage medium
CN104065978A (en) | Method for positioning media content and system thereof
CN117319765A (en) | Video processing method, device, computing equipment and computer storage medium
CN103730111B (en) | Method for cutting audio and video signal segments by speaker identification
CN116781856A (en) | Audio-visual conversion control method, system and storage medium based on deep learning
CN107968942B (en) | Method and system for measuring audio and video time difference of live broadcast platform
US9741345B2 (en) | Method for segmenting videos and audios into clips using speaker recognition
KR101693381B1 (en) | Advertisement apparatus for recognizing video and method for providing advertisement contents in advertisement apparatus
JP4507351B2 (en) | Signal processing apparatus and method
CN112055260A (en) | Short video-based commodity display system and method
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant