CN102543080A - Audio editing system and audio editing method - Google Patents


Info

Publication number
CN102543080A
Authority
CN
China
Prior art keywords: audio, segmentation, clustering, audio stream, initial segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201010614165XA
Other languages
Chinese (zh)
Other versions
CN102543080B (en)
Inventor
卢鲤
赵庆卫
颜永红
刘昆
吴伟国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Sony Corp
Original Assignee
Institute of Acoustics CAS
Sony Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Sony Corp filed Critical Institute of Acoustics CAS
Priority to CN201010614165.XA priority Critical patent/CN102543080B/en
Priority claimed from CN201010614165.XA external-priority patent/CN102543080B/en
Publication of CN102543080A publication Critical patent/CN102543080A/en
Application granted granted Critical
Publication of CN102543080B publication Critical patent/CN102543080B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Stereophonic System (AREA)

Abstract

The invention relates to an audio editing system comprising a plurality of initial segmentation devices, a multi-channel fusion device, an audio clustering device, and a re-segmentation device. The initial segmentation devices respectively segment the audio streams from a plurality of channels into a plurality of different paragraphs. The multi-channel fusion device integrates the cut points produced by the initial segmentation devices, selects the audio stream of the best channel between every two adjacent cut points to obtain a plurality of initially segmented fragments, and merges these fragments into a unified audio data file. The audio clustering device performs supervised clustering of the initially segmented fragments based on a hierarchical clustering algorithm, gathering fragments of the same nature into one class. The re-segmentation device trains a hidden Markov model corresponding to each class from the clustering result and performs Viterbi alignment segmentation of the unified audio file to obtain the re-segmented audio stream. This high-precision speaker segmentation system improves the accuracy of the final speaker clustering.

Description

Audio editing system and audio editing method
Technical field
The present invention relates to the field of audio clustering technology, and in particular to an audio editing system and an audio editing method.
Background technology
Speaker clustering is a concrete application of clustering techniques to speech signal processing. Its aim is to classify speech segments so that each class contains data from only one speaker and all of one speaker's data is gathered into the same class, thereby obtaining speaker-specific information. From an application standpoint, speaker clustering can be applied to audio information management, retrieval, and related fields. It enables speaker tracking in the audio streams of meetings, voice mailboxes, lectures, and news broadcasts, and thus supports the structured analysis, understanding, and management of audio data. In particular, clustering algorithms are of great practical value to speech recognition: almost all current automatic speech recognition systems employ adaptation techniques, and the quality of the clustering algorithm directly affects the performance of speaker adaptation.
For a speaker clustering system, the most critical step is speaker segmentation of the audio data; only when the segmentation is correct can the clustering performance of the back end be guaranteed. Two typical speaker segmentation techniques exist, corresponding to different system architectures. First, in step-by-step segmentation-then-clustering systems, represented by Non-patent Literature 1, the audio stream is first segmented by speaker through acoustic distance computation, and the resulting scattered speech segments are then merged to achieve clustering. Second, in synchronous segmentation-and-clustering systems, represented by Non-patent Literature 2, model-based methods are generally used and clustering is completed at the same time as segmentation. Each architecture has its strengths and weaknesses. The former inherits the errors of the segmentation step without correction during clustering; because distance computation has inherent limitations, errors accumulate. The latter mostly uses hidden Markov models (HMMs) whose initialization is performed by simply dividing the audio data into equal parts; the initial error so introduced is large, which slows model convergence, and because the HMM classifies frame by frame, unconstrained segmentation introduces further error. The usual remedy is to impose a minimum-duration constraint on the dwell time of each HMM, which greatly limits the flexibility of the system.
Non-patent Literature 1: Dan Istrate, Corinne Fredouille, Sylvain Meignier. NIST RT'05S evaluation: Pre-processing techniques and speaker diarization on multiple microphone meetings. Machine Learning for Multimodal Interaction, 2006.
Non-patent Literature 2: Fredouille, C. and Senay, G. Technical improvements of the E-HMM based speaker diarization system for meeting records. Machine Learning for Multimodal Interaction, 2006.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention proposes a speaker segmentation framework that combines three techniques: distance computation, speaker clustering, and model-based segmentation. The object of the invention is to obtain initial segmentation information for the audio stream through distance computation; to use speaker clustering to obtain speaker seed data classes and train a corresponding speaker HMM for each; and, at the same time, to use the number of fragments obtained by the initial segmentation to control the structure of the hidden-Markov-model search space, iteratively updating the models during segmentation and thereby improving speaker segmentation performance.
To achieve the above object, the invention provides an audio editing system comprising:
a reading device that reads an audio stream into said audio editing system;
an initial segmentation device that initially segments the audio stream read by said reading device into a plurality of different fragments;
an audio clustering device that performs supervised clustering of the fragments initially segmented by said initial segmentation device, based on a hierarchical clustering algorithm, gathering fragments of the same nature into one class; and
a re-segmentation device that uses the clustering result of said audio clustering device to train an HMM corresponding to each class, and performs Viterbi alignment segmentation of said audio stream to obtain the re-segmented audio stream.
In addition, the invention provides an audio editing system comprising:
a reading device that reads the audio streams of a plurality of channels into said audio editing system;
a plurality of initial segmentation devices that respectively segment the audio streams of the plurality of channels read by said reading device into a plurality of different paragraphs;
a multi-channel fusion device that integrates the cut points of said plurality of initial segmentation devices, selects the audio stream of the best channel between every two adjacent cut points to obtain a plurality of initially segmented fragments, and merges the fragments so obtained into a unified audio data file;
an audio clustering device that performs supervised clustering of said initially segmented fragments, based on a hierarchical clustering algorithm, gathering fragments of the same nature into one class; and
a re-segmentation device that uses the clustering result of said audio clustering device to train an HMM corresponding to each class, and performs Viterbi alignment segmentation of said unified audio file to obtain the re-segmented audio stream.
To achieve the above object, an audio editing method in an audio editing system comprises:
a reading step of reading an audio stream into said audio editing system;
an initial segmentation step of initially segmenting the audio stream read in said reading step into a plurality of different fragments;
an audio clustering step of performing supervised clustering of the fragments initially segmented in said initial segmentation step, based on a hierarchical clustering algorithm, gathering fragments of the same nature into one class; and
a re-segmentation step of using the clustering result of said audio clustering step to train an HMM corresponding to each class, and performing Viterbi alignment segmentation of said audio stream to obtain the re-segmented audio stream.
In addition, the invention provides an audio editing method in an audio editing system, comprising:
a reading step of reading the audio streams of a plurality of channels into said audio editing system;
a segmentation step of respectively segmenting the audio streams of said plurality of channels into a plurality of different paragraphs;
a multi-channel fusion step of integrating the cut points obtained in said segmentation step, selecting the audio stream of the best channel between every two adjacent cut points to obtain a plurality of initially segmented fragments, and merging the fragments so obtained into a unified audio data file;
an audio clustering step of performing supervised clustering of said initially segmented fragments, based on a hierarchical clustering algorithm, gathering fragments of the same nature into one class; and
a re-segmentation step of using the clustering result of said audio clustering step to train an HMM corresponding to each class, and performing Viterbi alignment segmentation of said unified audio file to obtain the re-segmented audio stream.
An advantage of the present invention is that the channel information of multiple microphones is fused to obtain more complete speaker segmentation information. At the same time, distance computation is used to find potential audio-property turning points in the audio stream and to obtain the number of potential audio paragraphs, and this number controls the number of iterations of the loop element in the HMM search space; compared with imposing duration constraints on the HMM, this provides much greater flexibility. Speaker clustering with supervised control of the clustering depth is used to obtain speaker seed data classes; compared with dividing the data equally for model initialization, the model error is smaller, convergence is faster, and segmentation is more accurate. Moreover, the invention combines the distance computation of step-by-step clustering with the model-based segmentation of synchronous clustering, and uses the test data itself to train and update the models, so data dependence is small; it can serve as a general technique applicable to speaker clustering systems.
Description of drawings
Fig. 1 is a block diagram of the speaker clustering system of the present invention;
Fig. 2 is a schematic diagram of the search space of the model segmentation part of the present invention;
Fig. 3 is a block diagram of the paragraph segmenter in the speaker clustering system of the present invention;
Fig. 4 is a schematic diagram of the processing flow of the distance-computation segmenter of the present invention;
Fig. 5 is a schematic flow diagram of the agglomerative hierarchical clustering algorithm;
Fig. 6 is a schematic diagram of the architecture of the multi-microphone channel fusion device of the present invention;
Fig. 7 is a schematic diagram of the channel segmentation information integration of the present invention;
Fig. 8 is a schematic flow diagram of the HMM re-segmenter of the present invention.
Embodiment
The present invention is described below with reference to specific embodiments and the accompanying drawings. The following description is divided into several embodiments for convenience of explanation, but each embodiment is illustrative, and those skilled in the art will appreciate various variations, modifications, substitutions, and combinations. Concrete numerical examples are used to promote understanding of the invention, but unless otherwise specified those values are examples and any suitable value may be used. Concrete mathematical expressions are likewise used to promote understanding, but unless otherwise specified they too are examples and any suitable expression may be used. The distinctions between the embodiments are not essential to the invention, and the items explained in each embodiment may be suitably combined. For ease of explanation, the devices of the embodiments are illustrated with functional block diagrams, but such devices may be realized in hardware, software, or a combination thereof. The present invention is not limited to the described embodiments and encompasses variations, modifications, substitutions, and combinations that do not depart from its spirit.
In addition, in the following description the invention is mainly described with the example of application to the field of speech segmentation, but those skilled in the art will appreciate that the invention can be applied to objects other than speech in similar audio fields and should not be limited to the scope of the enumerated embodiments. Accordingly, the audio editing system of the present invention is described below mainly as a speaker clustering system.
Fig. 1 is a block diagram of the speaker clustering system of the present invention. As shown in Fig. 1, the speaker clustering system 10 comprises a plurality of microphones 1 to N, a plurality of paragraph segmenters 101-1 to 101-N respectively corresponding to the microphones, a multi-microphone channel fusion device 102, a speech/non-speech discriminator 103, a speaker clustering device 104, a model re-segmenter 105, and a speaker clustering device 106. Each of the paragraph segmenters 101-1 to 101-N contains an audio feature extractor 1011-1 to 1011-N and a distance-computation segmenter 1012-1 to 1012-N. In the following, unless otherwise specified, where there are multiple devices with identical functions only one is described; the description thus refers to microphone 1, paragraph segmenter 101, audio feature extractor 1011, and distance-computation segmenter 1012.
The speaker clustering system 10 is now described in detail.
As a typical application, the speaker clustering system 10 may be installed in a meeting room. In that case, a plurality of microphones are placed at prescribed or arbitrary positions in the environment in which audio sampling is desired, the ambient sound is sampled simultaneously, and the parts of the audio data to be processed, recorded by the respective microphones, are sent to the subsequent processing stages. The speaker clustering system 10 of the present application need not include the microphones 1 to N: instead of the microphones, a reading device (not shown) may be provided that reads multi-channel audio data recorded in advance in the same manner and stored on a recording medium into the speaker clustering system of the present invention, whereupon the same processing is performed. The purpose of providing a plurality of microphones 1 to N and using multi-channel audio data is to select the best parts of the channels and integrate them, thereby improving the reliability and efficiency of subsequent audio processing. In other words, in a small space, or in consideration of factors such as cost, single-channel audio data may also be used in practice, as those skilled in the art will understand.
The audio streams sampled by the microphones 1 to N are input to the paragraph segmenters 101. In the paragraph segmenter 101, the audio stream from microphone 1 is initially segmented into a plurality of different paragraphs. Any algorithm or method may be adopted as long as the paragraph segmenter 101 can cut the audio stream into a plurality of different paragraphs; the present invention uses MFCC audio features and computes distances between sliding windows. First,
(1) the audio data recorded by each microphone is input to the audio feature extractor 1011 in the paragraph segmenter 101 for pre-processing, including frame division.
In the present embodiment, pre-processing mainly follows this flow:
1-1) High-frequency boost by pre-emphasis, using the filter $H(z) = 1 - \alpha z^{-1}$ with $\alpha = 0.98$;
1-2) Frame division: frame length 25 ms, with 15 ms overlap between frames, adjustable as required;
1-3) Windowing, using the common Hamming window function:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1,$$
where $n$ is the sample index and $N$ the number of samples per frame.
MFCCs (Mel-frequency cepstral coefficients) are then extracted (for the feature extraction method see Reynolds, D.A. and Rose, R.C., Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing, 1995). To increase feature robustness, cepstral mean and variance normalization is applied to the features.
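The pre-emphasis, framing, and windowing steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation; the MFCC extraction and cepstral normalization steps are omitted:

```python
import numpy as np

def preemphasis(x, alpha=0.98):
    """Apply the pre-emphasis filter H(z) = 1 - alpha * z^-1."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, sr, frame_ms=25, overlap_ms=15):
    """Split a signal into overlapping frames (25 ms frames, 15 ms overlap)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * (frame_ms - overlap_ms) / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
```

Each frame would then be multiplied by `hamming(frame_len)` before the MFCC computation.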
(2) The feature data corresponding to each channel's audio, produced by the audio feature extractor 1011, is input to the distance-computation segmenter 1012 connected after it, which looks for internal audio-property turning points, as shown in Fig. 4. In the present embodiment, the flow is mainly as follows:
2-1) In the segmentation process, 12-dimensional MFCC features are first extracted from the input audio signal with a frame length of 25 ms. The features are then windowed, each window being 50 frames long. Assuming the feature vectors in the two windows follow Gaussian distributions N(μ₁, Σ₁) and N(μ₂, Σ₂), the Bhattacharyya distance between the two windows is computed as in formula (1). This yields a series of inter-window distance points; the input audio file is then segmented, for example according to the change-point selection criterion proposed in Lu, L. and Zhang, H.J., Speaker change detection and tracking in real-time news broadcasting analysis, Proceedings of the tenth ACM International Conference on Multimedia, 2002.
$$d_{BHA} = \frac{1}{4}(\mu_1 - \mu_2)^T(\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2) + \frac{1}{2}\log\frac{|\Sigma_1 + \Sigma_2|}{2\sqrt{|\Sigma_1|\,|\Sigma_2|}} \qquad (1)$$
2-2) Next comes the merging process. Since the series of cut points obtained in the previous step contains a large number of false alarms, some merging is carried out using a BIC-based algorithm. For two audio fragments, the BIC criterion of their Gaussian distributions is computed (see Lu, L. and Zhang, H.J., op. cit.) to judge whether the two fragments can be fitted by a single Gaussian distribution; the result of the first segmentation step is merged iteratively, yielding the final segmentation result.
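Formula (1) can be sketched as the following hypothetical NumPy helper; the change-point selection and BIC-merging logic of Lu and Zhang is not reproduced, and the log term follows the normalization of formula (1) as written:

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Bhattacharyya distance of formula (1) between N(mu1, S1) and N(mu2, S2).
    mu1, mu2: mean vectors; S1, S2: covariance matrices."""
    d = mu1 - mu2
    S = S1 + S2
    term1 = 0.25 * d @ np.linalg.solve(S, d)       # (1/4) d^T (S1+S2)^-1 d
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         (2.0 * np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return term1 + term2
```

In the segmenter this distance would be evaluated between the Gaussians fitted to each pair of adjacent 50-frame windows as they slide over the feature stream.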
Through the above series of processing, the audio stream is initially divided into a plurality of paragraphs.
The above mainly takes the Bhattacharyya-distance method as an example, but the initial segmentation is not in fact limited to this method. Model-based classification (GMM, SVM, etc.), spectral clustering (see A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm), energy-based methods, and the like can also be enumerated. Any other method may be adopted as long as it achieves the purpose of initially segmenting the audio stream into a plurality of different paragraphs.
(3) The segmented fragments and segmentation information obtained in the paragraph segmenters 101 are input to the multi-microphone channel selection and fusion device 102, as shown in Fig. 6. In the present embodiment, the flow is mainly as follows:
3-1) Segmentation information synthesis: the union of the audio segmentation information obtained by the different microphones is taken to obtain comprehensive audio cut points. Suppose there are two microphones 1 and 2, as shown in Fig. 7; by integrating the respective segmentation information of microphone 1 and microphone 2, the synthesized cut points yield the segments (t1, t2), (t2, t3), (t3, t4), (t4, t5), (t5, t6), (t6, t7), (t7, t8).
3-2) Smoothing: to avoid the harmful effect of very short fragments on model training, fragments shorter than 1 s are split evenly between the two adjacent paragraphs.
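Step 3-2) can be sketched as follows. This is an illustrative helper under the assumption that "splitting evenly" means moving both boundaries of a short internal segment to its midpoint, which hands half of the segment to each neighbour:

```python
def smooth_segments(bounds, min_len=1.0):
    """Absorb internal segments shorter than min_len seconds into their two
    neighbours by replacing both of the segment's cut points with its midpoint.
    bounds: sorted list of cut points [t0, t1, ..., tn] in seconds."""
    b = list(bounds)
    i = 1
    while i < len(b) - 2:               # only internal segments have two neighbours
        if b[i + 1] - b[i] < min_len:
            b[i:i + 2] = [0.5 * (b[i] + b[i + 1])]   # short segment split evenly
        else:
            i += 1
    return b
```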
3-3) Channel selection: for each fragment, the data of the best microphone is selected to represent it. The selection criterion is formula (2), in which MaxEᵢ denotes the mean of the 10 largest short-time frame energies in the fragment, characterizing the speech signal energy, and MinEᵢ denotes the mean of the 10 smallest short-time frame energies, characterizing the energy of the non-speech signal. Formula (2) thus estimates the signal-to-noise ratio of the different microphones' data within the fragment and selects the channel with the greatest SNR to characterize it, yielding unified audio segmentation fragments.
$$i^* = \arg\min_i \frac{MinE_i}{MaxE_i} \qquad (2)$$
3-4) Audio synthesis: the data fragments from the different channels obtained in the previous step are concatenated to form a fused, unified audio data file.
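Step 3-3) can be sketched with the following illustrative NumPy helper; frame energies stand in for the short-time energy, and k = 10 values are averaged as in the text:

```python
import numpy as np

def pick_channel(channel_frames, k=10):
    """Channel selection of formula (2): for one fragment, return the index of
    the channel minimizing MinE_i / MaxE_i, i.e. maximizing the SNR proxy.
    channel_frames: list of (n_frames, frame_len) arrays, one per microphone."""
    ratios = []
    for frames in channel_frames:
        e = np.sort(np.sum(frames ** 2, axis=1))      # short-time frame energies
        ratios.append(e[:k].mean() / e[-k:].mean())   # MinE_i / MaxE_i
    return int(np.argmin(ratios))
```

For step 3-4), the fragment's samples from the winning channel would simply be appended to the output file.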
In addition, as introduced above, because the present embodiment uses multi-channel audio data, the cut points must be fused by the multi-microphone channel selection and fusion device 102. However, as also noted above, single-channel audio data may in practice be used instead, in which case the multi-microphone channel selection and fusion device 102 need not be provided.
(4) In the speech/non-speech discriminator 103, each audio fragment obtained in the multi-microphone channel selection and fusion device 102 is classified as speech or non-speech. In this process, a Gaussian mixture model is used as the classifier (see Reynolds, D.A. and Rose, R.C., Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing, 1995), and two audio models, speech and non-speech, are trained. Non-speech fragments are discarded by model classification, leaving the speech fragments. The main purpose of this speech/non-speech discrimination is to improve the efficiency of the downstream speaker clustering, so the speech/non-speech discriminator 103 may be provided or omitted as appropriate.
(5) Then, in the speaker clustering device 104, the speech fragments screened by the speech/non-speech discriminator 103 undergo supervised clustering with a known number of clusters (in the present embodiment, the number of speakers), based on the agglomerative hierarchical clustering algorithm shown in Fig. 5. For the distance measure, the commonly used Generalized Likelihood Ratio (GLR) distance is chosen as the between-class distance. The GLR distance is widely used in speaker segmentation, speaker clustering, speaker verification, and related fields, and has achieved good results. It is defined as in formula (3), where x and y denote the data of the two classes whose distance is computed, following distributions N(μ_x, Σ_x) and N(μ_y, Σ_y) respectively, and z denotes the data of x and y merged together, following N(μ_z, Σ_z).
$$d_{GLR}(x, y) = \frac{L(z; N(\mu_z, \Sigma_z))}{L(x; N(\mu_x, \Sigma_x))\,L(y; N(\mu_y, \Sigma_y))} \qquad (3)$$
For this supervised clustering step, the stopping criterion is that the number of cluster classes equals the maximum estimate of the number of speakers in the data being processed; for general meeting data this value is taken as an integer slightly larger than the number of participants, e.g. 8. After clustering, the speaker seed data classes, i.e. the initial clustering result, are obtained.
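Under single-Gaussian maximum-likelihood models, the negative log of the GLR in formula (3) reduces to a closed form in the covariance determinants, since the remaining likelihood terms cancel (the merged class has N_z = N_x + N_y samples). A hypothetical NumPy sketch of this between-class distance:

```python
import numpy as np

def glr_distance(x, y):
    """Negative log of the GLR of formula (3) under ML-fitted single Gaussians:
    (N_z*log|S_z| - N_x*log|S_x| - N_y*log|S_y|) / 2, with z the pooled data.
    Larger values mean the two clusters are less alike.
    x, y: (n_samples, D) feature arrays."""
    def n_logdet(a):
        cov = np.atleast_2d(np.cov(a, rowvar=False, bias=True))
        return len(a) * np.linalg.slogdet(cov)[1]
    z = np.vstack([x, y])
    return 0.5 * (n_logdet(z) - n_logdet(x) - n_logdet(y))
```

In the agglomerative loop of Fig. 5, the pair of clusters with the smallest `glr_distance` would be merged at each iteration until the class count reaches the speaker-number estimate.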
(6) Then the speaker seed data classes obtained in the speaker clustering device 104 and the fused audio file obtained in the multi-microphone channel selection and fusion device 102 are input to the model re-segmenter 105 for speaker segmentation. In the present embodiment, the flow, shown in Fig. 8, is mainly as follows:
6-1) An HMM is trained for each speaker seed data class, and together with an HMM characterizing non-speech the search space shown in Fig. 2 is constructed. The loop element is the parallel connection of the speaker models and the non-speech model, and the number of loop iterations runs from Min to Max: if the number of paragraphs obtained in the multi-microphone channel selection and fusion device 102 is SegNum, then Min denotes SegNum − Region and Max denotes SegNum + Region, where Region is the allowed up-and-down float of the paragraph count.
6-2) The fused audio file is Viterbi-aligned against the search space, segmenting the audio stream.
6-3) Each speaker's HMM is retrained from the model's segmentation and classification information, and the search space is reconstructed.
6-4) For more accurate segmentation, steps 6-2) and 6-3) may be repeated a certain number of times, yielding the final speaker segmentation result.
(7) The paragraphs resulting from the segmentation in step (6) are input to the speaker clustering device 106 for unsupervised clustering, yielding the final speaker clustering result. In the present embodiment the stopping criterion used is the BIC criterion, as in formula (4), where D is the feature dimension, N is the number of frames, and λ is a penalty factor. During clustering, when the BIC criterion between every pair of classes is negative, the clustering process stops. Because of the penalty factor in the BIC criterion, this value generally needs to be tuned on a development set. The end of clustering means that the final speaker clustering result has been obtained.
$$\Delta BIC = \log\frac{L(x_1, \ldots, x_N \mid \mu, \Sigma)}{L(x_1, \ldots, x_i \mid \mu_1, \Sigma_1)\,L(x_{i+1}, \ldots, x_N \mid \mu_2, \Sigma_2)} - \lambda P,$$
$$P = \frac{1}{2}\left(D + \frac{1}{2}D(D+1)\right)\log N \qquad (4)$$
In the present embodiment the system is evaluated on the RT-04S meeting data set released by NIST. The RT-04S data set comprises a development set and a test set, each containing data from four sources: CMU, ICSI, LDC, and NIST. The development set contains 8 meeting data files, with 10 minutes of each file manually annotated. The test set likewise contains 8 meeting data files, with 11 minutes of each file annotated; all data is converted to 16 kHz, 16-bit PCM format. System parameters are first tuned on the development set, and the system is then tested directly on the test set to produce the results.
Below, the flow process of audio editing method of the present invention is described.
S1: the audio data to be processed, recorded separately by a plurality of microphones, are fed into the audio feature extraction device; the audio stream is pre-processed, including windowing and framing, and the MFCC audio features are extracted for each frame of the audio signal;
S2: the audio features are fed into the distance-calculation segmenter, which obtains the initial speaker segmentation result of each audio file by calculating the distance between sliding windows;
S3: the segments and segmentation information obtained in S2 are input into the multi-microphone channel fusion device, which merges the segmentation information of the channels and then, for each fragment, selects the optimal channel data as its representative according to a signal-to-noise ratio (SNR) estimation method, thereby obtaining the initial segmentation and the unified audio file after multi-microphone fusion;
S4: the initial segments obtained in S3 are input into the speech/non-speech discriminator, which judges whether each fragment is a speech signal or a non-speech signal; after the non-speech fragments are removed, the initial speaker segmentation fragments are obtained;
S5: the speaker fragments obtained in S4 are input into the speaker clustering device for supervised clustering; clustering stops when the number of classes equals N, where N denotes the estimated maximum number of speakers contained in the audio; the initial speaker category data are thereby obtained;
S6: the speaker category data obtained in S5 are input into the model re-segmentation device, which trains a hidden Markov model (HMM) for each speaker and, together with one HMM characterizing non-speech signals, constructs the search space, as shown in Figure 2;
S7: in the search space constructed in S6, Viterbi alignment is performed on the fused audio file obtained by the multi-microphone fusion, realizing segmentation, and the audio attribute information of each segment of the audio stream (belonging to a certain speaker, or non-speech) is obtained;
S8: the speaker segmentation information obtained in S7 is used to retrain the HMMs of the different speakers;
S9: after steps S7 and S8 are repeated a certain number of times, the final speaker segmentation result is obtained;
S10: based on the speaker segmentation result obtained in S9, the speaker clustering device performs unsupervised clustering to obtain the speaker clustering result.
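The sliding-window distance calculation of step S2 can be sketched as follows. This is an illustrative sketch, not the patented implementation: the window length, hop, detection threshold and the single diagonal-covariance Gaussian per window are assumptions. Adjacent windows of MFCC features are each modeled as a Gaussian, and local maxima of the Bhattacharyya distance between them are marked as candidate speaker turning points.

```python
import numpy as np

def bhattacharyya_distance(x, y):
    """Bhattacharyya distance between two feature windows, each modeled
    as a single diagonal-covariance Gaussian."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    v1, v2 = x.var(axis=0) + 1e-8, y.var(axis=0) + 1e-8
    v = 0.5 * (v1 + v2)                      # averaged variances
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / v)
    term_cov = 0.5 * np.sum(np.log(v / np.sqrt(v1 * v2)))
    return term_mean + term_cov

def detect_change_points(feats, win=100, step=10, threshold=1.5):
    """Slide two adjacent windows over the feature sequence and mark
    local maxima of the distance curve that exceed the threshold."""
    dists, centers = [], []
    for start in range(0, len(feats) - 2 * win, step):
        left = feats[start:start + win]
        right = feats[start + win:start + 2 * win]
        dists.append(bhattacharyya_distance(left, right))
        centers.append(start + win)
    return [
        centers[i] for i in range(1, len(dists) - 1)
        if dists[i] > threshold
        and dists[i] >= dists[i - 1] and dists[i] >= dists[i + 1]
    ]
```

A detector of this kind over-generates boundaries, which is why it is followed by the BIC-based merging pass described below.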
In the above technical scheme, the pre-processing performed on the input audio stream in step S2 comprises digitization, pre-emphasis high-frequency boosting, framing and windowing.
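The pre-processing chain (pre-emphasis, framing, windowing) can be sketched as follows. The pre-emphasis coefficient 0.97 and the 25 ms / 10 ms frame geometry at 16 kHz are common choices assumed here, not values fixed by the patent.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, preemph=0.97):
    """Pre-emphasis high-frequency boost, then framing and Hamming
    windowing (frame_len=400, hop=160 give 25 ms / 10 ms at 16 kHz)."""
    # Pre-emphasis: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    # Stack overlapping, windowed frames ready for MFCC extraction
    return np.stack([
        emphasized[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
```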
In the above technical scheme, the segmentation procedure of step S3 comprises a cutting process based on the Bhattacharyya distance and a merging process based on the BIC criterion, as shown in Figure 4.
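The BIC-based merging decision for two adjacent segments can be sketched as follows. Modeling each segment with a single full-covariance Gaussian and using λ = 1 by default are conventional choices assumed here, not values taken from the patent; λ plays the role of the BIC penalty factor that the experiments tune.

```python
import numpy as np

def delta_bic(seg1, seg2, lam=1.0):
    """ΔBIC for merging two segments, each modeled by a single
    full-covariance Gaussian. Negative favors merging (same source);
    positive keeps the boundary between them."""
    n1, n2 = len(seg1), len(seg2)
    n, d = n1 + n2, seg1.shape[1]
    both = np.vstack([seg1, seg2])

    def logdet(x):
        # slogdet is numerically safer than log(det(...))
        return np.linalg.slogdet(np.cov(x, rowvar=False) + 1e-6 * np.eye(d))[1]

    # Penalty: lam/2 * (#free parameters of one Gaussian) * log n
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    data_term = 0.5 * (n * logdet(both)
                       - n1 * logdet(seg1) - n2 * logdet(seg2))
    return data_term - penalty
```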
In the above technical scheme, step S4 uses a Gaussian mixture model to realize the classification of speech and non-speech signals.
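The speech/non-speech decision can be sketched as a likelihood comparison between two class models. The patent uses Gaussian mixture models; for brevity this sketch uses a single diagonal-covariance Gaussian per class (a one-component GMM). A real system would train several components per class with EM.

```python
import numpy as np

class DiagGaussian:
    """Single diagonal-covariance Gaussian; stands in for one GMM
    component (a full GMM would mix several, trained by EM)."""
    def fit(self, x):
        self.mu = x.mean(axis=0)
        self.var = x.var(axis=0) + 1e-6
        return self

    def loglik(self, x):
        # Average per-frame log-likelihood of the segment
        z = (x - self.mu) ** 2 / self.var
        per_frame = -0.5 * (np.log(2 * np.pi * self.var).sum() + z.sum(axis=1))
        return per_frame.mean()

def classify_segment(seg, speech_model, nonspeech_model):
    """Label a segment by whichever class model scores it higher."""
    return ('speech' if speech_model.loglik(seg) > nonspeech_model.loglik(seg)
            else 'nonspeech')
```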
In the above technical scheme, step S5 uses the number of initial segments obtained in step S3 to control the loop count of the loop unit in the process of constructing the search space, as shown in Figure 2. Let the number of segments obtained in step S4 be SegNum; then Min denotes SegNum-Region and Max denotes SegNum+Region, where Region is the floating range of the segment number.
In the above technical scheme, the speaker clustering device used in steps S5 and S9 is based on a merge-based hierarchical clustering algorithm, as shown in Figure 5, and uses the Generalized Likelihood Ratio as the distance criterion between speaker classes; the stopping criterion of the unsupervised clustering in S9 is based on the BIC criterion.
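The merge-based hierarchical clustering with a GLR distance can be sketched as follows, using the supervised stopping rule of step S5 (stop at a given class count N). The O(n²) greedy pair search and the full-covariance Gaussian segment models are simplifying assumptions for illustration.

```python
import numpy as np

def glr_distance(seg1, seg2):
    """Generalized Likelihood Ratio distance between two segments,
    each modeled by a full-covariance Gaussian (no BIC penalty)."""
    n1, n2, d = len(seg1), len(seg2), seg1.shape[1]
    both = np.vstack([seg1, seg2])

    def logdet(x):
        return np.linalg.slogdet(np.cov(x, rowvar=False) + 1e-6 * np.eye(d))[1]

    return 0.5 * ((n1 + n2) * logdet(both)
                  - n1 * logdet(seg1) - n2 * logdet(seg2))

def cluster_to_n(segments, n_classes):
    """Greedy bottom-up merging: repeatedly merge the closest pair of
    clusters (by GLR) until exactly n_classes clusters remain."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > n_classes:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: glr_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

For the unsupervised pass of step S9, the fixed class-count stop would be replaced by stopping once the smallest pairwise ΔBIC becomes positive.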
Above, the technical scheme of the present invention has been described in detail. Through the technical scheme of the present invention, the initial segmentation information of the audio stream can be obtained through distance calculation; speaker clustering is used to obtain seed data classes for the speakers, and the corresponding speaker HMMs are trained; meanwhile, the number of fragments obtained by the initial segmentation controls the construction of the search space of the hidden Markov models, and the models are iteratively updated during segmentation, thereby improving the performance of speaker segmentation. Below, the evaluation metric of the system of the present invention is given to explain its technical effect.
The present embodiment uses the speaker Diarization Error Rate (DER) of the NIST official standard as the evaluation metric of the system. Speaker labeling errors may comprise three types: missed-detection errors (a speaker is marked at a certain time in the reference, but the system does not detect a speaker at that time), false-alarm errors (no speaker is marked at a certain time in the reference, but the system detects a speaker at that time), and speaker errors (the speaker marked in the reference and the speaker detected by the system are inconsistent). The DER therefore comprehensively counts the above three kinds of errors and is calculated according to formula (5).
DER = [ Σ_{all S} dur(S) · ( max(N_ref(S), N_sys(S)) − N_correct(S) ) ] / [ Σ_{all S} dur(S) · N_ref(S) ]    (5)
Wherein, dur(S) is the duration of segment S, N_ref(S) denotes the number of speakers marked by the reference in the segment interval, N_sys(S) denotes the number of speakers detected by the system in the segment interval, and N_correct(S) denotes the number of speakers for which the speaker marked in the reference and the speaker detected by the system correspond correctly in the segment interval.
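A frame-level sketch of formula (5) follows. It assumes at most one speaker per frame and that the optimal reference-to-system speaker mapping has already been applied; the official NIST scoring additionally handles overlapping speech and a timing collar.

```python
def diarization_error_rate(ref, hyp):
    """Frame-level DER per formula (5). ref and hyp are equal-length
    sequences mapping each frame to a speaker label, or None for
    non-speech. Each frame is a fixed-duration segment S."""
    error_time = 0.0
    ref_time = 0.0
    for r, h in zip(ref, hyp):
        n_ref = 0 if r is None else 1
        n_sys = 0 if h is None else 1
        n_correct = 1 if (r is not None and r == h) else 0
        # max(N_ref, N_sys) - N_correct counts miss, false alarm,
        # and speaker error in a single expression
        error_time += max(n_ref, n_sys) - n_correct
        ref_time += n_ref
    return error_time / ref_time
```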
The present invention can achieve a good effect when the speaking intervals of different speakers do not overlap seriously. The present embodiment first tunes the system parameters on the development set of RT-04S; under the experimental conditions shown in Table 1, the system obtains optimum performance on the development set, and Table 2 gives the performance comparison with and without the speaker segmentation method of the present invention. It can be seen that, on all data, using the speaker segmentation method of the present invention brings a certain degree of improvement to the performance of speaker clustering: the average DER drops from 36.68 to 32.51. Since the clustering flow and the parameter selection do not change between the two frameworks, the experimental results show that the introduction of the speaker segmentation method of the present invention has a certain corrective effect on the judgment of speaker turning points, leading to the improvement of the final speaker clustering performance. Under the same experimental parameter conditions, a similar comparison result is also obtained on the test set, as shown in Table 2, which shows that the speaker segmentation method of the present invention effectively improves the performance of the speaker clustering system.
Table 1 Experimental parameter conditions

Audio features: MFCC 13 + pitch, CMN
Initial speaker number: 8
Number of HMM Gaussians: 16
Unsupervised clustering BIC penalty factor: 9
Segment-count floating range (Region): 20
Table 2 Experimental comparison results
Although specific embodiments have been enumerated in the specification for explanation, these embodiments are not intended to limit the scope of the invention; all configurations and methods equivalent to the constitutive requirements of the present invention fall within the scope protected by the present invention.

Claims (14)

1. An audio editing system, comprising:
a reading device, which reads an audio stream into said audio editing system;
an initial segmentation device, which initially segments the audio stream read by said reading device into a plurality of different fragments;
an audio clustering device, which performs supervised clustering, based on a hierarchical clustering algorithm, on the plurality of fragments initially segmented by said initial segmentation device, and gathers fragments of the same nature into one class; and
a re-segmentation device, which, using the clustering result of said audio clustering device, trains a hidden Markov model corresponding to each class, performs Viterbi alignment segmentation on said audio stream, and obtains the audio stream after re-segmentation by class.
2. The audio editing system as claimed in claim 1, wherein
said initial segmentation device comprises:
an audio feature extraction device, for windowing said audio stream in order from front to back and extracting the feature information of the audio signal within each window; and
a distance-calculation segmenter, for adding sliding windows, in order from front to back, to said feature information from said audio feature extraction device and calculating the audio distance between adjacent windows, thereby segmenting the audio stream into a plurality of paragraphs at the audio attribute turning points of the audio stream.
3. The audio editing system as claimed in claim 2, wherein
fragments shorter than 1 s are evenly divided between the two adjacent paragraphs before and after them.
4. The audio editing system as claimed in claim 1, wherein
the system further comprises an audio validity discrimination device for discriminating, for each of the plurality of fragments obtained by said initial segmentation device, whether it is valid audio, and for deleting the fragments judged to be invalid audio, and
said audio clustering device performs said supervised clustering on the fragments remaining after the deletion by said audio validity discrimination device.
5. The audio editing system as claimed in claim 4, wherein
said invalid audio is blank audio or noise audio.
6. The audio editing system as claimed in claim 1, wherein
the system further comprises a re-clustering device, which performs unsupervised clustering on said audio stream segmented by said re-segmentation device.
7. An audio editing system, comprising:
a reading device, which reads the audio streams of a plurality of channels into said audio editing system;
a plurality of initial segmentation devices, which respectively initially segment the audio streams from the plurality of channels read by said reading device into a plurality of different paragraphs;
a multi-channel fusion device, which integrates the segmentation points of said plurality of initial segmentation devices, selects the audio stream of the optimal channel between every two adjacent segmentation points, thereby obtaining a plurality of initial segmentation fragments, and merges the obtained plurality of initial segmentation fragments into a unified audio data file;
an audio clustering device, which performs supervised clustering, based on a hierarchical clustering algorithm, on said plurality of initial segmentation fragments, and gathers initial segmentation fragments of the same nature into one class; and
a re-segmentation device, which, using the clustering result of said audio clustering device, trains a hidden Markov model corresponding to each class, performs Viterbi alignment segmentation on said unified audio file, and obtains the audio stream after re-segmentation by class.
8. The audio editing system as claimed in claim 7, wherein
each of said plurality of initial segmentation devices comprises:
an audio feature extraction device, for windowing said audio stream in order from front to back and extracting the feature information of the audio signal within each window; and
a distance-calculation segmenter, for adding sliding windows, in order from front to back, to said feature information from said audio feature extraction device and calculating the audio distance between adjacent windows, thereby segmenting the audio stream into a plurality of paragraphs at the audio attribute turning points of the audio stream.
9. The audio editing system as claimed in claim 8, wherein
fragments shorter than 1 s are evenly divided between the two adjacent paragraphs before and after them.
10. The audio editing system as claimed in claim 7, wherein
the system further comprises an audio validity discrimination device for discriminating, for each of said plurality of initial segmentation fragments obtained by said multi-channel fusion device, whether it is valid audio, and for deleting the initial segmentation fragments judged to be invalid audio, and
said audio clustering device performs said supervised clustering on the initial segmentation fragments remaining after the deletion by said audio validity discrimination device.
11. The audio editing system as claimed in claim 10, wherein
said invalid audio is blank audio or noise audio.
12. The audio editing system as claimed in claim 7, wherein
the system further comprises a re-clustering device, which performs unsupervised clustering on said audio stream segmented by said re-segmentation device.
13. An audio editing method in an audio editing system, comprising:
a reading step of reading an audio stream into said audio editing system;
an initial segmentation step of initially segmenting the audio stream read in said reading step into a plurality of different fragments;
an audio clustering step of performing supervised clustering, based on a hierarchical clustering algorithm, on the plurality of fragments initially segmented in said initial segmentation step, and gathering fragments of the same nature into one class; and
a re-segmentation step of training, using the clustering result of said audio clustering step, a hidden Markov model corresponding to each class, performing Viterbi alignment segmentation on said audio stream, and obtaining the audio stream after re-segmentation by class.
14. An audio editing method in an audio editing system, comprising:
a reading step of reading the audio streams of a plurality of channels into said audio editing system;
a segmentation step of respectively initially segmenting the audio streams of said plurality of channels into a plurality of different paragraphs;
a multi-channel fusion step of integrating the segmentation points obtained in said segmentation step, selecting the audio stream of the optimal channel between every two adjacent segmentation points, thereby obtaining a plurality of initial segmentation fragments, and merging the obtained plurality of initial segmentation fragments into a unified audio data file;
an audio clustering step of performing supervised clustering, based on a hierarchical clustering algorithm, on said plurality of initial segmentation fragments, and gathering initial segmentation fragments of the same nature into one class; and
a re-segmentation step of training, using the clustering result of said audio clustering step, a hidden Markov model corresponding to each class, performing Viterbi alignment segmentation on said unified audio file, and obtaining the audio stream after re-segmentation by class.
CN201010614165.XA 2010-12-24 audio editing system and audio editing method Expired - Fee Related CN102543080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010614165.XA CN102543080B (en) 2010-12-24 audio editing system and audio editing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010614165.XA CN102543080B (en) 2010-12-24 audio editing system and audio editing method

Publications (2)

Publication Number Publication Date
CN102543080A true CN102543080A (en) 2012-07-04
CN102543080B CN102543080B (en) 2016-12-14

Family


Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971677A (en) * 2013-02-01 2014-08-06 腾讯科技(深圳)有限公司 Acoustic language model training method and device
CN104159145A (en) * 2014-08-26 2014-11-19 中译语通科技(北京)有限公司 Automatic timeline generating method specific to lecture videos
CN104217715A (en) * 2013-08-12 2014-12-17 北京诺亚星云科技有限责任公司 Real-time voice sample detection method and system
CN104731913A (en) * 2015-03-23 2015-06-24 华南理工大学 GLR-based homologous audio advertisement retrieving method
CN104851423A (en) * 2014-02-19 2015-08-19 联想(北京)有限公司 Sound message processing method and device
US9396723B2 (en) 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN106528678A (en) * 2016-10-24 2017-03-22 腾讯音乐娱乐(深圳)有限公司 Song processing method and device
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107358969A (en) * 2017-07-19 2017-11-17 无锡冰河计算机科技发展有限公司 One kind recording fusion method
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109686382A (en) * 2018-12-29 2019-04-26 平安科技(深圳)有限公司 A kind of speaker clustering method and device
CN109903752A (en) * 2018-05-28 2019-06-18 华为技术有限公司 The method and apparatus for being aligned voice
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110491392A (en) * 2019-08-29 2019-11-22 广州国音智能科技有限公司 A kind of audio data cleaning method, device and equipment based on speaker's identity
CN111402919A (en) * 2019-12-12 2020-07-10 南京邮电大学 Game cavity style identification method based on multiple scales and multiple views
WO2021097666A1 (en) * 2019-11-19 2021-05-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing audio signals
CN113449626A (en) * 2021-06-23 2021-09-28 中国科学院上海高等研究院 Hidden Markov model vibration signal analysis method and device, storage medium and terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716380A (en) * 2005-07-26 2006-01-04 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN101136199A (en) * 2006-08-30 2008-03-05 国际商业机器公司 Voice data processing method and equipment
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716380A (en) * 2005-07-26 2006-01-04 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN101136199A (en) * 2006-08-30 2008-03-05 国际商业机器公司 Voice data processing method and equipment
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LÜ Ping, YAN Yonghong: "Automatic segmentation and classification algorithm for broadcast news corpus recognition", Journal of Electronics & Information Technology, 31 December 2006 (2006-12-31), pages 2292-2295 *
ZHANG Shilei et al.: "Speaker segmentation and clustering algorithm for broadcast speech", Proceedings of the 8th National Conference on Man-Machine Speech Communication, 31 October 2005 (2005-10-31), pages 248-252 *
ZHANG Wei et al.: "Multi-speaker segmentation and clustering for telephone speech", Journal of Tsinghua University (Science and Technology), 30 April 2008 (2008-04-30), pages 575-578 *
WANG Wei et al.: "An improved automatic speaker clustering algorithm based on hierarchical clustering", Acta Acustica, 31 January 2008 (2008-01-31), pages 9-14 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971677B (en) * 2013-02-01 2015-08-12 腾讯科技(深圳)有限公司 A kind of acoustics language model training method and device
US9396723B2 (en) 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN103971677A (en) * 2013-02-01 2014-08-06 腾讯科技(深圳)有限公司 Acoustic language model training method and device
CN104217715B (en) * 2013-08-12 2017-06-16 北京诺亚星云科技有限责任公司 A kind of real-time voice sample testing method and system
CN104217715A (en) * 2013-08-12 2014-12-17 北京诺亚星云科技有限责任公司 Real-time voice sample detection method and system
CN104851423A (en) * 2014-02-19 2015-08-19 联想(北京)有限公司 Sound message processing method and device
CN104159145A (en) * 2014-08-26 2014-11-19 中译语通科技(北京)有限公司 Automatic timeline generating method specific to lecture videos
CN104159145B (en) * 2014-08-26 2018-03-09 中译语通科技股份有限公司 A kind of time shaft automatic generation method for lecture video
CN104731913A (en) * 2015-03-23 2015-06-24 华南理工大学 GLR-based homologous audio advertisement retrieving method
CN104731913B (en) * 2015-03-23 2018-05-15 华南理工大学 A kind of homologous audio advertisement search method based on GLR
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN106528678A (en) * 2016-10-24 2017-03-22 腾讯音乐娱乐(深圳)有限公司 Song processing method and device
CN106528678B (en) * 2016-10-24 2019-07-23 腾讯音乐娱乐(深圳)有限公司 A kind of song processing method and processing device
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107358969A (en) * 2017-07-19 2017-11-17 无锡冰河计算机科技发展有限公司 One kind recording fusion method
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109903752A (en) * 2018-05-28 2019-06-18 华为技术有限公司 The method and apparatus for being aligned voice
US11631397B2 (en) 2018-05-28 2023-04-18 Huawei Technologies Co., Ltd. Voice alignment method and apparatus
CN109686382A (en) * 2018-12-29 2019-04-26 平安科技(深圳)有限公司 A kind of speaker clustering method and device
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110491392A (en) * 2019-08-29 2019-11-22 广州国音智能科技有限公司 A kind of audio data cleaning method, device and equipment based on speaker's identity
WO2021097666A1 (en) * 2019-11-19 2021-05-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing audio signals
CN111402919A (en) * 2019-12-12 2020-07-10 南京邮电大学 Game cavity style identification method based on multiple scales and multiple views
CN113449626A (en) * 2021-06-23 2021-09-28 中国科学院上海高等研究院 Hidden Markov model vibration signal analysis method and device, storage medium and terminal
CN113449626B (en) * 2021-06-23 2023-11-07 中国科学院上海高等研究院 Method and device for analyzing vibration signal of hidden Markov model, storage medium and terminal

Similar Documents

Publication Publication Date Title
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
CN105405439B (en) Speech playing method and device
Rouvier et al. An open-source state-of-the-art toolbox for broadcast news diarization
Harb et al. Voice-based gender identification in multimedia applications
Zhou et al. Efficient audio stream segmentation via the combined T/sup 2/statistic and Bayesian information criterion
Zhu et al. Combining speaker identification and BIC for speaker diarization
CA3033675A1 (en) Method and system for automatically diarising a sound recording
Butko et al. Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion
CN107342077A (en) A kind of speaker segmentation clustering method and system based on factorial analysis
CN101923855A (en) Test-irrelevant voice print identifying system
Reynolds et al. A study of new approaches to speaker diarization.
Johnson Who spoke when?-automatic segmentation and clustering for determining speaker turns.
CN103730112B (en) Multi-channel voice simulation and acquisition method
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
CN104021785A (en) Method of extracting speech of most important guest in meeting
Zewoudie et al. The use of long-term features for GMM-and i-vector-based speaker diarization systems
CN109545191A (en) The real-time detection method of voice initial position in a kind of song
Istrate et al. NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings
US7680654B2 (en) Apparatus and method for segmentation of audio data into meta patterns
CN101196888A (en) System and method for using digital audio characteristic set to specify audio frequency
Li et al. Unsupervised classification of speaker roles in multi-participant conversational speech
Magrin-Chagnolleau et al. Detection of target speakers in audio databases
CN102543080A (en) Audio editing system and audio editing method
CN109410968A (en) Voice initial position detection method in a kind of efficient song

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161214