CN103530432A - Conference recorder with speech extracting function and speech extracting method - Google Patents

Conference recorder with speech extracting function and speech extracting method

Info

Publication number
CN103530432A
CN103530432A (application CN201310439113.7A)
Authority
CN
China
Prior art keywords
speaker
voice
module
voice segments
segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310439113.7A
Other languages
Chinese (zh)
Inventor
王梓里
李艳雄
李广隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201310439113.7A
Publication of CN103530432A
Legal status: Pending

Abstract

The invention discloses a conference recorder with a speaker-speech extraction function. The recorder comprises a main control module, a recording and playback module, a removable storage module, an interaction and display module, and a speaker-speech processing module, where the speaker-speech processing module comprises a speaker segmentation module and a speaker clustering module. The main control module transmits the conference speech stream to the speaker segmentation module, which detects the speaker change points in the stream and splits it at those points into a number of speech segments; the speaker clustering module then clusters the segments by speaker using a spectral clustering method and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker. The recorder and the extraction method automatically extract each speaker's speech from the conference audio, are comprehensive in function, and are convenient to use.

Description

Conference recorder with a speech extraction function, and speech extraction method
Technical field
The present invention relates to the field of audio processing, and in particular to a conference recorder with a speech extraction function and to a speech extraction method.
Background technology
Conference recorders currently on the market offer only basic functions such as recording, playback, and file transfer; they cannot analyze or understand the content of a speaker's speech. When reviewing a meeting recording, a user who needs to collect and process the remarks of one particular speaker must listen to the entire recording and manually judge whether each passage belongs to the same speaker. Fast-forwarding saves time but risks missing useful information, and manually labeling and extracting the speech data is very inconvenient for the user.
Users therefore want a conference recorder that can not only record, play back, and transfer audio, but can also analyze and understand the recorded content, and in particular automatically extract each participant's speech from the conference audio.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a conference recorder with a speech extraction function that offers not only recording, playback, and file transfer, but also automatic extraction of each speaker's speech.
A further object of the present invention is to provide a speech extraction method that can determine the number of speakers and sort each speaker's speech.
The object of the invention is achieved by the following technical scheme: a conference recorder with a speech extraction function comprises a main control module, a recording and playback module, a removable storage module, and an interaction and display module, and further comprises a speaker-speech processing module; the speaker-speech processing module comprises a speaker segmentation module and a speaker clustering module, wherein:
Speaker segmentation module: the main control module transmits the conference speech stream to the speaker segmentation module, which detects the speaker change points in the stream and splits it at those points into a number of speech segments;
Speaker clustering module: using a spectral clustering method, it clusters the segments produced by the speaker segmentation module by speaker and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
The speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module, wherein:
Silence and speech-segment detection module: it finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
Audio feature extraction module: it splices the speech segments in order into one long speech segment and extracts audio features from it;
Speaker change-point detection module: using the extracted audio features, it judges the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion (BIC) and thereby detects the speaker change points;
Speech-segment splitting module: according to the detected speaker change points, it splits the speech stream into a number of speech segments, each containing only one speaker.
In the silence and speech-segment detection module, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment.
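As a concrete illustration, the threshold-based silence detection above can be sketched in a few lines of NumPy. The patent does not specify how the energy threshold is computed, so the fraction-of-peak-energy rule (`threshold_ratio`) and the function name are assumptions made for this sketch:

```python
import numpy as np

def detect_segments(signal, sr, frame_ms=32, threshold_ratio=0.1):
    """Threshold-based silence detection: split a signal into silent and
    speech segments by per-frame energy. The threshold rule is an assumed
    choice; the patent only says 'compute the energy threshold'."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)          # step (1): per-frame energy
    threshold = threshold_ratio * energy.max()  # step (2): energy threshold
    is_speech = energy >= threshold             # step (3): classify frames
    # merge runs of identical labels into (label, start_frame, end_frame)
    segments, start = [], 0
    for i in range(1, n_frames + 1):
        if i == n_frames or is_speech[i] != is_speech[start]:
            segments.append(("speech" if is_speech[start] else "silence", start, i))
            start = i
    return segments
```

A silent-speech-silent test signal then comes back as three segments, with adjacent frames of the same type merged in order, exactly as step (3) describes.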
In the audio feature extraction module, the audio features comprise Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs), both of which are well-known features in the field.
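Since the patent treats Delta-MFCCs as a known feature, a minimal sketch of the first-order difference computation is given here for completeness, using the common regression formula over N = 2 neighboring frames; the exact delta formula is not specified in the patent, so this choice is an assumption:

```python
import numpy as np

def delta_features(c, N=2):
    """First-order difference (Delta) of a cepstral feature matrix c of
    shape (T, M), using the standard regression formula
    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2).
    Edge frames are handled by replicating the first/last frame."""
    T = c.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.concatenate([np.repeat(c[:1], N, axis=0), c,
                             np.repeat(c[-1:], N, axis=0)], axis=0)
    d = np.zeros_like(c, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return d / denom
```

On a cepstral sequence that changes linearly over time, the interior delta values equal the slope, which is the sanity check one expects from a first-order difference.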
The recording and playback module comprises a microphone, a loudspeaker, and an audio processing chip.
The interaction and display module comprises a touch screen and its control circuit; it provides a graphical user interface with control functions and interacts with the user through the touch screen.
The removable storage module uses an SD card to store the data.
The further object of the present invention is achieved by the following technical scheme: a speech extraction method comprising the following steps in order:
(1) read in the speech stream: read in a speech stream containing the speech of several speakers;
(2) process the input speech stream with the speaker-speech processing module, which comprises a speaker segmentation module and a speaker clustering module;
(3) the speaker segmentation module detects the speaker change points in the speech stream and splits it at those points into a number of speech segments;
(4) the speaker clustering module uses a spectral clustering method to cluster the segments produced by the speaker segmentation module by speaker, and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
Step (3) specifically comprises the following steps:
a. the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module;
b. the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
c. the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it;
d. the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
e. the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker.
In step b, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment.
In step c, the audio features comprise Mel-frequency cepstral coefficients and their first-order differences.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
a. Convenient and time-saving: after the recorder collects speech data through the recording and playback module, it processes the audio automatically, separates the speakers, and sorts and stores each speaker's speech, so the user can directly select a particular speaker and that speaker's speech as needed.
b. Comprehensive functions: the recorder retains the functions of an ordinary conference recorder, such as recording, playback, and file transfer; in addition, speech data obtained elsewhere can be copied to the recorder through the removable storage module for analysis.
Brief description of the drawings
Fig. 1 is a block diagram of a conference recorder with a speaker-speech extraction function according to the present invention;
Fig. 2 is the workflow diagram of the conference recorder of Fig. 1;
Fig. 3 is the flow chart of the speech extraction method of the present invention.
Embodiment
The present invention is described in further detail below with reference to the embodiments and the drawings. As shown in Figs. 1 and 2, a conference recorder with a speaker-speech extraction function comprises a main control module, a recording and playback module, a removable storage module, and an interaction and display module, and further comprises a speaker-speech processing module; the speaker-speech processing module comprises a speaker segmentation module and a speaker clustering module, wherein:
The recording and playback module comprises a microphone, a loudspeaker, and an audio processing chip;
The interaction and display module comprises a touch screen and its control circuit; it provides a graphical user interface with control functions and interacts with the user through the touch screen;
The removable storage module uses an SD card to store the data;
The recording and playback module is responsible for recording and playing the audio data;
The main control module issues instructions and coordinates the work of the other modules; it is implemented on a microcomputer platform based on a Samsung S5PV210 processor running an embedded Linux system;
Speaker segmentation module: the main control module transmits the input speech stream, which contains the speech of several speakers, to the speaker segmentation module; the module detects the speaker change points in the stream and splits it at those points into a number of speech segments. It specifically comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module, wherein:
Silence and speech-segment detection module: it finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm, which comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment;
Audio feature extraction module: it splices the speech segments in order into one long speech segment and extracts audio features from it; the features comprise Mel-frequency cepstral coefficients and their first-order differences;
Speaker change-point detection module: the method of determining the speaker change points with the Bayesian information criterion specifically comprises the following steps:
(1) splice the speech segments obtained from silence detection in order into one long speech segment, and cut the long segment into data windows with a window length of 2 seconds and a window shift of 0.1 second. Divide each data window into frames with a frame length of 32 milliseconds and a frame shift of 16 milliseconds, and extract MFCCs and Delta-MFCCs from each frame of the speech signal; the dimension M of the MFCCs and of the Delta-MFCCs is 12, so the features of each data window form a feature matrix F of dimension d = 2M = 24;
(2) compute the BIC distance between two adjacent data windows x and y:

ΔBIC = (n_x + n_y) ln|det(cov(F_z))| - n_x ln|det(cov(F_x))| - n_y ln|det(cov(F_y))| - α (d + d(d+1)/2) ln(n_x + n_y)

where z is the data window obtained by merging x and y; n_x and n_y are the numbers of frames in x and y; F_x, F_y, and F_z are the feature matrices of x, y, and z; cov(F_x), cov(F_y), and cov(F_z) are their covariance matrices; det(·) denotes the determinant of a matrix; and α is a penalty coefficient with an empirical value of 2.0;
(3) if the BIC distance ΔBIC is greater than zero, the two data windows are regarded as belonging to two different speakers (i.e., a speaker change point lies between them); otherwise they are regarded as belonging to the same speaker and are merged;
(4) keep sliding the data window and judging whether the BIC distance between adjacent windows is greater than zero, saving the speaker change points, until the BIC distances between all adjacent data windows of the long speech segment have been judged;
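A minimal sketch of the ΔBIC computation of step (2), assuming each data window is given as an (n_frames × d) feature matrix; the function name and the use of NumPy's sample covariance are choices made for this illustration:

```python
import numpy as np

def delta_bic(x, y, alpha=2.0):
    """BIC distance between two adjacent data windows x and y, each an
    (n_frames, d) feature matrix, following the patent's formula.
    A positive value suggests a speaker change point between them."""
    z = np.vstack([x, y])                 # merged window
    nx, ny, d = len(x), len(y), x.shape[1]
    def logdet(f):
        # log of the determinant of the window's covariance matrix
        return np.log(abs(np.linalg.det(np.cov(f, rowvar=False))))
    penalty = alpha * (d + d * (d + 1) / 2) * np.log(nx + ny)
    return (nx + ny) * logdet(z) - nx * logdet(x) - ny * logdet(y) - penalty
```

With two windows drawn from clearly different distributions the distance comes out positive, and with two windows from the same distribution the penalty term dominates and it comes out negative, which is the decision rule of step (3).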
Speech-segment splitting module: according to the detected speaker change points, it splits the speech stream into a number of speech segments, each containing only one speaker;
In the speaker clustering module, the spectral clustering method specifically comprises the following steps:
(1) extract the Mel-frequency cepstral coefficients and their first-order differences (each of dimension M) from every frame of speech; the features of the j-th speech segment form a feature matrix F_j of dimension d = 2M;
(2) collect the feature matrices of all segments to be clustered into the set F = {F_1, ..., F_J}, where J is the total number of speech segments, and construct from F the affinity matrix A ∈ R^(J×J), whose (i, j)-th element A_ij is defined as:

A_ij = exp(-d²(F_i, F_j) / (2 σ_i σ_j)) for i ≠ j, and A_ij = 0 for i = j (1 ≤ i, j ≤ J)

where d(F_i, F_j) is the Euclidean distance between feature matrices F_i and F_j; the scale parameter σ_i (or σ_j) is defined as the variance of the vector of Euclidean distances between the i-th (or j-th) feature matrix and the other J - 1 feature matrices; and i, j index the speech segments;
(3) construct the diagonal matrix D, whose (i, i)-th element equals the sum of all elements of the i-th row of A, and from D and A construct the normalized affinity matrix L = D^(-1/2) A D^(-1/2);
(4) compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_Kmax of L and the corresponding eigenvectors v_k (column vectors), and estimate the number of speakers K from the differences between adjacent eigenvalues:

K = argmax_{i ∈ [1, K_max - 1]} (λ_i - λ_{i+1})

then, with the estimated number of speakers K, construct the matrix V = [v_1, v_2, ..., v_K] ∈ R^(J×K), where 1 ≤ k ≤ K_max;
(5) normalize every row of V to obtain the matrix Y ∈ R^(J×K), whose (j, k)-th element is:

Y_jk = V_jk / (Σ_{k=1}^{K} V_jk²)^(1/2), 1 ≤ j ≤ J;

(6) treat each row of Y as a point in the space R^K and cluster the rows into K classes with the K-means algorithm;
(7) when the j-th row of Y is clustered into the k-th class, the speech segment corresponding to F_j is assigned to the k-th class, i.e., the k-th speaker;
(8) from the clustering result, obtain the number of speakers, each speaker's speech duration, and each speaker's number of speech segments.
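The clustering steps (2) to (6) can be sketched as follows. This is an illustrative reading rather than the patent's implementation: each segment is summarized here by a single feature vector so that the Euclidean distance d(F_i, F_j) is well defined, and the farthest-point K-means initialization is an added choice:

```python
import numpy as np

def spectral_cluster(features, k_max=5):
    """Spectral clustering of per-segment feature vectors: affinity
    matrix, normalized affinity L = D^(-1/2) A D^(-1/2), eigengap
    estimate of K, row-normalized eigenvectors, then K-means."""
    X = np.asarray(features, dtype=float)   # (J, d): one vector per segment
    J = len(X)
    # step (2): pairwise distances, scale parameters, affinity matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = np.maximum([np.var(np.delete(dist[i], i)) for i in range(J)], 1e-12)
    A = np.exp(-dist ** 2 / (2 * np.outer(sigma, sigma)))
    np.fill_diagonal(A, 0.0)
    # step (3): normalized affinity matrix
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # step (4): largest eigenvalues; estimate K by the largest eigengap
    w, v = np.linalg.eigh(L)                # eigh returns ascending order
    w, v = w[::-1], v[:, ::-1]
    K = int(np.argmax(w[:k_max - 1] - w[1:k_max])) + 1
    # step (5): row-normalize the first K eigenvectors
    V = v[:, :K]
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)
    # step (6): K-means on the rows of Y, farthest-point initialization
    centers = [Y[0]]
    for _ in range(1, K):
        d2 = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(100):
        labels = np.argmin(np.linalg.norm(Y[:, None] - centers[None], axis=2), axis=1)
        new = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return K, labels
```

On two well-separated groups of segment vectors, the eigengap picks K = 2 and the labels recover the groups, matching steps (4), (6), and (7).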
As shown in Fig. 2, the workflow of the conference recorder with a speaker-speech extraction function is as follows:
1) the recorder is switched on and the system is initialized;
2) through the interaction and display module, the recorder displays the interactive interface;
3) through the interactive interface, the user chooses whether to record:
if recording, the main control module directs the recording and playback module to start recording; the recorded audio is stored in the removable storage module, and the recorder returns to the interactive interface when recording ends;
if not recording, the user selects a recorded file through the interactive interface, and the main control module directs the speaker-speech processing module, that is, the speaker segmentation module and the speaker clustering module, to segment and cluster the speakers' speech and extract each speaker's speech;
4) the interactive interface then asks the user whether to play the original audio:
if yes, the original audio is played;
if no, the interface further asks whether to play a particular speaker's speech: if yes, the user selects that speaker and the speaker's speech is played; if no, the recorder returns to the interactive interface.
As shown in Fig. 3, a speech extraction method comprises the following steps in order:
(1) read in the speech stream: read in a speech stream containing the speech of several speakers;
(2) process the input speech stream with the speaker-speech processing module, which comprises a speaker segmentation module and a speaker clustering module;
(3) the speaker segmentation module detects the speaker change points in the speech stream and splits it at those points into a number of speech segments; this specifically comprises the following steps:
a. the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module;
b. the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm, which comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment;
c. the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it; the features comprise Mel-frequency cepstral coefficients and their first-order differences;
d. the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
e. the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker;
(4) the speaker clustering module uses a spectral clustering method to cluster the segments produced by the speaker segmentation module by speaker, and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
The embodiment described above is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall fall within the protection scope of the present invention.

Claims (10)

1. A conference recorder with a speech extraction function, comprising a main control module, a recording and playback module, a removable storage module, and an interaction and display module, characterized in that it further comprises a speaker-speech processing module, the speaker-speech processing module comprising a speaker segmentation module and a speaker clustering module, wherein:
the speaker segmentation module: the main control module transmits the conference speech stream to the speaker segmentation module, which detects the speaker change points in the conference speech stream and splits it at those points into a number of speech segments;
the speaker clustering module: using a spectral clustering method, it clusters the segments produced by the speaker segmentation module by speaker and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
2. The conference recorder with a speech extraction function according to claim 1, characterized in that the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module, wherein:
the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it;
the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker.
3. The conference recorder with a speech extraction function according to claim 2, characterized in that, in the silence and speech-segment detection module, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment.
4. The conference recorder with a speech extraction function according to claim 2, characterized in that, in the audio feature extraction module, the audio features comprise Mel-frequency cepstral coefficients and their first-order differences.
5. The conference recorder with a speech extraction function according to claim 1, characterized in that the recording and playback module comprises a microphone, a loudspeaker, and an audio processing chip.
6. The conference recorder with a speech extraction function according to claim 1, characterized in that the interaction and display module comprises a touch screen and its control circuit, provides a graphical user interface with control functions, and interacts with the user through the touch screen.
7. The conference recorder with a speech extraction function according to claim 1, characterized in that the removable storage module uses an SD card to store the data.
8. A speech extraction method, comprising the following steps in order:
(1) read in the speech stream: read in a speech stream containing the speech of several speakers;
(2) process the input speech stream with a speaker-speech processing module, which comprises a speaker segmentation module and a speaker clustering module;
(3) the speaker segmentation module detects the speaker change points in the speech stream and splits it at those points into a number of speech segments;
(4) the speaker clustering module uses a spectral clustering method to cluster the segments produced by the speaker segmentation module by speaker, and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
9. The speech extraction method according to claim 8, characterized in that step (3) specifically comprises the following steps:
a. the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module;
b. the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
c. the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it;
d. the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
e. the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker.
10. The speech extraction method according to claim 9, characterized in that, in step b, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment;
and in step c, the audio features comprise Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs).
CN201310439113.7A 2013-09-24 2013-09-24 Conference recorder with speech extracting function and speech extracting method Pending CN103530432A (en)


Publications (1)

Publication Number Publication Date
CN103530432A (en) 2014-01-22

Family

ID=49932441


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
CN101211615A (en) * 2006-12-31 2008-07-02 于柏泉 Method, system and apparatus for automatic recording for specific human voice
CN102543063A (en) * 2011-12-07 2012-07-04 华南理工大学 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN102968986A (en) * 2012-11-07 2013-03-13 华南理工大学 Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice end node detection method and device
WO2016165346A1 (en) * 2015-09-16 2016-10-20 中兴通讯股份有限公司 Method and apparatus for storing and playing audio file
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN105161093A (en) * 2015-10-14 2015-12-16 科大讯飞股份有限公司 Method and system for determining the number of speakers
WO2017080235A1 (en) * 2015-11-15 2017-05-18 乐视控股(北京)有限公司 Audio recording editing method and recording device
CN105895102A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
CN106375182A (en) * 2016-08-22 2017-02-01 腾讯科技(深圳)有限公司 Voice communication method and device based on instant messaging application
CN106375182B (en) * 2016-08-22 2019-08-27 腾讯科技(深圳)有限公司 Voice communication method and device based on instant messaging application
CN107886955A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of personal identification method, device and the equipment of voice conversation sample
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
CN106610451B (en) * 2016-12-23 2019-01-04 杭州电子科技大学 Based on the extraction of the periodic signal fundamental frequency of cepstrum and Bayesian decision and matching process
WO2019183904A1 (en) * 2018-03-29 2019-10-03 华为技术有限公司 Method for automatically identifying different human voices in audio
CN109599120A (en) * 2018-12-25 2019-04-09 哈尔滨工程大学 One kind being based on large-scale farming field factory mammal abnormal sound monitoring method
CN109599120B (en) * 2018-12-25 2021-12-07 哈尔滨工程大学 Abnormal mammal sound monitoring method based on large-scale farm plant
CN109767757A (en) * 2019-01-16 2019-05-17 平安科技(深圳)有限公司 A kind of minutes generation method and device
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium
WO2020147407A1 (en) * 2019-01-16 2020-07-23 平安科技(深圳)有限公司 Conference record generation method and apparatus, storage medium and computer device
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110197665A (en) * 2019-06-25 2019-09-03 广东工业大学 A kind of speech Separation and tracking for police criminal detection monitoring
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN110517694A (en) * 2019-09-06 2019-11-29 北京清帆科技有限公司 A kind of teaching scene voice conversion detection system
CN110689906A (en) * 2019-11-05 2020-01-14 江苏网进科技股份有限公司 Law enforcement detection method and system based on voice processing technology
CN110930984A (en) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN111883159A (en) * 2020-08-05 2020-11-03 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device
CN111968657A (en) * 2020-08-17 2020-11-20 北京字节跳动网络技术有限公司 Voice processing method and device, electronic equipment and computer readable medium
CN112053691A (en) * 2020-09-21 2020-12-08 广东迷听科技有限公司 Conference assisting method and device, electronic equipment and storage medium
CN112165599A (en) * 2020-10-10 2021-01-01 广州科天视畅信息科技有限公司 Automatic conference summary generation method for video conference
CN112382282A (en) * 2020-11-06 2021-02-19 北京五八信息技术有限公司 Voice denoising processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103530432A (en) Conference recorder with speech extracting function and speech extracting method
CN105405439B (en) Speech playing method and device
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
US9514751B2 (en) Speech recognition device and the operation method thereof
Heittola et al. Supervised model training for overlapping sound events based on unsupervised source separation
Eronen et al. Audio-based context recognition
US8793127B2 (en) Method and apparatus for automatically determining speaker characteristics for speech-directed advertising or other enhancement of speech-controlled devices or services
Temko et al. Acoustic event detection in meeting-room environments
US8867891B2 (en) Video concept classification using audio-visual grouplets
US8699852B2 (en) Video concept classification using video similarity scores
US7263485B2 (en) Robust detection and classification of objects in audio using limited training data
CN101247470B (en) Method realized by computer for detecting scene boundaries in videos
EP2642427A2 (en) Video concept classification using temporally-correlated grouplets
CN101470897B (en) Sensitive film detection method based on audio/video amalgamation policy
US20060224438A1 (en) Method and device for providing information
Imoto Introduction to acoustic event and scene analysis
US20220199099A1 (en) Audio Signal Processing Method and Related Product
KR100792016B1 (en) Apparatus and method for character based video summarization by audio and video contents analysis
Lailler et al. Semi-supervised and unsupervised data extraction targeting speakers: From speaker roles to fame?
CN104021785A (en) Method of extracting speech of most important guest in meeting
CN103559882A (en) Meeting presenter voice extracting method based on speaker division
CN107358947A (en) Speaker recognition methods and system again
WO2023088448A1 (en) Speech processing method and device, and storage medium
JP2008005167A (en) Device, method and program for classifying video image, and computer-readable recording medium
CN109997186B (en) Apparatus and method for classifying acoustic environments

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140122