CN103530432A - Conference recorder with speech extracting function and speech extracting method - Google Patents

Conference recorder with speech extracting function and speech extracting method

Info

Publication number
CN103530432A
CN103530432A (application CN201310439113.7A)
Authority
CN
China
Prior art keywords
speaker
voice
module
voice segments
segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310439113.7A
Other languages
Chinese (zh)
Inventor
王梓里
李艳雄
李广隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201310439113.7A
Publication of CN103530432A
Legal status: Pending

Abstract

The invention discloses a conference recorder with a speaker-speech extraction function. The recorder comprises a main control module, a recording and playback module, a removable storage module, an interaction and display module, and a speaker-speech processing module, where the speaker-speech processing module comprises a speaker segmentation module and a speaker clustering module. The main control module transmits the conference speech stream to the speaker segmentation module, which detects the speaker change points in the stream and splits it at those points into a number of speech segments; the speaker clustering module then clusters the segments by speaker using a spectral clustering method and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker. The recorder and the extraction method automatically extract each speaker's speech from the conference audio, are comprehensive in function, and are convenient to use.

Description

Conference recorder with a speech extraction function, and speech extraction method
Technical field
The present invention relates to the field of audio processing, and in particular to a conference recorder with a speech extraction function and to a speech extraction method.
Background technology
Conference recorders currently on the market offer only basic functions such as recording, playback, and file transfer; they cannot analyze or understand the content of a speaker's speech. When reviewing a meeting recording, a user who needs to collect and process the remarks of one particular speaker must listen to the entire recording and manually judge whether each passage belongs to the same speaker. Fast-forwarding saves time but risks missing useful information, and manually labeling and extracting the speech data is very inconvenient for the user.
Users therefore want a conference recorder that can not only record, play back, and transfer audio, but can also analyze and understand the recorded content, and in particular automatically extract each participant's speech from the conference audio.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a conference recorder with a speech extraction function that offers not only recording, playback, and file transfer, but also automatic extraction of each speaker's speech.
A further object of the present invention is to provide a speech extraction method that can determine the number of speakers and sort each speaker's speech.
The object of the invention is achieved by the following technical scheme: a conference recorder with a speech extraction function comprises a main control module, a recording and playback module, a removable storage module, and an interaction and display module, and further comprises a speaker-speech processing module; the speaker-speech processing module comprises a speaker segmentation module and a speaker clustering module, wherein:
Speaker segmentation module: the main control module transmits the conference speech stream to the speaker segmentation module, which detects the speaker change points in the stream and splits it at those points into a number of speech segments;
Speaker clustering module: using a spectral clustering method, it clusters the segments produced by the speaker segmentation module by speaker and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
The speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module, wherein:
Silence and speech-segment detection module: it finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
Audio feature extraction module: it splices the speech segments in order into one long speech segment and extracts audio features from it;
Speaker change-point detection module: using the extracted audio features, it judges the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion (BIC) and thereby detects the speaker change points;
Speech-segment splitting module: according to the detected speaker change points, it splits the speech stream into a number of speech segments, each containing only one speaker.
In the silence and speech-segment detection module, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment.
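As a concrete illustration, the threshold-based silence detection above can be sketched in a few lines of NumPy. The patent does not specify how the energy threshold is computed, so the fraction-of-peak-energy rule (`threshold_ratio`) and the function name are assumptions made for this sketch:

```python
import numpy as np

def detect_segments(signal, sr, frame_ms=32, threshold_ratio=0.1):
    """Threshold-based silence detection: split a signal into silent and
    speech segments by per-frame energy. The threshold rule is an assumed
    choice; the patent only says 'compute the energy threshold'."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)          # step (1): per-frame energy
    threshold = threshold_ratio * energy.max()  # step (2): energy threshold
    is_speech = energy >= threshold             # step (3): classify frames
    # merge runs of identical labels into (label, start_frame, end_frame)
    segments, start = [], 0
    for i in range(1, n_frames + 1):
        if i == n_frames or is_speech[i] != is_speech[start]:
            segments.append(("speech" if is_speech[start] else "silence", start, i))
            start = i
    return segments
```

A silent-speech-silent test signal then comes back as three segments, with adjacent frames of the same type merged in order, exactly as step (3) describes.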
In the audio feature extraction module, the audio features comprise Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs), both of which are well-known features in the field.
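Since the patent treats Delta-MFCCs as a known feature, a minimal sketch of the first-order difference computation is given here for completeness, using the common regression formula over N = 2 neighboring frames; the exact delta formula is not specified in the patent, so this choice is an assumption:

```python
import numpy as np

def delta_features(c, N=2):
    """First-order difference (Delta) of a cepstral feature matrix c of
    shape (T, M), using the standard regression formula
    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2).
    Edge frames are handled by replicating the first/last frame."""
    T = c.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.concatenate([np.repeat(c[:1], N, axis=0), c,
                             np.repeat(c[-1:], N, axis=0)], axis=0)
    d = np.zeros_like(c, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return d / denom
```

On a cepstral sequence that changes linearly over time, the interior delta values equal the slope, which is the sanity check one expects from a first-order difference.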
The recording and playback module comprises a microphone, a loudspeaker, and an audio processing chip.
The interaction and display module comprises a touch screen and its control circuit; it provides a graphical user interface with control functions and interacts with the user through the touch screen.
The removable storage module uses an SD card to store the data.
The further object of the present invention is achieved by the following technical scheme: a speech extraction method comprising the following steps in order:
(1) read in the speech stream: read in a speech stream containing the speech of several speakers;
(2) process the input speech stream with the speaker-speech processing module, which comprises a speaker segmentation module and a speaker clustering module;
(3) the speaker segmentation module detects the speaker change points in the speech stream and splits it at those points into a number of speech segments;
(4) the speaker clustering module uses a spectral clustering method to cluster the segments produced by the speaker segmentation module by speaker, and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
Step (3) specifically comprises the following steps:
a. the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module;
b. the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
c. the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it;
d. the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
e. the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker.
In step b, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment.
In step c, the audio features comprise Mel-frequency cepstral coefficients and their first-order differences.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
a. Convenient and time-saving: after the recorder collects speech data through the recording and playback module, it processes the audio automatically, separates the speakers, and sorts and stores each speaker's speech, so the user can directly select a particular speaker and that speaker's speech as needed.
b. Comprehensive functions: the recorder retains the functions of an ordinary conference recorder, such as recording, playback, and file transfer; in addition, speech data obtained elsewhere can be copied to the recorder through the removable storage module for analysis.
Brief description of the drawings
Fig. 1 is a block diagram of a conference recorder with a speaker-speech extraction function according to the present invention;
Fig. 2 is the workflow diagram of the conference recorder of Fig. 1;
Fig. 3 is the flow chart of the speech extraction method of the present invention.
Embodiment
The present invention is described in further detail below with reference to the embodiments and the drawings. As shown in Figs. 1 and 2, a conference recorder with a speaker-speech extraction function comprises a main control module, a recording and playback module, a removable storage module, and an interaction and display module, and further comprises a speaker-speech processing module; the speaker-speech processing module comprises a speaker segmentation module and a speaker clustering module, wherein:
The recording and playback module comprises a microphone, a loudspeaker, and an audio processing chip;
The interaction and display module comprises a touch screen and its control circuit; it provides a graphical user interface with control functions and interacts with the user through the touch screen;
The removable storage module uses an SD card to store the data;
The recording and playback module is responsible for recording and playing the audio data;
The main control module issues instructions and coordinates the work of the other modules; it is implemented on a microcomputer platform based on a Samsung S5PV210 processor running an embedded Linux system;
Speaker segmentation module: the main control module transmits the input speech stream, which contains the speech of several speakers, to the speaker segmentation module; the module detects the speaker change points in the stream and splits it at those points into a number of speech segments. It specifically comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module, wherein:
Silence and speech-segment detection module: it finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm, which comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment;
Audio feature extraction module: it splices the speech segments in order into one long speech segment and extracts audio features from it; the features comprise Mel-frequency cepstral coefficients and their first-order differences;
Speaker change-point detection module: the method of determining the speaker change points with the Bayesian information criterion specifically comprises the following steps:
(1) splice the speech segments obtained from silence detection in order into one long speech segment, and cut the long segment into data windows with a window length of 2 seconds and a window shift of 0.1 second. Divide each data window into frames with a frame length of 32 milliseconds and a frame shift of 16 milliseconds, and extract MFCCs and Delta-MFCCs from each frame of the speech signal; the dimension M of the MFCCs and of the Delta-MFCCs is 12, so the features of each data window form a feature matrix F of dimension d = 2M = 24;
(2) compute the BIC distance between two adjacent data windows x and y:

ΔBIC = (n_x + n_y) ln|det(cov(F_z))| - n_x ln|det(cov(F_x))| - n_y ln|det(cov(F_y))| - α (d + d(d+1)/2) ln(n_x + n_y)

where z is the data window obtained by merging x and y; n_x and n_y are the numbers of frames in x and y; F_x, F_y, and F_z are the feature matrices of x, y, and z; cov(F_x), cov(F_y), and cov(F_z) are their covariance matrices; det(·) denotes the determinant of a matrix; and α is a penalty coefficient with an empirical value of 2.0;
(3) if the BIC distance ΔBIC is greater than zero, the two data windows are regarded as belonging to two different speakers (i.e., a speaker change point lies between them); otherwise they are regarded as belonging to the same speaker and are merged;
(4) keep sliding the data window and judging whether the BIC distance between adjacent windows is greater than zero, saving the speaker change points, until the BIC distances between all adjacent data windows of the long speech segment have been judged;
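A minimal sketch of the ΔBIC computation of step (2), assuming each data window is given as an (n_frames × d) feature matrix; the function name and the use of NumPy's sample covariance are choices made for this illustration:

```python
import numpy as np

def delta_bic(x, y, alpha=2.0):
    """BIC distance between two adjacent data windows x and y, each an
    (n_frames, d) feature matrix, following the patent's formula.
    A positive value suggests a speaker change point between them."""
    z = np.vstack([x, y])                 # merged window
    nx, ny, d = len(x), len(y), x.shape[1]
    def logdet(f):
        # log of the determinant of the window's covariance matrix
        return np.log(abs(np.linalg.det(np.cov(f, rowvar=False))))
    penalty = alpha * (d + d * (d + 1) / 2) * np.log(nx + ny)
    return (nx + ny) * logdet(z) - nx * logdet(x) - ny * logdet(y) - penalty
```

With two windows drawn from clearly different distributions the distance comes out positive, and with two windows from the same distribution the penalty term dominates and it comes out negative, which is the decision rule of step (3).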
Speech-segment splitting module: according to the detected speaker change points, it splits the speech stream into a number of speech segments, each containing only one speaker;
In the speaker clustering module, the spectral clustering method specifically comprises the following steps:
(1) extract the Mel-frequency cepstral coefficients and their first-order differences (each of dimension M) from every frame of speech; the features of the j-th speech segment form a feature matrix F_j of dimension d = 2M;
(2) collect the feature matrices of all segments to be clustered into the set F = {F_1, ..., F_J}, where J is the total number of speech segments, and construct from F the affinity matrix A ∈ R^(J×J), whose (i, j)-th element A_ij is defined as:

A_ij = exp(-d²(F_i, F_j) / (2 σ_i σ_j)) for i ≠ j, and A_ij = 0 for i = j (1 ≤ i, j ≤ J)

where d(F_i, F_j) is the Euclidean distance between feature matrices F_i and F_j; the scale parameter σ_i (or σ_j) is defined as the variance of the vector of Euclidean distances between the i-th (or j-th) feature matrix and the other J - 1 feature matrices; and i, j index the speech segments;
(3) construct the diagonal matrix D, whose (i, i)-th element equals the sum of all elements of the i-th row of A, and from D and A construct the normalized affinity matrix L = D^(-1/2) A D^(-1/2);
(4) compute the K_max largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_Kmax of L and the corresponding eigenvectors v_k (column vectors), and estimate the number of speakers K from the differences between adjacent eigenvalues:

K = argmax_{i ∈ [1, K_max - 1]} (λ_i - λ_{i+1})

then, with the estimated number of speakers K, construct the matrix V = [v_1, v_2, ..., v_K] ∈ R^(J×K), where 1 ≤ k ≤ K_max;
(5) normalize every row of V to obtain the matrix Y ∈ R^(J×K), whose (j, k)-th element is:

Y_jk = V_jk / (Σ_{k=1}^{K} V_jk²)^(1/2), 1 ≤ j ≤ J;

(6) treat each row of Y as a point in the space R^K and cluster the rows into K classes with the K-means algorithm;
(7) when the j-th row of Y is clustered into the k-th class, the speech segment corresponding to F_j is assigned to the k-th class, i.e., the k-th speaker;
(8) from the clustering result, obtain the number of speakers, each speaker's speech duration, and each speaker's number of speech segments.
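The clustering steps (2) to (6) can be sketched as follows. This is an illustrative reading rather than the patent's implementation: each segment is summarized here by a single feature vector so that the Euclidean distance d(F_i, F_j) is well defined, and the farthest-point K-means initialization is an added choice:

```python
import numpy as np

def spectral_cluster(features, k_max=5):
    """Spectral clustering of per-segment feature vectors: affinity
    matrix, normalized affinity L = D^(-1/2) A D^(-1/2), eigengap
    estimate of K, row-normalized eigenvectors, then K-means."""
    X = np.asarray(features, dtype=float)   # (J, d): one vector per segment
    J = len(X)
    # step (2): pairwise distances, scale parameters, affinity matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = np.maximum([np.var(np.delete(dist[i], i)) for i in range(J)], 1e-12)
    A = np.exp(-dist ** 2 / (2 * np.outer(sigma, sigma)))
    np.fill_diagonal(A, 0.0)
    # step (3): normalized affinity matrix
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # step (4): largest eigenvalues; estimate K by the largest eigengap
    w, v = np.linalg.eigh(L)                # eigh returns ascending order
    w, v = w[::-1], v[:, ::-1]
    K = int(np.argmax(w[:k_max - 1] - w[1:k_max])) + 1
    # step (5): row-normalize the first K eigenvectors
    V = v[:, :K]
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)
    # step (6): K-means on the rows of Y, farthest-point initialization
    centers = [Y[0]]
    for _ in range(1, K):
        d2 = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(100):
        labels = np.argmin(np.linalg.norm(Y[:, None] - centers[None], axis=2), axis=1)
        new = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return K, labels
```

On two well-separated groups of segment vectors, the eigengap picks K = 2 and the labels recover the groups, matching steps (4), (6), and (7).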
As shown in Fig. 2, the workflow of the conference recorder with a speaker-speech extraction function is as follows:
1) the recorder is switched on and the system is initialized;
2) through the interaction and display module, the recorder displays the interactive interface;
3) through the interactive interface, the user chooses whether to record:
if recording, the main control module directs the recording and playback module to start recording; the recorded audio is stored in the removable storage module, and the recorder returns to the interactive interface when recording ends;
if not recording, the user selects a recorded file through the interactive interface, and the main control module directs the speaker-speech processing module, that is, the speaker segmentation module and the speaker clustering module, to segment and cluster the speakers' speech and extract each speaker's speech;
4) the interactive interface then asks the user whether to play the original audio:
if yes, the original audio is played;
if no, the interface further asks whether to play a particular speaker's speech: if yes, the user selects that speaker and the speaker's speech is played; if no, the recorder returns to the interactive interface.
As shown in Fig. 3, a speech extraction method comprises the following steps in order:
(1) read in the speech stream: read in a speech stream containing the speech of several speakers;
(2) process the input speech stream with the speaker-speech processing module, which comprises a speaker segmentation module and a speaker clustering module;
(3) the speaker segmentation module detects the speaker change points in the speech stream and splits it at those points into a number of speech segments; this specifically comprises the following steps:
a. the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module;
b. the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm, which comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment;
c. the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it; the features comprise Mel-frequency cepstral coefficients and their first-order differences;
d. the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
e. the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker;
(4) the speaker clustering module uses a spectral clustering method to cluster the segments produced by the speaker segmentation module by speaker, and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
The embodiment described above is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall fall within the protection scope of the present invention.

Claims (10)

1. A conference recorder with a speech extraction function, comprising a main control module, a recording and playback module, a removable storage module, and an interaction and display module, characterized in that it further comprises a speaker-speech processing module, the speaker-speech processing module comprising a speaker segmentation module and a speaker clustering module, wherein:
the speaker segmentation module: the main control module transmits the conference speech stream to the speaker segmentation module, which detects the speaker change points in the conference speech stream and splits it at those points into a number of speech segments;
the speaker clustering module: using a spectral clustering method, it clusters the segments produced by the speaker segmentation module by speaker and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
2. The conference recorder with a speech extraction function according to claim 1, characterized in that the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module, wherein:
the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it;
the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker.
3. The conference recorder with a speech extraction function according to claim 2, characterized in that, in the silence and speech-segment detection module, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment.
4. The conference recorder with a speech extraction function according to claim 2, characterized in that, in the audio feature extraction module, the audio features comprise Mel-frequency cepstral coefficients and their first-order differences.
5. The conference recorder with a speech extraction function according to claim 1, characterized in that the recording and playback module comprises a microphone, a loudspeaker, and an audio processing chip.
6. The conference recorder with a speech extraction function according to claim 1, characterized in that the interaction and display module comprises a touch screen and its control circuit, provides a graphical user interface with control functions, and interacts with the user through the touch screen.
7. The conference recorder with a speech extraction function according to claim 1, characterized in that the removable storage module uses an SD card to store the data.
8. A speech extraction method, comprising the following steps in order:
(1) read in the speech stream: read in a speech stream containing the speech of several speakers;
(2) process the input speech stream with a speaker-speech processing module, which comprises a speaker segmentation module and a speaker clustering module;
(3) the speaker segmentation module detects the speaker change points in the speech stream and splits it at those points into a number of speech segments;
(4) the speaker clustering module uses a spectral clustering method to cluster the segments produced by the speaker segmentation module by speaker, and splices each speaker's segments together in order, yielding the number of speakers and the speech of each speaker.
9. The speech extraction method according to claim 8, characterized in that step (3) specifically comprises the following steps:
a. the speaker segmentation module comprises a silence and speech-segment detection module, an audio feature extraction module, a speaker change-point detection module, and a speech-segment splitting module;
b. the silence and speech-segment detection module finds the silent segments and speech segments in the input speech stream using a threshold-based silence detection algorithm;
c. the audio feature extraction module splices the speech segments in order into one long speech segment and extracts audio features from it;
d. the speaker change-point detection module uses the extracted audio features to judge the similarity between adjacent data windows in the long speech segment according to the Bayesian information criterion, thereby detecting the speaker change points;
e. the speech-segment splitting module splits the speech stream at the detected change points into a number of speech segments, each containing only one speaker.
10. The speech extraction method according to claim 9, characterized in that, in step b, the threshold-based silence detection algorithm comprises the following steps in order:
(1) divide the input speech stream into frames, compute the energy of each frame, and obtain the energy feature vector of the stream;
(2) compute the energy threshold;
(3) compare the energy of each frame with the threshold: a frame below the threshold is a silent frame, otherwise it is a speech frame; adjacent silent frames are spliced in order into a silent segment, and adjacent speech frames are spliced in order into a speech segment;
and in step c, the audio features comprise Mel-frequency cepstral coefficients (MFCCs) and their first-order differences (Delta-MFCCs).
CN201310439113.7A 2013-09-24 2013-09-24 Conference recorder with speech extracting function and speech extracting method Pending CN103530432A (en)


Publications (1)

Publication Number Publication Date
CN103530432A (en) 2014-01-22

Family

ID=49932441


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
CN101211615A (en) * 2006-12-31 2008-07-02 于柏泉 Method, system and apparatus for automatic recording for specific human voice
CN102543063A (en) * 2011-12-07 2012-07-04 华南理工大学 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN102968986A (en) * 2012-11-07 2013-03-13 华南理工大学 Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice end node detection method and device
WO2016165346A1 (en) * 2015-09-16 2016-10-20 中兴通讯股份有限公司 Method and apparatus for storing and playing audio file
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN105161093A (en) * 2015-10-14 2015-12-16 科大讯飞股份有限公司 Method and system for determining the number of speakers
WO2017080235A1 (en) * 2015-11-15 2017-05-18 乐视控股(北京)有限公司 Audio recording editing method and recording device
CN105895102A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
CN106375182A (en) * 2016-08-22 2017-02-01 腾讯科技(深圳)有限公司 Voice communication method and device based on instant messaging application
CN106375182B (en) * 2016-08-22 2019-08-27 腾讯科技(深圳)有限公司 Voice communication method and device based on instant messaging application
CN107886955A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of personal identification method, device and the equipment of voice conversation sample
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
CN106610451B (en) * 2016-12-23 2019-01-04 杭州电子科技大学 Based on the extraction of the periodic signal fundamental frequency of cepstrum and Bayesian decision and matching process
WO2019183904A1 (en) * 2018-03-29 2019-10-03 华为技术有限公司 Method for automatically identifying different human voices in audio
CN109599120A (en) * 2018-12-25 2019-04-09 哈尔滨工程大学 One kind being based on large-scale farming field factory mammal abnormal sound monitoring method
CN109599120B (en) * 2018-12-25 2021-12-07 哈尔滨工程大学 Abnormal mammal sound monitoring method based on large-scale farm plant
CN109767757A (en) * 2019-01-16 2019-05-17 平安科技(深圳)有限公司 A kind of minutes generation method and device
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Conference content differentiating method, device, computer equipment and storage medium
WO2020147407A1 (en) * 2019-01-16 2020-07-23 平安科技(深圳)有限公司 Conference record generation method and apparatus, storage medium and computer device
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110197665A (en) * 2019-06-25 2019-09-03 广东工业大学 A kind of speech Separation and tracking for police criminal detection monitoring
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN110517694A (en) * 2019-09-06 2019-11-29 北京清帆科技有限公司 A kind of teaching scene voice conversion detection system
CN110689906A (en) * 2019-11-05 2020-01-14 江苏网进科技股份有限公司 Law enforcement detection method and system based on voice processing technology
CN110930984A (en) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN111883159A (en) * 2020-08-05 2020-11-03 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device
CN111968657A (en) * 2020-08-17 2020-11-20 北京字节跳动网络技术有限公司 Voice processing method and device, electronic equipment and computer readable medium
CN112053691A (en) * 2020-09-21 2020-12-08 广东迷听科技有限公司 Conference assisting method and device, electronic equipment and storage medium
CN112165599A (en) * 2020-10-10 2021-01-01 广州科天视畅信息科技有限公司 Automatic conference summary generation method for video conference
CN112382282A (en) * 2020-11-06 2021-02-19 北京五八信息技术有限公司 Voice denoising processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103530432A (en) Conference recorder with speech extracting function and speech extracting method
CN105405439B (en) Speech playing method and device
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
US9514751B2 (en) Speech recognition device and the operation method thereof
Heittola et al. Supervised model training for overlapping sound events based on unsupervised source separation
Eronen et al. Audio-based context recognition
US8793127B2 (en) Method and apparatus for automatically determining speaker characteristics for speech-directed advertising or other enhancement of speech-controlled devices or services
Temko et al. Acoustic event detection in meeting-room environments
US8867891B2 (en) Video concept classification using audio-visual grouplets
US8699852B2 (en) Video concept classification using video similarity scores
US7263485B2 (en) Robust detection and classification of objects in audio using limited training data
CN101247470B (en) Method realized by computer for detecting scene boundaries in videos
EP2642427A2 (en) Video concept classification using temporally-correlated grouplets
CN101470897B (en) Sensitive film detection method based on audio/video amalgamation policy
US20060224438A1 (en) Method and device for providing information
Imoto Introduction to acoustic event and scene analysis
US20220199099A1 (en) Audio Signal Processing Method and Related Product
KR100792016B1 (en) Apparatus and method for character based video summarization by audio and video contents analysis
Lailler et al. Semi-supervised and unsupervised data extraction targeting speakers: From speaker roles to fame?
CN104021785A (en) Method of extracting speech of most important guest in meeting
CN103559882A (en) Meeting presenter voice extracting method based on speaker division
CN107358947A (en) Speaker recognition methods and system again
WO2023088448A1 (en) Speech processing method and device, and storage medium
JP2008005167A (en) Device, method and program for classifying video image, and computer-readable recording medium
CN109997186B (en) Apparatus and method for classifying acoustic environments

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140122