CN112165599A - Automatic conference summary generation method for video conference - Google Patents

Automatic conference summary generation method for video conference

Info

Publication number
CN112165599A
CN112165599A (application CN202011077651.2A)
Authority
CN
China
Prior art keywords
audio
conference
classes
clustering
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011077651.2A
Other languages
Chinese (zh)
Inventor
刘玉强
张军
吴伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ketianshichang Information Technology Co ltd
Original Assignee
Guangzhou Ketianshichang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Ketianshichang Information Technology Co ltd filed Critical Guangzhou Ketianshichang Information Technology Co ltd
Priority to CN202011077651.2A
Publication of CN112165599A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/14Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an automatic conference summary generation method for a video conference. The method comprises: step one, segmentation judgment, in which the mixed original audio of the video conference is segmented, conversion points are marked wherever adjacent audio does not belong to the same speaker, and audio fragments are encoded according to the conversion points; step two, clustering, in which all the segmented audio fragments are clustered so that fragments belonging to the same speaker are grouped together, and the clustered audio data is marked; step three, identification, in which the clustered audio data is identified; and step four, sound-to-text conversion, in which the clustered and identified audio data is converted into text, and the text files are generated and stored. The invention can be effectively applied to a video conference and automatically generates a text file for each speaker together with a summary text file, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving an obvious effect, and being convenient to popularize.

Description

Automatic conference summary generation method for video conference
Technical Field
The invention belongs to the technical field of communication, and particularly relates to an automatic conference summary generation method for a video conference.
Background
With the arrival of 5G and the digital era, audio and video applications based on 5G networks are increasingly common. As an important business application of this large-video era, the video conference is used ever more widely in governments, the military, state-owned enterprises, and large and medium-sized enterprises, and producing a conference summary from the speech of the different speakers in a video conference is a challenging task for conference recorders.
In the prior art, conference summaries for video conferences are mainly produced by recording at the audio input of the conference and then transcribing the recording manually. This scheme ignores the particularities of a video conference: the audio is recorded whether or not anyone is actually speaking, so invalid speech data is captured and conference resources are wasted. In the sound-to-text translation stage the conversion is likewise not intelligent: the user must select the audio files by hand before they can be converted, so the user experience is poor.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic conference summary generation method for a video conference whose steps are simple and convenient to implement. The method can be effectively applied to a video conference: it automatically classifies speakers by means of acoustic features and an AI identification module, and automatically generates a text file for each speaker together with a summary text file from the clustered audio files, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving a remarkable effect, and being convenient to popularize.
In order to solve the above technical problem, the invention adopts the following technical scheme: a method for automatically generating a conference summary for a video conference, the method comprising the following steps:
step one, segmentation judgment
Segment the mixed original audio of the video conference, extract voiceprint features, compare and identify adjacent segments of audio to judge whether they come from the same speaker; when they do not, mark a conversion point between the different speakers and encode audio fragments according to the conversion points;
step two, clustering
Cluster all the segmented audio fragments, grouping the fragments that belong to the same speaker together, and mark the clustered audio data;
step three, identification
Identify the clustered audio data, determining the participant corresponding to each piece of marked audio data by combining it with the participant information of the conference;
step four, sound-to-text conversion
Convert the clustered and identified audio data into text, and generate and store the text files.
In the above method for automatically generating a conference summary for a video conference, the segmentation judgment in step one comprises a distance-metric-based method and a model-search-based method.
In the above method for automatically generating a conference summary for a video conference, the specific process of the distance-metric-based method comprises:
Step A1, use a sliding-window mechanism in which the window length is fixed and the window moves forward by a fixed step;
Step A2, compute the feature vectors within the window together with their mean and variance;
Step A3, check whether the feature vectors in the window obey a Gaussian distribution; when they do, there is no conversion point, and when they do not, a conversion point exists.
In the above method for automatically generating a conference summary for a video conference, the specific process of the model-search-based method comprises:
Step B1, perform model training on each segmented section of audio;
Step B2, compute the Bayesian criterion value of the model corresponding to each section of audio;
Step B3, compare the Bayesian criterion values of the preceding and following sections of audio; when the difference is not greater than the threshold, there is no conversion point, and when the difference is greater than the threshold, a conversion point exists.
In the above method for automatically generating a conference summary for a video conference, the clustering method in step two comprises the agglomerative hierarchical clustering algorithm (AHC).
In the above method for automatically generating a conference summary for a video conference, the specific process of the agglomerative hierarchical clustering algorithm (AHC) comprises:
Step C1, initialize: each sample point is its own class, giving N classes; compute the distance between each pair of classes and set a distance threshold;
Step C2, compare the minimum inter-class distance with the distance threshold; when the minimum is smaller than the threshold, execute Step C3, and when it is not smaller, stop iterating;
Step C3, merge the two classes with the minimum distance into one class, so that there are N-1 classes;
Step C4, compute the distance between each pair of the N-1 classes and return to Step C2.
In the above method for automatically generating a conference summary for a video conference, the specific identification process in step three comprises: according to the number of speakers determined by clustering and the audio features associated with each participant's audio channel in the participant information, comparing and associating the audio feature information of the conference audio channels with the clustered information, thereby identifying the speaker information.
In the above method for automatically generating a conference summary for a video conference, the name, location, and region of the speaker are obtained from the speaker information.
In the above method for automatically generating a conference summary for a video conference, the text files in step four comprise an individual text file converted from the audio data of each speaker and a summary text file converted from the identified audio of all speakers.
Compared with the prior art, the invention has the following advantages:
1. The method has simple steps and is convenient to implement.
2. The method segments and slices the speech signal, extracts the voiceprint features of each fragment, and clusters the fragments with an agglomerative hierarchical clustering algorithm; the number of speakers can then be judged from the differences in the voiceprint features, and the speech is finally spliced to obtain the separated audio of each person. When comparing the degree of difference between the voiceprint features of the slices, the threshold set by the system is a parameter that is difficult to determine and requires repeated verification and optimization over a considerable number of meetings; this scheme therefore combines an AI training and recognition module to continuously optimize and adjust the threshold selection, guaranteeing the real-time performance and accuracy of the segmentation judgment.
3. The invention automatically translates the audio data into text information through a sound-to-text conversion module with AI machine-learning capability and generates M+1 text files according to the number M of identified speakers, where the last file is the summary text file of all speakers; the accuracy and integrity of the text content are thus guaranteed.
4. The invention can be effectively applied to a video conference: it automatically classifies speakers by means of acoustic features and an AI identification module and automatically generates a text file for each speaker together with a summary text file from the clustered audio files, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving a remarkable effect, and being convenient to popularize.
In conclusion, the method provided by the invention has simple steps, is convenient to implement, and can be effectively applied to a video conference; it automatically classifies speakers by means of acoustic features and an AI identification module and automatically generates a text file for each speaker together with a summary text file from the clustered audio files, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving an obvious effect, and being convenient to popularize.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of a distance metric based method of the present invention;
FIG. 3 is a flow chart of a method for model-based searching according to the present invention.
Detailed Description
As shown in fig. 1, the method for automatically generating a conference summary for a video conference according to the present invention includes the following steps:
step one, segmentation judgment
Segment the mixed original audio of the video conference, extract voiceprint features, compare and identify adjacent segments of audio to judge whether they come from the same speaker; when they do not, mark a conversion point between the different speakers and encode audio fragments according to the conversion points;
step two, clustering
Cluster all the segmented audio fragments, grouping the fragments that belong to the same speaker together, and mark the clustered audio data;
step three, identification
Identify the clustered audio data, determining the participant corresponding to each piece of marked audio data by combining it with the participant information of the conference;
In a specific implementation, the number of speakers in the conference can be judged from the differences in the voiceprint features.
step four, sound-to-text conversion
Convert the clustered and identified audio data into text, and generate and store the text files.
In the method, the segmentation judgment in step one comprises a distance-metric-based method and a model-search-based method.
In a specific implementation, the distance metric and the model-based search are combined, which guarantees the real-time performance and accuracy of the segmentation judgment.
In the method, as shown in fig. 2, the specific process of the distance-metric-based method comprises:
Step A1, use a sliding-window mechanism in which the window length is fixed and the window moves forward by a fixed step;
Step A2, compute the feature vectors within the window together with their mean and variance;
Step A3, check whether the feature vectors in the window obey a Gaussian distribution; when they do, there is no conversion point, and when they do not, a conversion point exists.
In a specific implementation, the speech features are assumed to be mutually independent and Gaussian-distributed, and the speech feature parameters of different speakers follow different probability distributions; therefore, when two speakers speak within one section of audio, the feature vectors no longer obey a single Gaussian distribution.
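Steps A1-A3 test whether the features inside a window still look like a single Gaussian. One common concrete realization compares the Gaussian fitted to each window with the one fitted to the next window and flags a conversion point where they diverge; the sketch below takes that form. It is a minimal illustration under stated assumptions: one-dimensional features, a symmetric KL divergence as the distance, and window length, hop, and threshold values chosen purely for illustration. None of these specifics come from the patent.

```python
import numpy as np

def gaussian_kl2(mu1, var1, mu2, var2):
    # Symmetric KL divergence between two 1-D Gaussians, used as the
    # "distance" between the feature distributions of adjacent windows.
    kl = lambda m1, v1, m2, v2: 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)

def detect_change_points(features, win=50, hop=10, threshold=5.0):
    """Slide a fixed-length window over a 1-D feature sequence (step A1),
    fit mean and variance per window (step A2), and flag a conversion
    point wherever adjacent windows diverge (step A3)."""
    points = []
    for start in range(0, len(features) - 2 * win, hop):
        a = features[start:start + win]
        b = features[start + win:start + 2 * win]
        d = gaussian_kl2(a.mean(), a.var() + 1e-8, b.mean(), b.var() + 1e-8)
        if d > threshold:
            points.append(start + win)  # boundary between the two windows
    return points
```

With clearly different speakers the flagged indices cluster around the true boundary; the threshold plays the same role as the system threshold discussed in the advantages section and would need tuning on real meetings.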
In the method, as shown in fig. 3, the specific process of the model-search-based method comprises:
Step B1, perform model training on each segmented section of audio;
Step B2, compute the Bayesian criterion value of the model corresponding to each section of audio;
Step B3, compare the Bayesian criterion values of the preceding and following sections of audio; when the difference is not greater than the threshold, there is no conversion point, and when the difference is greater than the threshold, a conversion point exists.
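A widely used concrete choice for the Bayesian value in steps B1-B3 is the Bayesian Information Criterion (BIC): the two sections are scored on whether one Gaussian or two separate Gaussians explain them better, with a penalty for the extra parameters. The patent does not name BIC explicitly, so the one-dimensional delta-BIC sketch below is an assumed realization, not the patented formula.

```python
import numpy as np

def delta_bic(x, y, penalty=1.0):
    """Delta-BIC score for the hypothesis that 1-D segments x and y were
    produced by two different Gaussians rather than a single one.
    A positive score suggests a conversion (speaker change) point."""
    z = np.concatenate([x, y])
    n = len(z)
    # Gaussian log-likelihood of a segment under its own ML fit, keeping
    # only the variance terms (the constants cancel between hypotheses).
    ll = lambda s: -0.5 * len(s) * np.log(s.var() + 1e-10)
    # two extra free parameters (one extra mean and variance) for the split
    return ll(x) + ll(y) - ll(z) - penalty * 0.5 * 2 * np.log(n)
```

On real audio, x and y would be sequences of acoustic feature values for the two candidate sections; the penalty weight is the usual tunable knob.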
In the method, the clustering method in step two comprises the agglomerative hierarchical clustering algorithm (AHC).
In the method, the specific process of the agglomerative hierarchical clustering algorithm (AHC) comprises:
Step C1, initialize: each sample point is its own class, giving N classes; compute the distance between each pair of classes and set a distance threshold;
Step C2, compare the minimum inter-class distance with the distance threshold; when the minimum is smaller than the threshold, execute Step C3, and when it is not smaller, stop iterating;
Step C3, merge the two classes with the minimum distance into one class, so that there are N-1 classes;
Step C4, compute the distance between each pair of the N-1 classes and return to Step C2.
In a specific implementation, the value of the distance threshold is critical for the AHC clustering algorithm: once the threshold is determined, the stopping point of the method is determined, which in turn fixes the number of classes, and that number is the number of speakers. The threshold therefore has to be obtained through repeated verification and optimization over a considerable number of meetings, and its selection is accordingly a process of continuous optimization and adjustment.
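The C1-C4 loop above can be sketched directly. This is a minimal illustration assuming Euclidean distance between class centroids; the patent fixes neither the distance measure nor the linkage, so those choices are assumptions.

```python
import numpy as np

def ahc(points, threshold):
    """Agglomerative hierarchical clustering as in steps C1-C4: start
    with one class per sample, repeatedly merge the two closest classes,
    and stop once the smallest inter-class distance reaches the threshold."""
    clusters = [[p] for p in points]  # C1: every sample is its own class
    while len(clusters) > 1:
        # C1/C4: compute all pairwise centroid distances, track the minimum
        best, bi, bj = None, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean(clusters[i], axis=0)
                cj = np.mean(clusters[j], axis=0)
                d = np.linalg.norm(ci - cj)
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best >= threshold:  # C2: stop when the minimum distance reaches the threshold
            break
        clusters[bi] = clusters[bi] + clusters[bj]  # C3: merge the two closest classes
        del clusters[bj]
    return clusters
```

Because iteration stops at the distance threshold, the number of surviving classes, and hence the estimated number of speakers, is controlled entirely by that threshold, which is exactly why the text stresses tuning it.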
In the method, the specific identification process in step three comprises: according to the number of speakers determined by clustering and the audio features associated with each participant's audio channel in the participant information, comparing and associating the audio feature information of the conference audio channels with the clustered information, thereby identifying the speaker information.
In the method, the speaker information includes a name, a location, and an area of the speaker.
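The feature comparison and association of step three can be illustrated with a small matching sketch. It assumes that each cluster and each participant channel has already been reduced to a fixed-length voiceprint vector and uses cosine similarity as the comparison measure; both the embedding step and the similarity choice are assumptions, since the patent specifies neither.

```python
import numpy as np

def identify_speakers(cluster_embeddings, participant_embeddings):
    """Match each clustered voiceprint to the closest enrolled participant
    voiceprint by cosine similarity.  cluster_embeddings maps cluster id
    to a vector; participant_embeddings maps participant name to a vector."""
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    mapping = {}
    for cluster_id, emb in cluster_embeddings.items():
        # associate the cluster with the most similar participant channel
        name = max(participant_embeddings, key=lambda n: cos(emb, participant_embeddings[n]))
        mapping[cluster_id] = name
    return mapping
```

Once a cluster is mapped to a participant, the name, location, and region can be read from that participant's entry in the conference information.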
In the method, the text files in step four comprise an individual text file converted from the audio data of each speaker and a summary text file converted from the identified audio of all speakers.
In a specific implementation, upon the conference-end mark when the meeting finishes, the audio data is automatically translated into text information by a sound-to-text conversion module with AI machine-learning capability; M+1 text files are generated according to the number M of identified speakers, with the last file being the summary text file of all speakers, which guarantees the accuracy and integrity of the text content.
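The M+1 file generation described above can be sketched as follows, assuming the sound-to-text stage has already produced time-stamped utterances per speaker. The file naming and the (timestamp, utterance) representation are illustrative assumptions.

```python
import os

def write_summary_files(speaker_texts, out_dir="."):
    """Write one transcript file per identified speaker plus a combined
    summary file (M + 1 files for M speakers).  speaker_texts maps a
    speaker name to a list of (timestamp, utterance) pairs."""
    paths = []
    for name, utterances in speaker_texts.items():
        path = os.path.join(out_dir, f"{name}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for ts, text in utterances:
                f.write(f"[{ts}] {text}\n")
        paths.append(path)
    # the last file is the combined summary of all speakers, in time order
    summary_path = os.path.join(out_dir, "summary.txt")
    all_lines = sorted(
        (ts, name, text)
        for name, utts in speaker_texts.items()
        for ts, text in utts
    )
    with open(summary_path, "w", encoding="utf-8") as f:
        for ts, name, text in all_lines:
            f.write(f"[{ts}] {name}: {text}\n")
    paths.append(summary_path)
    return paths
```

For M identified speakers this yields M individual transcripts plus one merged summary, matching the M+1 files described above.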
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention; all simple modifications, changes, and equivalent structural changes made to the above embodiment according to the technical essence of the invention still fall within the protection scope of the technical solution of the invention.

Claims (9)

1. A method for automatically generating a conference summary for a video conference, the method comprising the steps of:
step one, segmentation judgment
Segment the mixed original audio of the video conference, extract voiceprint features, compare and identify adjacent segments of audio to judge whether they come from the same speaker; when they do not, mark a conversion point between the different speakers and encode audio fragments according to the conversion points;
step two, clustering
Cluster all the segmented audio fragments, grouping the fragments that belong to the same speaker together, and mark the clustered audio data;
step three, identification
Identify the clustered audio data, determining the participant corresponding to each piece of marked audio data by combining it with the participant information of the conference;
step four, sound-to-text conversion
Convert the clustered and identified audio data into text, and generate and store the text files.
2. The method for automatically generating a conference summary for a video conference according to claim 1, wherein the segmentation judgment in step one comprises a distance-metric-based method and a model-search-based method.
3. The method for automatically generating a conference summary for a video conference according to claim 2, wherein the specific process of the distance-metric-based method comprises:
Step A1, use a sliding-window mechanism in which the window length is fixed and the window moves forward by a fixed step;
Step A2, compute the feature vectors within the window together with their mean and variance;
Step A3, check whether the feature vectors in the window obey a Gaussian distribution; when they do, there is no conversion point, and when they do not, a conversion point exists.
4. The method for automatically generating a conference summary for a video conference according to claim 2, wherein the specific process of the model-search-based method comprises:
Step B1, perform model training on each segmented section of audio;
Step B2, compute the Bayesian criterion value of the model corresponding to each section of audio;
Step B3, compare the Bayesian criterion values of the preceding and following sections of audio; when the difference is not greater than the threshold, there is no conversion point, and when the difference is greater than the threshold, a conversion point exists.
5. The method for automatically generating a conference summary for a video conference according to claim 1, wherein the clustering in step two comprises the agglomerative hierarchical clustering algorithm (AHC).
6. The method for automatically generating a conference summary for a video conference according to claim 5, wherein the specific process of the agglomerative hierarchical clustering algorithm (AHC) comprises:
Step C1, initialize: each sample point is its own class, giving N classes; compute the distance between each pair of classes and set a distance threshold;
Step C2, compare the minimum inter-class distance with the distance threshold; when the minimum is smaller than the threshold, execute Step C3, and when it is not smaller, stop iterating;
Step C3, merge the two classes with the minimum distance into one class, so that there are N-1 classes;
Step C4, compute the distance between each pair of the N-1 classes and return to Step C2.
7. The method for automatically generating a conference summary for a video conference according to claim 1, wherein the identification in step three comprises: according to the number of speakers determined by clustering and the audio features associated with each participant's audio channel in the participant information, comparing and associating the audio feature information of the conference audio channels with the clustered information, thereby identifying the speaker information.
8. The method of claim 7, wherein the speaker information comprises a name, a location, and an area of the speaker.
9. The method for automatically generating a conference summary for a video conference according to claim 8, wherein the text files in step four comprise an individual text file converted from the audio data of each speaker and a summary text file converted from the identified audio of all speakers.
CN202011077651.2A 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference Pending CN112165599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011077651.2A CN112165599A (en) 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011077651.2A CN112165599A (en) 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference

Publications (1)

Publication Number Publication Date
CN112165599A 2021-01-01

Family

ID=73867950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011077651.2A Pending CN112165599A (en) 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference

Country Status (1)

Country Link
CN (1) CN112165599A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113707130A (en) * 2021-08-16 2021-11-26 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
WO2022161264A1 (en) * 2021-01-26 2022-08-04 阿里巴巴集团控股有限公司 Audio signal processing method, conference recording and presentation method, device, system, and medium
CN115100701A (en) * 2021-03-08 2022-09-23 福建福清核电有限公司 Conference speaker identity identification method based on artificial intelligence technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030048946A1 (en) * 2001-09-07 2003-03-13 Fuji Xerox Co., Ltd. Systems and methods for the automatic segmentation and clustering of ordered information
CN103530432A (en) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extracting function and speech extracting method
US9584946B1 (en) * 2016-06-10 2017-02-28 Philip Scott Lyren Audio diarization system that segments audio input
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110491392A (en) * 2019-08-29 2019-11-22 广州国音智能科技有限公司 A kind of audio data cleaning method, device and equipment based on speaker's identity


Similar Documents

Publication Publication Date Title
CN112165599A (en) Automatic conference summary generation method for video conference
Lu et al. Speaker change detection and tracking in real-time news broadcasting analysis
CN103700370A (en) Broadcast television voice recognition method and system
WO2021073116A1 (en) Method and apparatus for generating legal document, device and storage medium
WO2020238209A1 (en) Audio processing method, system and related device
CN101529500A (en) Content summarizing system, method, and program
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
US20220375492A1 (en) End-To-End Speech Diarization Via Iterative Speaker Embedding
CN112633241B (en) News story segmentation method based on multi-feature fusion and random forest model
CN101867742A (en) Television system based on sound control
Lu et al. Unsupervised speaker segmentation and tracking in real-time audio content analysis
CN101950564A (en) Remote digital voice acquisition, analysis and identification system
CN114022923A (en) Intelligent collecting and editing system
TWI769520B (en) Multi-language speech recognition and translation method and system
CN113936236A (en) Video entity relationship and interaction identification method based on multi-modal characteristics
CN116996337B (en) Conference data management system and method based on Internet of things and microphone switching technology
Imoto et al. Acoustic scene classification based on generative model of acoustic spatial words for distributed microphone array
CN110322883B (en) Voice-to-text effect evaluation optimization method
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
CN114547264A (en) News diagram data identification method based on Mahalanobis distance and comparison learning
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
Wu et al. Universal Background Models for Real-time Speaker Change Detection.
CN110400578A (en) The generation of Hash codes and its matching process, device, electronic equipment and storage medium
CN111914777B (en) Method and system for identifying robot instruction in cross-mode manner
CN117316165B (en) Conference audio analysis processing method and system based on time sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210101