CN112165599A - Automatic conference summary generation method for video conference - Google Patents

Automatic conference summary generation method for video conference

Info

Publication number
CN112165599A
CN112165599A (application CN202011077651.2A)
Authority
CN
China
Prior art keywords
audio
conference
classes
clustering
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011077651.2A
Other languages
Chinese (zh)
Inventor
刘玉强
张军
吴伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ketianshichang Information Technology Co ltd
Original Assignee
Guangzhou Ketianshichang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Ketianshichang Information Technology Co ltd filed Critical Guangzhou Ketianshichang Information Technology Co ltd
Priority to CN202011077651.2A
Publication of CN112165599A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/14Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an automatic conference summary generation method for a video conference. The method comprises: step one, segmentation judgment, in which the mixed original audio of the video conference is segmented, conversion points are marked wherever adjacent audio does not belong to the same speaker, and audio fragments are encoded according to the conversion points; step two, clustering, in which all the segmented audio fragments are clustered so that fragments belonging to the same speaker are grouped together, and the clustered audio data is marked; step three, identification, in which the clustered audio data is identified; and step four, sound-to-text conversion, in which the clustered and identified audio data is converted into text, and the text files are generated and stored. The invention can be effectively applied to a video conference and automatically generates a text file for each speaker together with a summary text file, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving an obvious effect, and being convenient to popularize.

Description

Automatic conference summary generation method for video conference
Technical Field
The invention belongs to the technical field of communication, and particularly relates to an automatic conference summary generation method for a video conference.
Background
With the arrival of 5G and the digital era, audio and video applications based on 5G networks are increasingly common. As an important business application of this large-video era, the video conference is used ever more widely in governments, the military, state-owned enterprises, and large and medium-sized enterprises, and producing a conference summary from the speech of the different speakers in a video conference is a challenging task for conference recorders.
In the prior art, conference summaries for video conferences are mainly produced by recording at the audio input of the conference and then transcribing the recording manually. This scheme ignores the particularities of a video conference: the audio is recorded whether or not anyone is actually speaking, so invalid speech data is captured and conference resources are wasted. In the sound-to-text translation stage the conversion is likewise not intelligent: the user must select the audio files by hand before they can be converted, so the user experience is poor.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an automatic conference summary generation method for a video conference whose steps are simple and convenient to implement. The method can be effectively applied to a video conference: it automatically classifies speakers by means of acoustic features and an AI identification module, and automatically generates a text file for each speaker together with a summary text file from the clustered audio files, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving a remarkable effect, and being convenient to popularize.
In order to solve the above technical problem, the invention adopts the following technical scheme: a method for automatically generating a conference summary for a video conference, the method comprising the following steps:
step one, segmentation judgment
Segment the mixed original audio of the video conference, extract voiceprint features, compare and identify adjacent segments of audio to judge whether they come from the same speaker; when they do not, mark a conversion point between the different speakers and encode audio fragments according to the conversion points;
step two, clustering
Cluster all the segmented audio fragments, grouping the fragments that belong to the same speaker together, and mark the clustered audio data;
step three, identification
Identify the clustered audio data, determining the participant corresponding to each piece of marked audio data by combining it with the participant information of the conference;
step four, sound-to-text conversion
Convert the clustered and identified audio data into text, and generate and store the text files.
In the above method for automatically generating a conference summary for a video conference, the segmentation judgment in step one comprises a distance-metric-based method and a model-search-based method.
In the above method for automatically generating a conference summary for a video conference, the specific process of the distance-metric-based method comprises:
Step A1, use a sliding-window mechanism in which the window length is fixed and the window moves forward by a fixed step;
Step A2, compute the feature vectors within the window together with their mean and variance;
Step A3, check whether the feature vectors in the window obey a Gaussian distribution; when they do, there is no conversion point, and when they do not, a conversion point exists.
In the above method for automatically generating a conference summary for a video conference, the specific process of the model-search-based method comprises:
Step B1, perform model training on each segmented section of audio;
Step B2, compute the Bayesian criterion value of the model corresponding to each section of audio;
Step B3, compare the Bayesian criterion values of the preceding and following sections of audio; when the difference is not greater than the threshold, there is no conversion point, and when the difference is greater than the threshold, a conversion point exists.
In the above method for automatically generating a conference summary for a video conference, the clustering method in step two comprises the agglomerative hierarchical clustering algorithm (AHC).
In the above method for automatically generating a conference summary for a video conference, the specific process of the agglomerative hierarchical clustering algorithm (AHC) comprises:
Step C1, initialize: each sample point is its own class, giving N classes; compute the distance between each pair of classes and set a distance threshold;
Step C2, compare the minimum inter-class distance with the distance threshold; when the minimum is smaller than the threshold, execute Step C3, and when it is not smaller, stop iterating;
Step C3, merge the two classes with the minimum distance into one class, so that there are N-1 classes;
Step C4, compute the distance between each pair of the N-1 classes and return to Step C2.
In the above method for automatically generating a conference summary for a video conference, the specific identification process in step three comprises: according to the number of speakers determined by clustering and the audio features associated with each participant's audio channel in the participant information, comparing and associating the audio feature information of the conference audio channels with the clustered information, thereby identifying the speaker information.
In the above method for automatically generating a conference summary for a video conference, the name, location, and region of the speaker are obtained from the speaker information.
In the above method for automatically generating a conference summary for a video conference, the text files in step four comprise an individual text file converted from the audio data of each speaker and a summary text file converted from the identified audio of all speakers.
Compared with the prior art, the invention has the following advantages:
1. The method has simple steps and is convenient to implement.
2. The method segments and slices the speech signal, extracts the voiceprint features of each fragment, and clusters the fragments with an agglomerative hierarchical clustering algorithm; the number of speakers can then be judged from the differences in the voiceprint features, and the speech is finally spliced to obtain the separated audio of each person. When comparing the degree of difference between the voiceprint features of the slices, the threshold set by the system is a parameter that is difficult to determine and requires repeated verification and optimization over a considerable number of meetings; this scheme therefore combines an AI training and recognition module to continuously optimize and adjust the threshold selection, guaranteeing the real-time performance and accuracy of the segmentation judgment.
3. The invention automatically translates the audio data into text information through a sound-to-text conversion module with AI machine-learning capability and generates M+1 text files according to the number M of identified speakers, where the last file is the summary text file of all speakers; the accuracy and integrity of the text content are thus guaranteed.
4. The invention can be effectively applied to a video conference: it automatically classifies speakers by means of acoustic features and an AI identification module and automatically generates a text file for each speaker together with a summary text file from the clustered audio files, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving a remarkable effect, and being convenient to popularize.
In conclusion, the method provided by the invention has simple steps, is convenient to implement, and can be effectively applied to a video conference; it automatically classifies speakers by means of acoustic features and an AI identification module and automatically generates a text file for each speaker together with a summary text file from the clustered audio files, thereby freeing the hands of the conference recorder, improving the output efficiency of the conference summary and the user experience, achieving an obvious effect, and being convenient to popularize.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of a distance metric based method of the present invention;
FIG. 3 is a flow chart of a method for model-based searching according to the present invention.
Detailed Description
As shown in fig. 1, the method for automatically generating a conference summary for a video conference according to the present invention includes the following steps:
step one, segmentation judgment
Segment the mixed original audio of the video conference, extract voiceprint features, compare and identify adjacent segments of audio to judge whether they come from the same speaker; when they do not, mark a conversion point between the different speakers and encode audio fragments according to the conversion points;
step two, clustering
Cluster all the segmented audio fragments, grouping the fragments that belong to the same speaker together, and mark the clustered audio data;
step three, identification
Identify the clustered audio data, determining the participant corresponding to each piece of marked audio data by combining it with the participant information of the conference;
In a specific implementation, the number of speakers in the conference can be judged from the differences in the voiceprint features.
step four, sound-to-text conversion
Convert the clustered and identified audio data into text, and generate and store the text files.
In the method, the segmentation judgment in step one comprises a distance-metric-based method and a model-search-based method.
In a specific implementation, the distance metric and the model-based search are combined, which guarantees the real-time performance and accuracy of the segmentation judgment.
In the method, as shown in fig. 2, the specific process of the distance-metric-based method comprises:
Step A1, use a sliding-window mechanism in which the window length is fixed and the window moves forward by a fixed step;
Step A2, compute the feature vectors within the window together with their mean and variance;
Step A3, check whether the feature vectors in the window obey a Gaussian distribution; when they do, there is no conversion point, and when they do not, a conversion point exists.
In a specific implementation, the speech features are assumed to be mutually independent and Gaussian-distributed, and the speech feature parameters of different speakers follow different probability distributions; therefore, when two speakers speak within one section of audio, the feature vectors no longer obey a single Gaussian distribution.
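Steps A1-A3 test whether the features inside a window still look like a single Gaussian. One common concrete realization compares the Gaussian fitted to each window with the one fitted to the next window and flags a conversion point where they diverge; the sketch below takes that form. It is a minimal illustration under stated assumptions: one-dimensional features, a symmetric KL divergence as the distance, and window length, hop, and threshold values chosen purely for illustration. None of these specifics come from the patent.

```python
import numpy as np

def gaussian_kl2(mu1, var1, mu2, var2):
    # Symmetric KL divergence between two 1-D Gaussians, used as the
    # "distance" between the feature distributions of adjacent windows.
    kl = lambda m1, v1, m2, v2: 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)

def detect_change_points(features, win=50, hop=10, threshold=5.0):
    """Slide a fixed-length window over a 1-D feature sequence (step A1),
    fit mean and variance per window (step A2), and flag a conversion
    point wherever adjacent windows diverge (step A3)."""
    points = []
    for start in range(0, len(features) - 2 * win, hop):
        a = features[start:start + win]
        b = features[start + win:start + 2 * win]
        d = gaussian_kl2(a.mean(), a.var() + 1e-8, b.mean(), b.var() + 1e-8)
        if d > threshold:
            points.append(start + win)  # boundary between the two windows
    return points
```

With clearly different speakers the flagged indices cluster around the true boundary; the threshold plays the same role as the system threshold discussed in the advantages section and would need tuning on real meetings.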
In the method, as shown in fig. 3, the specific process of the model-search-based method comprises:
Step B1, perform model training on each segmented section of audio;
Step B2, compute the Bayesian criterion value of the model corresponding to each section of audio;
Step B3, compare the Bayesian criterion values of the preceding and following sections of audio; when the difference is not greater than the threshold, there is no conversion point, and when the difference is greater than the threshold, a conversion point exists.
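A widely used concrete choice for the Bayesian value in steps B1-B3 is the Bayesian Information Criterion (BIC): the two sections are scored on whether one Gaussian or two separate Gaussians explain them better, with a penalty for the extra parameters. The patent does not name BIC explicitly, so the one-dimensional delta-BIC sketch below is an assumed realization, not the patented formula.

```python
import numpy as np

def delta_bic(x, y, penalty=1.0):
    """Delta-BIC score for the hypothesis that 1-D segments x and y were
    produced by two different Gaussians rather than a single one.
    A positive score suggests a conversion (speaker change) point."""
    z = np.concatenate([x, y])
    n = len(z)
    # Gaussian log-likelihood of a segment under its own ML fit, keeping
    # only the variance terms (the constants cancel between hypotheses).
    ll = lambda s: -0.5 * len(s) * np.log(s.var() + 1e-10)
    # two extra free parameters (one extra mean and variance) for the split
    return ll(x) + ll(y) - ll(z) - penalty * 0.5 * 2 * np.log(n)
```

On real audio, x and y would be sequences of acoustic feature values for the two candidate sections; the penalty weight is the usual tunable knob.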
In the method, the clustering method in step two comprises the agglomerative hierarchical clustering algorithm (AHC).
In the method, the specific process of the agglomerative hierarchical clustering algorithm (AHC) comprises:
Step C1, initialize: each sample point is its own class, giving N classes; compute the distance between each pair of classes and set a distance threshold;
Step C2, compare the minimum inter-class distance with the distance threshold; when the minimum is smaller than the threshold, execute Step C3, and when it is not smaller, stop iterating;
Step C3, merge the two classes with the minimum distance into one class, so that there are N-1 classes;
Step C4, compute the distance between each pair of the N-1 classes and return to Step C2.
In a specific implementation, the value of the distance threshold is critical for the AHC clustering algorithm: once the threshold is determined, the stopping point of the method is determined, which in turn fixes the number of classes, and that number is the number of speakers. The threshold therefore has to be obtained through repeated verification and optimization over a considerable number of meetings, and its selection is accordingly a process of continuous optimization and adjustment.
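The C1-C4 loop above can be sketched directly. This is a minimal illustration assuming Euclidean distance between class centroids; the patent fixes neither the distance measure nor the linkage, so those choices are assumptions.

```python
import numpy as np

def ahc(points, threshold):
    """Agglomerative hierarchical clustering as in steps C1-C4: start
    with one class per sample, repeatedly merge the two closest classes,
    and stop once the smallest inter-class distance reaches the threshold."""
    clusters = [[p] for p in points]  # C1: every sample is its own class
    while len(clusters) > 1:
        # C1/C4: compute all pairwise centroid distances, track the minimum
        best, bi, bj = None, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean(clusters[i], axis=0)
                cj = np.mean(clusters[j], axis=0)
                d = np.linalg.norm(ci - cj)
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best >= threshold:  # C2: stop when the minimum distance reaches the threshold
            break
        clusters[bi] = clusters[bi] + clusters[bj]  # C3: merge the two closest classes
        del clusters[bj]
    return clusters
```

Because iteration stops at the distance threshold, the number of surviving classes, and hence the estimated number of speakers, is controlled entirely by that threshold, which is exactly why the text stresses tuning it.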
In the method, the specific identification process in step three comprises: according to the number of speakers determined by clustering and the audio features associated with each participant's audio channel in the participant information, comparing and associating the audio feature information of the conference audio channels with the clustered information, thereby identifying the speaker information.
In the method, the speaker information includes a name, a location, and an area of the speaker.
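The feature comparison and association of step three can be illustrated with a small matching sketch. It assumes that each cluster and each participant channel has already been reduced to a fixed-length voiceprint vector and uses cosine similarity as the comparison measure; both the embedding step and the similarity choice are assumptions, since the patent specifies neither.

```python
import numpy as np

def identify_speakers(cluster_embeddings, participant_embeddings):
    """Match each clustered voiceprint to the closest enrolled participant
    voiceprint by cosine similarity.  cluster_embeddings maps cluster id
    to a vector; participant_embeddings maps participant name to a vector."""
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    mapping = {}
    for cluster_id, emb in cluster_embeddings.items():
        # associate the cluster with the most similar participant channel
        name = max(participant_embeddings, key=lambda n: cos(emb, participant_embeddings[n]))
        mapping[cluster_id] = name
    return mapping
```

Once a cluster is mapped to a participant, the name, location, and region can be read from that participant's entry in the conference information.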
In the method, the text files in step four comprise an individual text file converted from the audio data of each speaker and a summary text file converted from the identified audio of all speakers.
In a specific implementation, upon the conference-end mark when the meeting finishes, the audio data is automatically translated into text information by a sound-to-text conversion module with AI machine-learning capability; M+1 text files are generated according to the number M of identified speakers, with the last file being the summary text file of all speakers, which guarantees the accuracy and integrity of the text content.
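The M+1 file generation described above can be sketched as follows, assuming the sound-to-text stage has already produced time-stamped utterances per speaker. The file naming and the (timestamp, utterance) representation are illustrative assumptions.

```python
import os

def write_summary_files(speaker_texts, out_dir="."):
    """Write one transcript file per identified speaker plus a combined
    summary file (M + 1 files for M speakers).  speaker_texts maps a
    speaker name to a list of (timestamp, utterance) pairs."""
    paths = []
    for name, utterances in speaker_texts.items():
        path = os.path.join(out_dir, f"{name}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for ts, text in utterances:
                f.write(f"[{ts}] {text}\n")
        paths.append(path)
    # the last file is the combined summary of all speakers, in time order
    summary_path = os.path.join(out_dir, "summary.txt")
    all_lines = sorted(
        (ts, name, text)
        for name, utts in speaker_texts.items()
        for ts, text in utts
    )
    with open(summary_path, "w", encoding="utf-8") as f:
        for ts, name, text in all_lines:
            f.write(f"[{ts}] {name}: {text}\n")
    paths.append(summary_path)
    return paths
```

For M identified speakers this yields M individual transcripts plus one merged summary, matching the M+1 files described above.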
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention; all simple modifications, changes, and equivalent structural changes made to the above embodiment according to the technical essence of the invention still fall within the protection scope of the technical solution of the invention.

Claims (9)

1. A method for automatically generating a conference summary for a video conference, the method comprising the steps of:
step one, segmentation judgment
Segment the mixed original audio of the video conference, extract voiceprint features, compare and identify adjacent segments of audio to judge whether they come from the same speaker; when they do not, mark a conversion point between the different speakers and encode audio fragments according to the conversion points;
step two, clustering
Cluster all the segmented audio fragments, grouping the fragments that belong to the same speaker together, and mark the clustered audio data;
step three, identification
Identify the clustered audio data, determining the participant corresponding to each piece of marked audio data by combining it with the participant information of the conference;
step four, sound-to-text conversion
Convert the clustered and identified audio data into text, and generate and store the text files.
2. The method for automatically generating a conference summary for a video conference according to claim 1, wherein the segmentation judgment in step one comprises a distance-metric-based method and a model-search-based method.
3. The method for automatically generating a conference summary for a video conference according to claim 2, wherein the specific process of the distance-metric-based method comprises:
Step A1, use a sliding-window mechanism in which the window length is fixed and the window moves forward by a fixed step;
Step A2, compute the feature vectors within the window together with their mean and variance;
Step A3, check whether the feature vectors in the window obey a Gaussian distribution; when they do, there is no conversion point, and when they do not, a conversion point exists.
4. The method for automatically generating a conference summary for a video conference according to claim 2, wherein the specific process of the model-search-based method comprises:
Step B1, perform model training on each segmented section of audio;
Step B2, compute the Bayesian criterion value of the model corresponding to each section of audio;
Step B3, compare the Bayesian criterion values of the preceding and following sections of audio; when the difference is not greater than the threshold, there is no conversion point, and when the difference is greater than the threshold, a conversion point exists.
5. The method for automatically generating a conference summary for a video conference according to claim 1, wherein the clustering in step two comprises the agglomerative hierarchical clustering algorithm (AHC).
6. The method for automatically generating a conference summary for a video conference according to claim 5, wherein the specific process of the agglomerative hierarchical clustering algorithm (AHC) comprises:
Step C1, initialize: each sample point is its own class, giving N classes; compute the distance between each pair of classes and set a distance threshold;
Step C2, compare the minimum inter-class distance with the distance threshold; when the minimum is smaller than the threshold, execute Step C3, and when it is not smaller, stop iterating;
Step C3, merge the two classes with the minimum distance into one class, so that there are N-1 classes;
Step C4, compute the distance between each pair of the N-1 classes and return to Step C2.
7. The method for automatically generating a conference summary for a video conference according to claim 1, wherein the identification in step three comprises: according to the number of speakers determined by clustering and the audio features associated with each participant's audio channel in the participant information, comparing and associating the audio feature information of the conference audio channels with the clustered information, thereby identifying the speaker information.
8. The method of claim 7, wherein the speaker information comprises a name, a location, and an area of the speaker.
9. The method for automatically generating a conference summary for a video conference according to claim 8, wherein the text files in step four comprise an individual text file converted from the audio data of each speaker and a summary text file converted from the identified audio of all speakers.
CN202011077651.2A 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference Pending CN112165599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011077651.2A CN112165599A (en) 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011077651.2A CN112165599A (en) 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference

Publications (1)

Publication Number Publication Date
CN112165599A 2021-01-01

Family

ID=73867950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011077651.2A Pending CN112165599A (en) 2020-10-10 2020-10-10 Automatic conference summary generation method for video conference

Country Status (1)

Country Link
CN (1) CN112165599A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113707130A (en) * 2021-08-16 2021-11-26 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
WO2022161264A1 (en) * 2021-01-26 2022-08-04 阿里巴巴集团控股有限公司 Audio signal processing method, conference recording and presentation method, device, system, and medium
CN115100701A (en) * 2021-03-08 2022-09-23 福建福清核电有限公司 Conference speaker identity identification method based on artificial intelligence technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030048946A1 (en) * 2001-09-07 2003-03-13 Fuji Xerox Co., Ltd. Systems and methods for the automatic segmentation and clustering of ordered information
CN103530432A (en) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extracting function and speech extracting method
US9584946B1 (en) * 2016-06-10 2017-02-28 Philip Scott Lyren Audio diarization system that segments audio input
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110491392A (en) * 2019-08-29 2019-11-22 广州国音智能科技有限公司 A kind of audio data cleaning method, device and equipment based on speaker's identity


Similar Documents

Publication Publication Date Title
CN112165599A (en) Automatic conference summary generation method for video conference
Lu et al. Speaker change detection and tracking in real-time news broadcasting analysis
CN103700370A (en) Broadcast television voice recognition method and system
WO2021073116A1 (en) Method and apparatus for generating legal document, device and storage medium
WO2020238209A1 (en) Audio processing method, system and related device
CN101529500A (en) Content summarizing system, method, and program
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
US20220375492A1 (en) End-To-End Speech Diarization Via Iterative Speaker Embedding
CN112633241B (en) News story segmentation method based on multi-feature fusion and random forest model
CN101867742A (en) Television system based on sound control
Lu et al. Unsupervised speaker segmentation and tracking in real-time audio content analysis
CN101950564A (en) Remote digital voice acquisition, analysis and identification system
CN114022923A (en) Intelligent collecting and editing system
TWI769520B (en) Multi-language speech recognition and translation method and system
CN113936236A (en) Video entity relationship and interaction identification method based on multi-modal characteristics
CN116996337B (en) Conference data management system and method based on Internet of things and microphone switching technology
Imoto et al. Acoustic scene classification based on generative model of acoustic spatial words for distributed microphone array
CN110322883B (en) Voice-to-text effect evaluation optimization method
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
CN114547264A (en) News diagram data identification method based on Mahalanobis distance and comparison learning
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
Wu et al. Universal Background Models for Real-time Speaker Change Detection.
CN110400578A (en) The generation of Hash codes and its matching process, device, electronic equipment and storage medium
CN111914777B (en) Method and system for identifying robot instruction in cross-mode manner
CN117316165B (en) Conference audio analysis processing method and system based on time sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210101