CN117316165A - Conference audio analysis processing method and system based on time sequence - Google Patents
- Publication number
- CN117316165A CN117316165A CN202311586467.4A CN202311586467A CN117316165A CN 117316165 A CN117316165 A CN 117316165A CN 202311586467 A CN202311586467 A CN 202311586467A CN 117316165 A CN117316165 A CN 117316165A
- Authority
- CN
- China
- Prior art keywords
- voice
- phrase
- signal
- prepared
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a conference audio analysis processing method and system based on time sequence, relating to the technical field of conference management and comprising the following steps: acquiring conference audio; carrying out noise reduction processing on the voice signal; preprocessing the voice signal; acquiring a first characteristic range of the human voice in the frequency domain, and removing the part of at least one basic sinusoidal signal outside the first characteristic range; performing an inverse Fourier transform on at least one modified sinusoidal signal to obtain the separated human voice; carrying out voice recognition on the human voice to generate characters, wherein the characters are interleaved, disordered text produced by multiple speakers; and establishing a word rearrangement model by deep learning, and classifying the characters generated by different people with the word rearrangement model to obtain at least one character time sequence set. By arranging the voice signal separation module, the voice signal recognition module and the word processing module, the characters generated by different people are distinguished and summarized according to time sequence, so that at least one character time sequence set is obtained.
Description
Technical Field
The invention relates to the technical field of conference management, in particular to a time sequence-based conference audio analysis processing method and system.
Background
As technology advances, many products for automatically recording conference content are continually being introduced, from the earliest recorders to automated voice-to-text equipment. The content captured by these recording methods is voluminous, because a conference often continues for several hours, so reviewing or retrieving meeting records is time-consuming and labor-intensive. Some advanced products tag meeting participants with biometric features such as voiceprints or fingerprints and then quickly locate meeting recordings by these tags, but they are still not efficient.
The difficulty of conference audio analysis is that multiple people may speak at the same time, and current speech recognition can only separate the human voice from sounds that differ greatly from it; it cannot separate and classify the voices of different speakers, so text cannot be generated by classified recognition.
Disclosure of Invention
In order to solve the above technical problems, the present technical scheme provides a time sequence-based conference audio analysis processing method and system, which solves the problems presented in the background art: multiple people may speak at the same time, and current speech recognition can only separate the human voice from sounds that differ greatly from it, cannot separate and classify different voices, and therefore cannot generate text by classified recognition.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a conference audio analysis processing method based on time sequence comprises the following steps:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
the method comprises the steps of obtaining a voice signal, preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasis of the voice signal, framing of the voice signal, windowing of the voice signal, fast Fourier transform of the voice signal, decomposing of the voice signal into at least one basic sinusoidal signal, frequency domain information forming of the at least one basic sinusoidal signal, and determination of the basic sinusoidal signal by amplitude, phase and frequency, wherein the pre-emphasis compensates for a damaged signal, high-frequency components of the signal are enhanced at the beginning end of a transmission line, and excessive attenuation of the high-frequency components in the transmission process is compensated;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing an inverse Fourier transform on at least one modified sinusoidal signal to obtain the separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved, disordered text produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
Preferably, the framing the voice signal and windowing the voice signal includes the following steps:
determining the frame length, wherein the frame length is 15 ms to 25 ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing the voice signal into at least one voice segment according to the frame length to obtain an original voice sequence v (n) of the voice segment in the time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window.
Preferably, performing the fast Fourier transform on the voice signal includes the following steps:
obtaining at least one windowed speech segment f(x) of the voice signal, and processing the at least one speech segment f(x) using a Fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^{-ixt} dt
where F(x) is the basic sinusoidal signal, i is the imaginary unit, and f(t) is the function of the windowed speech segment.
Preferably, the step of obtaining the first characteristic range of the human voice in the frequency domain and removing the part of the at least one basic sinusoidal signal outside the first characteristic range includes the following steps:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
defining a transition signal function G(x), the transition signal function G(x) being initially identically 0; for any point c within the definition domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the definition domain of the basic sinusoidal signal F(x), the value resetting of the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal.
Preferably, performing an inverse Fourier transform on the at least one modified sinusoidal signal to obtain the separated human voice comprises the following steps:
acquiring at least one modified sinusoidal signal, and performing an inverse Fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = (1/2π) ∫_{-∞}^{+∞} G(t) e^{ixt} dt
where G(t) is the modified sinusoidal signal, i is the imaginary unit, and g(x) is the function of the modified speech segment.
Preferably, the voice recognition of the human voice includes the following steps:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
and regarding the recognized characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the recognized characters.
Preferably, the text rearrangement model building by deep learning comprises the following steps:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
and obtaining all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model.
Preferably, the text rearrangement model classifies the text generated by different people, including the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
the characters with adjacent time points but different time points are used as test groups, the sequence of the characters in the test groups is arranged according to time sequence, whether daily phrases consistent with the test groups exist or not is searched in a character rearrangement model, if not, the test groups are deleted, and if so, the test groups are changed into prepared phrases;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judging time, deleting the first prepared phrase pairing group in the pairing set to obtain a second character time sequence set;
the above steps are repeated until the paired set becomes an empty set.
The conference audio analysis processing system based on time sequence is used for realizing the conference audio analysis processing method based on time sequence, and comprises the following modules:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
Compared with the prior art, the invention has the beneficial effects that:
By arranging the voice signal separation module, the voice signal recognition module and the word processing module, the human voice is separated out and recognized to produce the text spoken by multiple people; the text generated by different people is distinguished according to the word rearrangement model and summarized according to time sequence to obtain at least one character time sequence set, thereby solving the problems that current speech recognition can only separate the human voice from sounds that differ greatly from it, cannot separate and classify different voices, and cannot generate text by classified recognition.
Drawings
FIG. 1 is a schematic flow chart of a method for analyzing and processing conference audio based on time sequence;
FIG. 2 is a schematic diagram of a process of framing and windowing a speech signal according to the present invention;
FIG. 3 is a schematic flow chart of a part of the method for obtaining the first characteristic range of the human voice in the frequency domain and rejecting at least one basic sinusoidal signal outside the first characteristic range;
FIG. 4 is a schematic diagram of a voice recognition process for human voice according to the present invention;
FIG. 5 is a schematic diagram of a process for creating a text rearrangement model using deep learning according to the present invention;
FIG. 6 is a schematic diagram of a text rearrangement model for classifying text generated by different persons according to the present invention.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to fig. 1, a method for analyzing and processing conference audio based on time sequence includes:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
the method comprises the steps of obtaining a voice signal, preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasis of the voice signal, framing of the voice signal, windowing of the voice signal, fast Fourier transform of the voice signal, decomposing of the voice signal into at least one basic sinusoidal signal, frequency domain information forming of the at least one basic sinusoidal signal, and determination of the basic sinusoidal signal by amplitude, phase and frequency, wherein the pre-emphasis compensates for a damaged signal, high-frequency components of the signal are enhanced at the beginning end of a transmission line, and excessive attenuation of the high-frequency components in the transmission process is compensated;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing an inverse Fourier transform on at least one modified sinusoidal signal to obtain the separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved, disordered text produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
Referring to fig. 2, framing a voice signal, and windowing the voice signal includes the steps of:
determining the frame length, wherein the frame length is 15 ms to 25 ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing a voice signal into at least one voice segment according to a frame length to obtain an original voice sequence v (n) of the voice segment in a time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window;
the speech signal is framed and windowed in order to make the speech signal smoother.
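The framing-and-windowing procedure above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the 8 kHz toy signal, and the choice of a triangular window are assumptions, while the frame length and shift are taken from the ranges given in the text (15-25 ms frames, shift of one half to three quarters of the frame length).

```python
def frame_signal(samples, sample_rate, frame_ms=20, shift_ratio=0.5):
    """Split samples into overlapping frames; the frame shift is smaller
    than the frame length, so adjacent frames share an overlapped part."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(frame_len * shift_ratio)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

def triangular_window(n):
    """Triangular window w(n) of length n, peaking at the centre."""
    half = (n - 1) / 2
    return [1.0 - abs((k - half) / half) for k in range(n)]

def window_frames(frames):
    """Multiply each original sequence v(n) by the window function w(n)."""
    if not frames:
        return []
    w = triangular_window(len(frames[0]))
    return [[v * wk for v, wk in zip(frame, w)] for frame in frames]

# Example: 100 ms of a dummy signal at 8 kHz -> 20 ms frames, 10 ms shift.
signal = [1.0] * 800
frames = frame_signal(signal, sample_rate=8000)
windowed = window_frames(frames)
```

The overlap between adjacent frames is what makes the windowed signal smoother, as the text notes.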
Performing the fast Fourier transform on the speech signal comprises the following steps:
obtaining at least one windowed speech segment f(x) of the speech signal, and processing the at least one speech segment f(x) using a Fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^{-ixt} dt
where F(x) is the basic sinusoidal signal, i is the imaginary unit, and f(t) is the function of the windowed speech segment;
the fast fourier transform is performed on the speech signal to generate a plurality of basic sinusoidal signals, the basic sinusoidal signals are sinusoidal functions, and the data processing of the sinusoidal functions is a known means, so that the difficulty in processing the speech signal can be reduced.
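As a concrete illustration of the decomposition into basic sinusoidal signals, the following is a minimal discrete sketch in Python. The patent specifies no implementation; the function name and the toy tone are invented, and a naive O(n²) DFT stands in for the fast Fourier transform.

```python
import cmath
import math

def dft(frame):
    """Discrete analogue of F(x) = integral of f(t)*e^(-ixt) dt:
    decompose the windowed frame into components, each determined by
    amplitude, phase and frequency (one complex value per frequency bin)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * x * t / n)
                for t in range(n))
            for x in range(n)]

# A pure 5-cycle sinusoid concentrates its energy in bins 5 and n-5,
# illustrating that each bin corresponds to one basic sinusoidal signal.
n = 64
tone = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
spectrum = dft(tone)
peak_bin = max(range(n), key=lambda x: abs(spectrum[x]))
```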
Referring to fig. 3, obtaining a first characteristic range of a human voice in a frequency domain, and removing a portion of at least one basic sinusoidal signal outside the first characteristic range includes the following steps:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
defining a transition signal function G(x), the transition signal function G(x) being initially identically 0; for any point c within the definition domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the definition domain of the basic sinusoidal signal F(x), the value resetting of the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal;
the human voice generally has a predetermined range in the frequency domain, and sound waves not in the range are other sound waves and need to be removed, so that the part of the basic sinusoidal signal out of the range is removed.
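A minimal sketch of this removal step, reading the text as a frequency-domain test: a component survives only if its frequency lies inside a human-voice range (a, b). The concrete range and all names below are illustrative assumptions, not values from the patent.

```python
def build_transition(freqs, components, a, b):
    """Transition-signal sketch: G starts as all zeros, and a component is
    copied into G only when its frequency lies inside (a, b); everything
    outside the first characteristic range stays 0 and is thus removed."""
    g = [0.0] * len(components)
    for k, (f, amp) in enumerate(zip(freqs, components)):
        if a < f < b:
            g[k] = amp
    return g

# Hypothetical components at 50, 300, 1000 and 9000 Hz; an assumed voice
# range of (85, 3400) Hz keeps only the middle two.
freqs = [50, 300, 1000, 9000]
components = [0.2, 0.9, 0.7, 0.4]
g = build_transition(freqs, components, 85, 3400)
```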
Performing an inverse Fourier transform on the at least one modified sinusoidal signal to obtain the separated human voice comprises the following steps:
acquiring at least one modified sinusoidal signal, and performing an inverse Fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = (1/2π) ∫_{-∞}^{+∞} G(t) e^{ixt} dt
where G(t) is the modified sinusoidal signal, i is the imaginary unit, and g(x) is the function of the modified speech segment;
the processed corrected sinusoidal signal is recombined into a new signal, namely, the part of the non-human voice is removed, so that the separated human voice is obtained.
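The recombination step can be sketched with a discrete inverse transform; the round trip below shows that inverting the transform recovers the time-domain segment, so applying the inverse after zeroing non-voice components yields the filtered (separated) segment. Names and the toy frame are illustrative.

```python
import cmath

def dft(frame):
    """Forward transform, repeated here so the sketch is self-contained."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * x * t / n)
                for t in range(n)) for x in range(n)]

def idft(spectrum):
    """Discrete analogue of g(x) = (1/2*pi) * integral of G(t)*e^(ixt) dt:
    recombine the (possibly corrected) sinusoidal components into a
    time-domain speech segment."""
    n = len(spectrum)
    return [sum(spectrum[t] * cmath.exp(2j * cmath.pi * x * t / n)
                for t in range(n)).real / n
            for x in range(n)]

# Round trip on a small frame: transforming and inverting recovers it.
frame = [0.0, 1.0, 0.0, -1.0]
recovered = idft(dft(frame))
```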
Referring to fig. 4, the voice recognition of the human voice includes the steps of:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
regarding the identified characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the identified characters;
when a person makes a sound, each word has very small pauses, and the sound wave is small and can be used as a word spacing point, so that the sound wave between adjacent word spacing points corresponds to one word, and therefore, recognition can be performed.
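The pause-based segmentation described here can be sketched as splitting the sampled waveform H(x) at its zero-valued points. This is a simplified model of the text's word spacing points (real recognizers use more robust endpointing); the sample values are invented.

```python
def split_at_zero_points(samples):
    """Split a voice waveform H(x) at the points where its value is 0,
    i.e. the small pauses between words, so that each non-zero run
    corresponds to the sound wave of one word."""
    segments, current = [], []
    for s in samples:
        if s == 0:
            if current:           # a zero ends the current word, if any
                segments.append(current)
                current = []
        else:
            current.append(s)
    if current:
        segments.append(current)
    return segments

# Two "words" separated by a silent (zero-valued) gap.
wave = [0.3, 0.5, 0.2, 0, 0, 0.4, 0.6]
words = split_at_zero_points(wave)
```

Each returned segment would then be matched against the acoustic wave functions in the recognition model.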
Referring to fig. 5, the creation of a text rearrangement model using deep learning includes the steps of:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
acquiring all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model;
the word rearrangement model mainly establishes the matching relation of the words according to the using habit of the words, the matching is divided into two stages, firstly, the word group is established, and secondly, the word group collocation is used for generating sentences step by step.
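A minimal sketch of the two-stage model store, with whitespace-separated words standing in for "daily phrases" (an assumption made for illustration; the sample sentences are invented):

```python
def build_rearrangement_model(sentences):
    """From a daily-dialogue sample set, cut each sentence into phrases
    and store (a) every phrase and (b) every adjacent phrase pair in the
    order the phrases occur in the sentence."""
    phrases, pairs = set(), set()
    for sentence in sentences:
        parts = sentence.split()
        phrases.update(parts)
        for left, right in zip(parts, parts[1:]):
            pairs.add((left, right))      # kept in sentence order
    return phrases, pairs

phrases, pairs = build_rearrangement_model(
    ["good morning everyone", "morning everyone"])
```

The stored pairs are what later lets the classifier test whether two time-adjacent fragments plausibly came from the same speaker.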
Referring to fig. 6, the text rearrangement model classifies text generated by different persons, comprising the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
the characters with adjacent time points but different time points are used as test groups, the sequence of the characters in the test groups is arranged according to time sequence, whether daily phrases consistent with the test groups exist or not is searched in a character rearrangement model, if not, the test groups are deleted, and if so, the test groups are changed into prepared phrases;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judgment time and deleting it from the pairing set, so as to obtain a second text time sequence set;
repeating the previous step until the pairing set becomes an empty set;
the word set is generated by multiple people, so the characters are interleaved; firstly, the characters are paired into phrases in time order, and characters spoken by different people have no logical connection and therefore cannot be paired into a phrase; secondly, the phrases are paired into sentences, and phrases spoken by different people likewise have no logical connection and therefore cannot be paired into a sentence;
here, when the character set generates at least one prepared phrase pairing group, each prepared phrase inside a sentence appears in two prepared phrase pairing groups: the prepared phrase that precedes it in the sentence pairs with it once, and the prepared phrase that follows it pairs with it once. The prepared phrase pairing groups therefore need to be fused into sentences, so that the sentences spoken by different people are obtained.
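The pairing-and-fusion procedure described above can be sketched in Python. This is a hypothetical illustration only, not the patented implementation: representing a prepared phrase pairing group as a (judgment time, phrase list) tuple and the function name `fuse_pairing_groups` are assumptions.

```python
def fuse_pairing_groups(pairing_set):
    """Fuse prepared phrase pairing groups into sentences.

    pairing_set: list of (judgment_time, [head_phrase, tail_phrase]) tuples.
    Returns one fused phrase chain (sentence) per utterance.
    """
    sentences = []
    # process groups in order of increasing judgment time
    remaining = sorted(pairing_set, key=lambda group: group[0])
    while remaining:
        # the group with the smallest judgment time becomes the first group
        _, first = remaining.pop(0)
        chain = list(first)
        merged = True
        while merged:
            merged = False
            for i, (_, second) in enumerate(remaining):
                # preset condition: the second group's head phrase is
                # consistent with the first group's tail phrase
                if second[0] == chain[-1]:
                    chain.extend(second[1:])  # drop duplicated head, dock at tail
                    remaining.pop(i)
                    merged = True
                    break
        sentences.append(chain)
    return sentences
```

Because a phrase in the middle of a sentence tails one pairing group and heads the next, chaining on matching head/tail phrases reconstructs each speaker's sentence in time order.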
The conference audio analysis processing system based on time sequence is used for realizing the conference audio analysis processing method based on time sequence, and comprises the following modules:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
The conference audio analysis processing system based on time sequence has the following working procedures:
step one: the voice signal processing module acquires conference audio which is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
step two: the voice signal noise reduction module performs noise reduction processing on the voice signal;
step three: the voice signal preprocessing module acquires a voice signal, performs preprocessing on the voice signal, sequentially performs pre-emphasis on the voice signal, frames the voice signal, windows the voice signal, performs fast Fourier transform on the voice signal, and decomposes the voice signal into at least one basic sinusoidal signal which forms frequency domain information;
step four: the voice signal correction module obtains a first characteristic range of the voice on a frequency domain, and eliminates the part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
step five: the voice signal separation module performs inverse Fourier transform on at least one corrected sinusoidal signal to obtain separated human voice;
step six: the voice signal recognition module carries out voice recognition on the human voice to generate characters, wherein the characters are interleaved and disordered characters generated by multiple people;
step seven: the word processing module uses deep learning to establish a word rearrangement model, and the word rearrangement model classifies words generated by different people to obtain at least one word time sequence set;
step eight: and the information management module is used for carrying out information management on the text time sequence set to generate a meeting summary.
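Steps three to five can be sketched end to end with NumPy. This is a minimal illustration under stated assumptions: the band limits (85-3400 Hz, a common human-voice range) and the pre-emphasis coefficient 0.97 are not values given in the patent.

```python
import numpy as np

def separate_voice(signal, sr, band=(85.0, 3400.0), alpha=0.97):
    """Pre-emphasize, transform, band-limit, and inverse-transform a signal."""
    # step three: pre-emphasis boosts high frequencies, y[n] = x[n] - alpha*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # step three: fast Fourier transform into basic sinusoidal components
    spectrum = np.fft.rfft(emphasized)
    freqs = np.fft.rfftfreq(len(emphasized), d=1.0 / sr)
    # step four: zero every component outside the first characteristic range (a, b)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    corrected = np.where(mask, spectrum, 0.0)
    # step five: inverse transform recovers the separated human voice
    return np.fft.irfft(corrected, n=len(emphasized))
```

The corrected spectrum keeps only in-band sinusoidal components, so the inverse transform returns a signal dominated by the human-voice range.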
Still further, the present disclosure provides a storage medium having a computer readable program stored thereon; when invoked, the computer readable program performs the above-described time-sequence-based conference audio analysis processing method.
It is understood that the storage medium may be a magnetic medium, e.g., a floppy disk, hard disk, or magnetic tape; an optical medium, such as a DVD; or a semiconductor medium, such as a solid state disk (Solid State Disk, SSD), etc.
In summary, the invention has the following advantages: by providing the voice signal separation module, the voice signal recognition module and the word processing module, the human voice is separated out and recognized to produce the characters spoken by multiple people; the characters produced by different people are distinguished according to the word rearrangement model and collected in time order to obtain at least one text time sequence set. This solves the problem that current speech recognition can only separate the human voice from sounds that differ greatly from it, cannot distinguish and classify different human voices, and therefore cannot classify the recognized characters by speaker.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (9)
1. A method for analyzing and processing conference audio based on time sequence, which is characterized by comprising the following steps:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
acquiring the voice signal and preprocessing the voice signal, wherein the preprocessing flow sequentially comprises: pre-emphasizing the voice signal, framing the voice signal, windowing the voice signal, and performing fast Fourier transform on the voice signal to decompose the voice signal into at least one basic sinusoidal signal; the at least one basic sinusoidal signal forms frequency domain information, and a basic sinusoidal signal is determined by amplitude, phase and frequency; the pre-emphasis compensates for a damaged signal by enhancing the high-frequency components of the signal at the transmitting end of the transmission line, compensating for the excessive attenuation of the high-frequency components during transmission;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing inverse fourier transform on at least one modified sinusoidal signal to obtain separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved and disordered characters generated by multiple persons;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
2. The method for analyzing and processing conference audio based on time sequence according to claim 1, wherein framing the voice signal and windowing the voice signal comprises the steps of:
determining the frame length, wherein the frame length is 15 ms-25 ms;
determining the frame shift, wherein the time difference between the starting positions of two adjacent frames is called the frame shift; the frame shift ensures that two adjacent frames have an overlapping part, and the frame shift is one half to three quarters of the frame length;
dividing the voice signal into at least one speech segment according to the frame length to obtain the original voice sequence v(n) of a speech segment in the time domain, and multiplying the original voice sequence v(n) by a moving window function w(n) to complete windowing, wherein the window function can be a rectangular window or a triangular window.
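A minimal sketch of the framing and windowing steps, assuming a 20 ms frame (within the claimed 15-25 ms range), a frame shift of one half of the frame length, and a triangular window (one of the window shapes the claim allows); the function name and parameters are illustrative.

```python
import numpy as np

def frame_and_window(v, sr, frame_ms=20, shift_ratio=0.5):
    """Split the original voice sequence v(n) into overlapping frames and
    multiply each frame by a moving window function w(n)."""
    frame_len = int(sr * frame_ms / 1000)        # frame length in samples
    frame_shift = int(frame_len * shift_ratio)   # one half of the frame length
    window = np.bartlett(frame_len)              # triangular window w(n)
    frames = [v[start:start + frame_len] * window
              for start in range(0, len(v) - frame_len + 1, frame_shift)]
    return np.array(frames)
```

Because the frame shift is half the frame length, adjacent frames overlap, so no part of the signal is lost at frame boundaries.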
3. A method of time-series based conference audio analysis processing according to claim 2, wherein said performing a fast fourier transform on the speech signal comprises the steps of:
obtaining at least one windowed speech segment f (x) of the speech signal, processing the at least one speech segment f (x) using a fourier transform;
the fourier transform is as follows:

F(x) = ∫_{-∞}^{+∞} f(t) e^{-ixt} dt ,

where F(x) is the basic sinusoidal signal, i is the imaginary unit, and f(t) is the time-domain function of the speech segment.
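The discrete analogue of this transform can be demonstrated with NumPy's FFT; the 440 Hz test tone, sample rate and segment length are arbitrary illustrative choices, not values from the claim.

```python
import numpy as np

sr = 8000                                     # assumed sample rate
t = np.arange(400) / sr                       # one 50 ms speech segment
segment = 0.7 * np.sin(2 * np.pi * 440 * t)   # f(x): a single 440 Hz component

spectrum = np.fft.rfft(segment)               # discrete Fourier transform
freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)

# each frequency bin is one basic sinusoidal signal, determined by
# amplitude, phase and frequency
amplitudes = np.abs(spectrum) * 2 / len(segment)
phases = np.angle(spectrum)
peak_freq = freqs[np.argmax(amplitudes)]      # recovers the 440 Hz component
```

Since 440 Hz falls exactly on a frequency bin here (bin width 20 Hz), the component's amplitude and frequency are recovered exactly.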
4. A method of analyzing and processing conference audio based on time sequence according to claim 3, wherein the step of obtaining a first characteristic range of the voice in the frequency domain and removing the portion of the at least one basic sinusoidal signal outside the first characteristic range comprises the steps of:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
setting a transition signal function G(x) that is initially identically 0; for any point c within the definition domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
after c has traversed the definition domain of the basic sinusoidal signal F(x), the value resetting of the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal.
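The point-by-point value resetting of claim 4 can be written directly. In this sketch, F is any callable, and the sampled definition domain and the bounds a, b are illustrative assumptions.

```python
import numpy as np

def transition_signal(F, domain, a, b):
    """Build the transition signal function G over a sampled definition domain.

    G is initially identically 0; for each point c, G(c) stays 0 when F(c)
    is outside (a, b) and takes the value F(c) when F(c) lies inside (a, b).
    """
    G = np.zeros(len(domain))      # initially constant 0
    for idx, c in enumerate(domain):
        value = F(c)
        if a < value < b:          # F(c) belongs to (a, b)
            G[idx] = value
    return G
```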
5. The method of time-series-based conference audio analysis processing according to claim 4, wherein said performing inverse fourier transform on at least one modified sinusoidal signal to obtain a separated voice comprises the steps of:
acquiring at least one modified sinusoidal signal, and performing inverse fourier transform on the at least one modified sinusoidal signal;
obtaining a function g(x) of at least one corrected speech segment, and combining the g(x) of the at least one corrected speech segment to obtain the human voice;
the inverse fourier transform is as follows:

g(x) = (1/2π) ∫_{-∞}^{+∞} G(t) e^{ixt} dt ,

where G(t) is the corrected sinusoidal signal, i is the imaginary unit, and g(x) is the time-domain function of the corrected speech segment.
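A round-trip with NumPy illustrates the inverse transform: a corrected spectrum is transformed back into a time-domain segment g(x). The no-op "correction" keeps the example short and is an assumption; a real correction would zero out-of-band bins as in claim 4.

```python
import numpy as np

sr, n = 8000, 400
t = np.arange(n) / sr
original = np.sin(2 * np.pi * 440 * t)   # a time-domain speech segment

spectrum = np.fft.rfft(original)         # forward transform (claim 3)
corrected = spectrum.copy()              # stand-in for the corrected sinusoidal signal
g = np.fft.irfft(corrected, n=n)         # inverse transform recovers g(x)
```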
6. The method of time-series based conference audio analysis processing according to claim 5, wherein said voice recognition of human voice comprises the steps of:
establishing a voice recognition model, wherein the voice recognition model comprises single characters with their corresponding first acoustic wave functions, and words of at least one character with their corresponding second acoustic wave functions;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
searching the voice recognition model for a third acoustic wave function with the minimum difference from the segmentation function; if the third acoustic wave function is a first acoustic wave function, taking the character corresponding to that first acoustic wave function as the character recognized from the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking the at least one character corresponding to that second acoustic wave function as the at least one character recognized from the segmentation function;
and regarding the recognized characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the recognized characters.
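The segmentation at character spacing points (samples where H(x) is 0) can be sketched as follows; the tolerance `eps` for detecting a zero in sampled data is an assumption not stated in the claim.

```python
import numpy as np

def split_at_spacing_points(H, eps=1e-9):
    """Split the sampled voice function H at character spacing points
    (samples where H is 0), yielding one segmentation function per unit."""
    segments, current = [], []
    for sample in H:
        if abs(sample) <= eps:        # a spacing point ends the current segment
            if current:
                segments.append(np.array(current))
                current = []
        else:
            current.append(sample)
    if current:                       # flush the trailing segment
        segments.append(np.array(current))
    return segments
```

Each returned segment would then be compared against the model's acoustic wave functions to pick the closest match.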
7. The method of time-series based conference audio analysis processing according to claim 6, wherein said using deep learning to build a text rearrangement model comprises the steps of:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
and obtaining all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model.
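A minimal sketch of building the text rearrangement model from a daily dialogue sample set: store every daily phrase and every adjacent-phrase pairing group, ordered as in the source sentence. The set-based storage is an assumption; the claim only says the phrases and pairing groups are stored in the model, and the deep-learning aspect is omitted here.

```python
def build_rearrangement_model(sentences):
    """sentences: list of sentences, each pre-cut into a list of daily phrases.

    Returns (phrase_set, pair_set): all daily phrases, and all daily phrase
    pairing groups of adjacent phrases, ordered as they occur in the sentence.
    """
    phrases, pairs = set(), set()
    for sentence in sentences:
        phrases.update(sentence)
        # adjacent daily phrases form an ordered pairing group
        for left, right in zip(sentence, sentence[1:]):
            pairs.add((left, right))
    return phrases, pairs
```

Because the pairing groups are ordered, the model distinguishes "A then B" from "B then A", which is what lets the later lookup reject out-of-order test groups.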
8. The method of claim 7, wherein the word rearrangement model classifies words generated by different persons, comprising the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
characters whose time points are adjacent but not identical are taken as a test group, the characters in the test group are arranged in time order, the character rearrangement model is searched for a daily phrase consistent with the test group, if none exists, the test group is deleted, and if one exists, the test group becomes a prepared phrase;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
prepared phrases with adjacent time sequence times are taken as a test group, the prepared phrases in the test group are arranged in time order, the word rearrangement model is searched for a daily phrase pairing group consistent with the test group, if none exists, the test group is deleted, and if one exists, the test group is taken as a prepared phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting the first prepared phrase pairing group with the smallest judgment time, deleting it from the pairing set, and selecting a second prepared phrase pairing group that meets a preset condition: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is docked to the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judgment time and deleting it from the pairing set, so as to obtain a second text time sequence set;
the above steps are repeated until the paired set becomes an empty set.
9. A time-series-based conference audio analysis processing system for implementing the time-series-based conference audio analysis processing method according to any one of claims 1 to 8, comprising:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311586467.4A CN117316165B (en) | 2023-11-27 | 2023-11-27 | Conference audio analysis processing method and system based on time sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311586467.4A CN117316165B (en) | 2023-11-27 | 2023-11-27 | Conference audio analysis processing method and system based on time sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117316165A true CN117316165A (en) | 2023-12-29 |
CN117316165B CN117316165B (en) | 2024-02-20 |
Family
ID=89281394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311586467.4A Active CN117316165B (en) | 2023-11-27 | 2023-11-27 | Conference audio analysis processing method and system based on time sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117316165B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009156773A1 (en) * | 2008-06-27 | 2009-12-30 | Monting-I D.O.O. | Device and procedure for recognizing words or phrases and their meaning from digital free text content |
US20190392837A1 (en) * | 2018-06-22 | 2019-12-26 | Microsoft Technology Licensing, Llc | Use of voice recognition to generate a transcript of conversation(s) |
CN111326160A (en) * | 2020-03-11 | 2020-06-23 | 南京奥拓电子科技有限公司 | Speech recognition method, system and storage medium for correcting noise text |
CN113380234A (en) * | 2021-08-12 | 2021-09-10 | 明品云(北京)数据科技有限公司 | Method, device, equipment and medium for generating form based on voice recognition |
CN113505609A (en) * | 2021-05-28 | 2021-10-15 | 引智科技(深圳)有限公司 | One-key auxiliary translation method for multi-language conference and equipment with same |
US20220237379A1 (en) * | 2019-05-20 | 2022-07-28 | Samsung Electronics Co., Ltd. | Text reconstruction system and method thereof |
CN115273821A (en) * | 2022-07-13 | 2022-11-01 | 平顶山学院 | Speech recognition system based on semantic understanding of computer application scene |
Also Published As
Publication number | Publication date |
---|---|
CN117316165B (en) | 2024-02-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109065031B (en) | Voice labeling method, device and equipment | |
CN111524527B (en) | Speaker separation method, speaker separation device, electronic device and storage medium | |
US10515292B2 (en) | Joint acoustic and visual processing | |
CN108305632A (en) | A kind of the voice abstract forming method and system of meeting | |
CN111128223A (en) | Text information-based auxiliary speaker separation method and related device | |
WO2022100692A1 (en) | Human voice audio recording method and apparatus | |
JP2005049859A (en) | Method and device for automatically recognizing audio data | |
CN112633241B (en) | News story segmentation method based on multi-feature fusion and random forest model | |
CN111462758A (en) | Method, device and equipment for intelligent conference role classification and storage medium | |
CN112712824A (en) | Crowd information fused speech emotion recognition method and system | |
US7349477B2 (en) | Audio-assisted video segmentation and summarization | |
CN111488813B (en) | Video emotion marking method and device, electronic equipment and storage medium | |
EP4392972A1 (en) | Speaker-turn-based online speaker diarization with constrained spectral clustering | |
JP6208794B2 (en) | Conversation analyzer, method and computer program | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
CN112259085A (en) | Two-stage voice awakening algorithm based on model fusion framework | |
Venkatesan et al. | Automatic language identification using machine learning techniques | |
CN115150660A (en) | Video editing method based on subtitles and related equipment | |
CN117316165B (en) | Conference audio analysis processing method and system based on time sequence | |
CN114022923A (en) | Intelligent collecting and editing system | |
CN112927723A (en) | High-performance anti-noise speech emotion recognition method based on deep neural network | |
CN117151047A (en) | Conference summary generation method based on AI identification | |
Chou et al. | Bird species recognition by wavelet transformation of a section of birdsong | |
KR101369270B1 (en) | Method for analyzing video stream data using multi-channel analysis | |
CN115831124A (en) | Conference record role separation system and method based on voiceprint recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||