CN117316165A - Conference audio analysis processing method and system based on time sequence - Google Patents

Conference audio analysis processing method and system based on time sequence

Info

Publication number
CN117316165A
CN117316165A
Authority
CN
China
Prior art keywords
voice
phrase
signal
prepared
function
Prior art date
Legal status
Granted
Application number
CN202311586467.4A
Other languages
Chinese (zh)
Other versions
CN117316165B (en)
Inventor
刘耀明
翟立志
Current Assignee
Shenzhen Cloudwinner Network Technology Co ltd
Original Assignee
Shenzhen Cloudwinner Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Cloudwinner Network Technology Co ltd filed Critical Shenzhen Cloudwinner Network Technology Co ltd
Priority to CN202311586467.4A priority Critical patent/CN117316165B/en
Publication of CN117316165A publication Critical patent/CN117316165A/en
Application granted granted Critical
Publication of CN117316165B publication Critical patent/CN117316165B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a conference audio analysis processing method and system based on time sequence, relating to the technical field of conference management and comprising the following steps: acquiring conference audio; carrying out noise reduction processing on the voice signal; preprocessing the voice signal; acquiring a first characteristic range of the human voice in the frequency domain and removing the portion of at least one basic sinusoidal signal outside the first characteristic range; performing an inverse Fourier transform on at least one corrected sinusoidal signal to obtain the separated human voice; carrying out voice recognition on the human voice to generate characters, wherein the characters are interleaved and jumbled characters produced by multiple speakers; and establishing a word rearrangement model by deep learning and using the word rearrangement model to classify the characters produced by different people, obtaining at least one text time sequence set. By providing a voice signal separation module, a voice signal recognition module and a word processing module, the characters produced by different people are distinguished and summarized in time order, so that at least one text time sequence set is obtained.

Description

Conference audio analysis processing method and system based on time sequence
Technical Field
The invention relates to the technical field of conference management, in particular to a time sequence-based conference audio analysis processing method and system.
Background
As technology advances, products for automatically recording conference content are continually being introduced, from the earliest voice recorders to automated speech-to-text equipment. Because a conference often lasts several hours, the amount of content recorded by these methods is very large, which makes reviewing or retrieving meeting records time-consuming and labor-intensive. Some advanced products tag meeting participants with biometric features such as voiceprints or fingerprints and then quickly locate meeting recordings by these tags, but they are still not efficient.
The difficulty of conference audio analysis is that several people may speak at the same time; current speech recognition can only separate human voice from sounds that differ greatly from human voice, and cannot separate different human voices into classes, so text cannot be generated by recognizing each class separately.
Disclosure of Invention
In order to solve the above technical problems, the present technical scheme provides a time-sequence-based conference audio analysis processing method and system, addressing the problems presented in the background art: several people may speak at the same time, current speech recognition can only separate human voice from sounds that differ greatly from human voice, and different human voices cannot be extracted and classified, so characters cannot be generated by classified recognition.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a conference audio analysis processing method based on time sequence comprises the following steps:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
acquiring the voice signal and preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasizing the voice signal, framing the voice signal, windowing the voice signal, and performing a fast Fourier transform on the voice signal to decompose the voice signal into at least one basic sinusoidal signal; the at least one basic sinusoidal signal forms the frequency-domain information, and each basic sinusoidal signal is determined by its amplitude, phase and frequency; the pre-emphasis compensates the impaired signal by enhancing the high-frequency components of the signal at the transmitting end of the transmission line, so as to compensate for the excessive attenuation of the high-frequency components during transmission;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing inverse fourier transform on at least one modified sinusoidal signal to obtain separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved and jumbled characters produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
Preferably, the framing the voice signal and windowing the voice signal includes the following steps:
determining the frame length of the frame, wherein the frame length of the frame is 15ms-25ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing the voice signal into at least one voice segment according to the frame length to obtain an original voice sequence v (n) of the voice segment in the time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window.
Preferably, the performing the fast fourier transform on the voice signal includes the steps of:
obtaining at least one windowed speech segment f (x) of the speech signal, processing the at least one speech segment f (x) using a fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^(-2πixt) dt
where F(x) is a basic sinusoidal signal, i is the unit imaginary number, and f(t) is the function of the speech segment.
Preferably, the step of obtaining the first characteristic range of the human voice in the frequency domain and removing the part of the at least one basic sinusoidal signal outside the first characteristic range includes the following steps:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
setting a transition signal function G(x), wherein the transition signal function G(x) is initially identically 0; for any point c within the domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the domain of the basic sinusoidal signal F(x), the assignment of values to the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal.
Preferably, said performing an inverse fourier transform on the at least one modified sinusoidal signal to obtain a separated human voice comprises the steps of:
acquiring at least one modified sinusoidal signal, and performing inverse fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = ∫_{-∞}^{+∞} G(t) e^(2πixt) dt
where G(t) is the modified sinusoidal signal, i is the unit imaginary number, and g(x) is the function of the modified speech segment.
Preferably, the voice recognition of the human voice includes the following steps:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
and regarding the recognized characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the recognized characters.
Preferably, the text rearrangement model building by deep learning comprises the following steps:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
and obtaining all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model.
Preferably, the text rearrangement model classifies the text generated by different people, including the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
characters whose time points are adjacent but not identical are taken as a test group, the characters in the test group are arranged in chronological order, and the word rearrangement model is searched for a daily phrase consistent with the test group; if no such phrase exists, the test group is deleted, and if one exists, the test group becomes a prepared phrase;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judging time, deleting the first prepared phrase pairing group in the pairing set to obtain a second character time sequence set;
the above steps are repeated until the paired set becomes an empty set.
The conference audio analysis processing system based on time sequence is used for realizing the conference audio analysis processing method based on time sequence, and comprises the following modules:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
Compared with the prior art, the invention has the beneficial effects that:
By providing the voice signal separation module, the voice signal recognition module and the word processing module, the human voice is separated out and the speech uttered by multiple people is recognized as characters; the characters produced by different people are distinguished according to the word rearrangement model and summarized in time order to obtain at least one text time sequence set, thereby solving the problems that current speech recognition can only separate human voice from sounds differing greatly from human voice, cannot extract and classify different human voices, and cannot generate characters by classified recognition.
Drawings
FIG. 1 is a schematic flow chart of a method for analyzing and processing conference audio based on time sequence;
FIG. 2 is a schematic diagram of a process of framing and windowing a speech signal according to the present invention;
FIG. 3 is a schematic flow chart of a part of the method for obtaining the first characteristic range of the human voice in the frequency domain and rejecting at least one basic sinusoidal signal outside the first characteristic range;
FIG. 4 is a schematic diagram of a voice recognition process for human voice according to the present invention;
FIG. 5 is a schematic diagram of a process for creating a text rearrangement model using deep learning according to the present invention;
FIG. 6 is a schematic diagram of a text rearrangement model for classifying text generated by different persons according to the present invention.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to fig. 1, a method for analyzing and processing conference audio based on time sequence includes:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
acquiring the voice signal and preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasizing the voice signal, framing the voice signal, windowing the voice signal, and performing a fast Fourier transform on the voice signal to decompose the voice signal into at least one basic sinusoidal signal; the at least one basic sinusoidal signal forms the frequency-domain information, and each basic sinusoidal signal is determined by its amplitude, phase and frequency; the pre-emphasis compensates the impaired signal by enhancing the high-frequency components of the signal at the transmitting end of the transmission line, so as to compensate for the excessive attenuation of the high-frequency components during transmission;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing inverse fourier transform on at least one modified sinusoidal signal to obtain separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved and jumbled characters produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
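By way of illustration, the pre-emphasis mentioned in the preprocessing flow above is commonly realized as a first-order high-pass filter; the following minimal sketch assumes a typical pre-emphasis coefficient of 0.97, a value that is an assumption rather than one fixed by this embodiment.
```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Boost high-frequency components: y[n] = x[n] - alpha * x[n-1]."""
    # The first sample has no predecessor and is kept unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```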
Referring to fig. 2, framing a voice signal, and windowing the voice signal includes the steps of:
determining the frame length of the frame, wherein the frame length of the frame is 15ms-25ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing a voice signal into at least one voice segment according to a frame length to obtain an original voice sequence v (n) of the voice segment in a time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window;
the speech signal is framed and windowed in order to make the speech signal smoother.
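As an illustration of the framing and windowing described above, the sketch below splits a signal into overlapping frames and multiplies each frame v(n) by a moving window w(n); the 20 ms frame length and 10 ms frame shift are example values within the stated ranges, and the 16 kHz sample rate and the Bartlett (triangular) window are assumptions not fixed by this embodiment.
```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=20, shift_ms=10, window="triangular"):
    """Split the signal into overlapping frames and apply the window w(n) to each frame v(n)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # frame length in samples
    frame_shift = int(sample_rate * shift_ms / 1000)  # frame shift (here one half of the frame length)
    w = np.ones(frame_len) if window == "rectangular" else np.bartlett(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        v = signal[start:start + frame_len]           # original speech sequence v(n) of one segment
        frames.append(v * w)                          # windowed segment v(n) * w(n)
    return np.array(frames)
```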
Performing a fast fourier transform on the speech signal comprises the steps of:
obtaining at least one windowed speech segment f (x) of the speech signal, processing the at least one speech segment f (x) using a fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^(-2πixt) dt
wherein F(x) is a basic sinusoidal signal, i is a unit imaginary number, and f(t) is the function of a speech segment;
the fast Fourier transform is performed on the speech signal to generate a plurality of basic sinusoidal signals; since basic sinusoidal signals are sinusoidal functions and the data processing of sinusoidal functions is a well-known technique, the difficulty of processing the speech signal is reduced.
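A sketch of this decomposition for one windowed frame is given below; it uses the discrete fast Fourier transform (numpy's rfft) as a stand-in for the continuous transform written above and recovers the amplitude, phase and frequency that determine each basic sinusoidal signal. The 16 kHz sample rate is an assumed value.
```python
import numpy as np

def basic_sinusoids(frame: np.ndarray, sample_rate: int = 16000):
    """Decompose one windowed frame into basic sinusoidal components via the FFT."""
    spectrum = np.fft.rfft(frame)                              # complex frequency-domain coefficients
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)   # frequency of each component in Hz
    amplitude = np.abs(spectrum)                               # amplitude of each basic sinusoid
    phase = np.angle(spectrum)                                 # phase of each basic sinusoid
    return freqs, amplitude, phase, spectrum
```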
Referring to fig. 3, obtaining a first characteristic range of a human voice in a frequency domain, and removing a portion of at least one basic sinusoidal signal outside the first characteristic range includes the following steps:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
setting a transition signal function G(x), wherein the transition signal function G(x) is initially identically 0; for any point c within the domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the domain of the basic sinusoidal signal F(x), the assignment of values to the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal;
the human voice generally falls within a certain range in the frequency domain, and sound waves outside this range belong to other sounds and need to be removed, so the portions of the basic sinusoidal signal outside this range are rejected.
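A minimal sketch of this rejection step follows; it interprets the first characteristic range (a, b) as a band of frequencies and zeroes out every component outside it, which plays the role of the transition signal function G(x) described above. Both this interpretation and the 80-3400 Hz example band are assumptions, not values fixed by this embodiment.
```python
import numpy as np

def reject_outside_range(spectrum: np.ndarray, freqs: np.ndarray, a: float = 80.0, b: float = 3400.0):
    """Zero out frequency components outside the first characteristic range (a, b)."""
    # G takes the original value inside (a, b) and 0 outside it.
    return np.where((freqs > a) & (freqs < b), spectrum, 0.0)
```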
Performing an inverse fourier transform on the at least one modified sinusoidal signal to obtain a separated human voice, comprising the steps of:
acquiring at least one modified sinusoidal signal, and performing inverse fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = ∫_{-∞}^{+∞} G(t) e^(2πixt) dt
wherein G(t) is a modified sinusoidal signal, i is a unit imaginary number, and g(x) is a function of the modified speech segment;
the processed corrected sinusoidal signals are recombined into a new signal, that is, the non-human-voice portions have been removed, so that the separated human voice is obtained.
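The reconstruction can be sketched as follows: each corrected spectrum is turned back into a time-domain segment with the inverse FFT, and the segments are recombined by overlap-add. The overlap-add strategy is an assumption about how the combination of corrected segments might be realized, since this embodiment does not fix it.
```python
import numpy as np

def reconstruct_voice(corrected_spectra, frame_len: int, frame_shift: int) -> np.ndarray:
    """Inverse-transform each corrected spectrum and recombine the segments into the separated voice."""
    total_len = frame_shift * (len(corrected_spectra) - 1) + frame_len
    voice = np.zeros(total_len)
    for k, spec in enumerate(corrected_spectra):
        segment = np.fft.irfft(spec, n=frame_len)                       # corrected speech segment g(x)
        voice[k * frame_shift:k * frame_shift + frame_len] += segment   # overlap-add recombination
    return voice
```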
Referring to fig. 4, the voice recognition of the human voice includes the steps of:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
regarding the identified characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the identified characters;
when a person speaks, there is a very short pause between characters during which the sound wave is nearly zero; such points can serve as character spacing points, so the sound wave between adjacent character spacing points corresponds to one character, which can then be recognized.
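A simplified sketch of this segmentation-and-matching idea is given below: the voice function is cut at points where its value is close to zero, and each segment is compared against stored template waveforms, taking the character whose template differs least. The near-zero threshold, the resizing of segments to the template length, the plain squared difference, and the templates dictionary itself are all illustrative assumptions rather than details fixed by this embodiment.
```python
import numpy as np

def recognize(voice: np.ndarray, templates: dict, eps: float = 1e-3):
    """Split the voice function H(x) at character spacing points and match each piece to a template.

    templates: assumed mapping from a character (or word) to its reference acoustic wave function.
    """
    spacing = np.where(np.abs(voice) < eps)[0]                  # character spacing points (value close to 0)
    cut_points = [0] + list(spacing) + [len(voice)]
    results = []
    for start, end in zip(cut_points[:-1], cut_points[1:]):
        segment = voice[start:end]
        if len(segment) == 0:
            continue
        # choose the template acoustic wave function with the smallest difference from the segment
        best_char, _ = min(templates.items(),
                           key=lambda kv: np.sum((np.resize(segment, kv[1].shape) - kv[1]) ** 2))
        results.append((best_char, (start + end) / 2))          # midpoint as the time point of the character
    return results
```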
Referring to fig. 5, the creation of a text rearrangement model using deep learning includes the steps of:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
acquiring all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model;
the word rearrangement model mainly establishes matching relations between characters according to their usage habits; the matching is divided into two stages: first, phrases are established, and second, phrase collocations are used to build up sentences step by step.
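The two-stage dictionary described above can be sketched as follows; each sample sentence is cut into daily phrases and adjacent phrases are recorded as daily phrase pairing groups. The fixed two-character phrase length is an illustrative assumption, since this embodiment does not prescribe how sentences are cut into daily phrases.
```python
def build_rearrangement_model(sentences, phrase_len: int = 2):
    """Build the daily-phrase set and the daily-phrase pairing groups from a dialogue sample data set."""
    phrases, phrase_pairs = set(), set()
    for sentence in sentences:
        # cut the sentence into consecutive daily phrases of a fixed length
        daily = [sentence[i:i + phrase_len] for i in range(0, len(sentence), phrase_len)]
        phrases.update(daily)
        # adjacent daily phrases, kept in sentence order, form a daily phrase pairing group
        phrase_pairs.update(zip(daily[:-1], daily[1:]))
    return phrases, phrase_pairs
```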
Referring to fig. 6, the text rearrangement model classifies text generated by different persons, comprising the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
characters whose time points are adjacent but not identical are taken as a test group, the characters in the test group are arranged in chronological order, and the word rearrangement model is searched for a daily phrase consistent with the test group; if no such phrase exists, the test group is deleted, and if one exists, the test group becomes a prepared phrase;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judging time, deleting the first prepared phrase pairing group in the pairing set to obtain a second character time sequence set;
repeating the previous step until the pairing set becomes an empty set;
because the character set is generated by multiple people, the characters overlap; first, characters are paired into phrases in time order, and since characters spoken by different people have no logical connection they cannot be paired into a phrase; second, phrases are paired into sentences, and likewise phrases spoken by different people have no logical connection and cannot be paired into a sentence;
here, when the character set generates at least one prepared phrase pairing group, a prepared phrase that lies in the middle of a sentence appears in two prepared phrase pairing groups, once paired with the prepared phrase before it and once with the prepared phrase after it; the prepared phrase pairing groups therefore need to be fused into a sentence, each fused sentence corresponding to what one person said.
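A compact sketch of the pairing-and-fusion procedure described above is given below, operating on (character, time point) pairs produced by the recognition step. It follows the staged logic of this embodiment, but the way adjacency of time points is decided, the fixed phrase length and the data layout are simplifying assumptions.
```python
def classify_by_speaker(timed_chars, phrases, phrase_pairs, phrase_len: int = 2):
    """Group recognized (character, time) pairs into per-speaker text time sequence sets."""
    timed_chars = sorted(timed_chars, key=lambda ct: ct[1])
    # stage 1: adjacent characters that form a known daily phrase become prepared phrases
    prepared = []
    for i in range(len(timed_chars) - phrase_len + 1):
        group = timed_chars[i:i + phrase_len]
        text = "".join(ch for ch, _ in group)
        if text in phrases:
            avg_time = sum(t for _, t in group) / phrase_len          # time sequence time of the phrase
            prepared.append((text, avg_time))
    # stage 2: adjacent prepared phrases that form a known pairing group become prepared phrase pairing groups
    prepared.sort(key=lambda pt: pt[1])
    pairing_set = []
    for (p1, t1), (p2, t2) in zip(prepared[:-1], prepared[1:]):
        if (p1, p2) in phrase_pairs:
            pairing_set.append(([p1, p2], (t1 + t2) / 2))             # judgment time of the pairing group
    # stage 3: repeatedly fuse pairing groups whose head matches the tail of the growing sentence
    pairing_set.sort(key=lambda st: st[1])
    sequences = []
    while pairing_set:
        sentence, _ = pairing_set.pop(0)                              # group with the smallest judgment time
        fused = True
        while fused:
            fused = False
            for idx, (candidate, _) in enumerate(pairing_set):
                if candidate[0] == sentence[-1]:                      # head phrase equals current tail phrase
                    sentence.extend(candidate[1:])
                    pairing_set.pop(idx)
                    fused = True
                    break
        sequences.append(sentence)                                    # one text time sequence set per fusion
    return sequences
```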
The conference audio analysis processing system based on time sequence is used for realizing the conference audio analysis processing method based on time sequence, and comprises the following modules:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
The conference audio analysis processing system based on time sequence has the following working procedures:
step one: the voice signal processing module acquires conference audio which is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
step two: the voice signal noise reduction module performs noise reduction processing on the voice signal;
step three: the voice signal preprocessing module acquires a voice signal, performs preprocessing on the voice signal, sequentially performs pre-emphasis on the voice signal, frames the voice signal, windows the voice signal, performs fast Fourier transform on the voice signal, and decomposes the voice signal into at least one basic sinusoidal signal which forms frequency domain information;
step four: the voice signal correction module obtains a first characteristic range of the voice on a frequency domain, and eliminates the part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
step five: the voice signal separation module performs inverse Fourier transform on at least one corrected sinusoidal signal to obtain separated human voice;
step six: the voice signal recognition module carries out voice recognition on the human voice to generate characters, wherein the characters are interleaved and jumbled characters produced by multiple speakers;
step seven: the word processing module uses deep learning to establish a word rearrangement model, and the word rearrangement model classifies words generated by different people to obtain at least one word time sequence set;
step eight: and the information management module is used for carrying out information management on the text time sequence set to generate a meeting summary.
Still further, the present disclosure provides a storage medium having a computer readable program stored thereon, the computer readable program, when invoked, performing the above-described time-sequence-based conference audio analysis processing method.
It is understood that the storage medium may be a magnetic medium, e.g., a floppy disk, hard disk or magnetic tape; an optical medium, such as a DVD; or a semiconductor medium, such as a solid state disk (SSD), etc.
In summary, the invention has the advantages that: by providing the voice signal separation module, the voice signal recognition module and the word processing module, the human voice is separated out and the speech uttered by multiple people is recognized as characters; the characters produced by different people are distinguished according to the word rearrangement model and summarized in time order to obtain at least one text time sequence set, thereby solving the problems that current speech recognition can only separate human voice from sounds differing greatly from human voice, cannot extract and classify different human voices, and cannot generate characters by classified recognition.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made therein without departing from the spirit and scope of the invention, which is defined by the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1. A method for analyzing and processing conference audio based on time sequence, which is characterized by comprising the following steps:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
acquiring the voice signal and preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasizing the voice signal, framing the voice signal, windowing the voice signal, and performing a fast Fourier transform on the voice signal to decompose the voice signal into at least one basic sinusoidal signal; the at least one basic sinusoidal signal forms the frequency-domain information, and each basic sinusoidal signal is determined by its amplitude, phase and frequency; the pre-emphasis compensates the impaired signal by enhancing the high-frequency components of the signal at the transmitting end of the transmission line, so as to compensate for the excessive attenuation of the high-frequency components during transmission;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing inverse fourier transform on at least one modified sinusoidal signal to obtain separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved and jumbled characters produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
2. The method for analyzing and processing conference audio based on time sequence according to claim 1, wherein framing the voice signal and windowing the voice signal comprises the steps of:
determining the frame length of the frame, wherein the frame length of the frame is 15ms-25ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing the voice signal into at least one voice segment according to the frame length to obtain an original voice sequence v (n) of the voice segment in the time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window.
3. A method of time-series based conference audio analysis processing according to claim 2, wherein said performing a fast fourier transform on the speech signal comprises the steps of:
obtaining at least one windowed speech segment f (x) of the speech signal, processing the at least one speech segment f (x) using a fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^(-2πixt) dt
where F(x) is a basic sinusoidal signal, i is the unit imaginary number, and f(t) is the function of the speech segment.
4. A method of analyzing and processing conference audio based on time sequence according to claim 3, wherein the step of obtaining a first characteristic range of the voice in the frequency domain and removing the portion of the at least one basic sinusoidal signal outside the first characteristic range comprises the steps of:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
setting a transition signal function G(x), wherein the transition signal function G(x) is initially identically 0; for any point c within the domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the domain of the basic sinusoidal signal F(x), the assignment of values to the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal.
5. The method of time-series-based conference audio analysis processing according to claim 4, wherein said performing inverse fourier transform on at least one modified sinusoidal signal to obtain a separated voice comprises the steps of:
acquiring at least one modified sinusoidal signal, and performing inverse fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = ∫_{-∞}^{+∞} G(t) e^(2πixt) dt
where G(t) is the modified sinusoidal signal, i is the unit imaginary number, and g(x) is the function of the modified speech segment.
6. The method of time-series based conference audio analysis processing according to claim 5, wherein said voice recognition of human voice comprises the steps of:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
and regarding the recognized characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the recognized characters.
7. The method of time-series based conference audio analysis processing according to claim 6, wherein said using deep learning to build a text rearrangement model comprises the steps of:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
and obtaining all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model.
8. The method of claim 7, wherein the word rearrangement model classifies words generated by different persons, comprising the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
characters whose time points are adjacent but not identical are taken as a test group, the characters in the test group are arranged in chronological order, and the word rearrangement model is searched for a daily phrase consistent with the test group; if no such phrase exists, the test group is deleted, and if one exists, the test group becomes a prepared phrase;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judging time, deleting the first prepared phrase pairing group in the pairing set to obtain a second character time sequence set;
the above steps are repeated until the paired set becomes an empty set.
9. A time-series-based conference audio analysis processing system for implementing the time-series-based conference audio analysis processing method according to any one of claims 1 to 8, comprising:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
CN202311586467.4A 2023-11-27 2023-11-27 Conference audio analysis processing method and system based on time sequence Active CN117316165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311586467.4A CN117316165B (en) 2023-11-27 2023-11-27 Conference audio analysis processing method and system based on time sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311586467.4A CN117316165B (en) 2023-11-27 2023-11-27 Conference audio analysis processing method and system based on time sequence

Publications (2)

Publication Number Publication Date
CN117316165A true CN117316165A (en) 2023-12-29
CN117316165B CN117316165B (en) 2024-02-20

Family

ID=89281394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311586467.4A Active CN117316165B (en) 2023-11-27 2023-11-27 Conference audio analysis processing method and system based on time sequence

Country Status (1)

Country Link
CN (1) CN117316165B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009156773A1 (en) * 2008-06-27 2009-12-30 Monting-I D.O.O. Device and procedure for recognizing words or phrases and their meaning from digital free text content
US20190392837A1 (en) * 2018-06-22 2019-12-26 Microsoft Technology Licensing, Llc Use of voice recognition to generate a transcript of conversation(s)
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text
CN113380234A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Method, device, equipment and medium for generating form based on voice recognition
CN113505609A (en) * 2021-05-28 2021-10-15 引智科技(深圳)有限公司 One-key auxiliary translation method for multi-language conference and equipment with same
US20220237379A1 (en) * 2019-05-20 2022-07-28 Samsung Electronics Co., Ltd. Text reconstruction system and method thereof
CN115273821A (en) * 2022-07-13 2022-11-01 平顶山学院 Speech recognition system based on semantic understanding of computer application scene

Also Published As

Publication number Publication date
CN117316165B (en) 2024-02-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant