CN117316165A - Conference audio analysis processing method and system based on time sequence - Google Patents
- Publication number
- CN117316165A CN117316165A CN202311586467.4A CN202311586467A CN117316165A CN 117316165 A CN117316165 A CN 117316165A CN 202311586467 A CN202311586467 A CN 202311586467A CN 117316165 A CN117316165 A CN 117316165A
- Authority
- CN
- China
- Prior art keywords
- voice
- phrase
- signal
- prepared
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a conference audio analysis processing method and system based on time sequence, relating to the technical field of conference management and comprising the following steps: acquiring conference audio; carrying out noise reduction processing on the voice signal; preprocessing the voice signal; acquiring a first characteristic range of the human voice in the frequency domain, and removing the part of at least one basic sinusoidal signal outside the first characteristic range; performing an inverse Fourier transform on at least one modified sinusoidal signal to obtain the separated human voice; carrying out voice recognition on the human voice to generate characters, wherein the characters are interleaved, disordered text produced by multiple speakers; and establishing a word rearrangement model by deep learning, and classifying the characters generated by different people with the word rearrangement model to obtain at least one character time sequence set. By arranging the voice signal separation module, the voice signal recognition module and the word processing module, the characters generated by different people are distinguished and summarized according to time sequence, so that at least one character time sequence set is obtained.
Description
Technical Field
The invention relates to the technical field of conference management, in particular to a time sequence-based conference audio analysis processing method and system.
Background
As technology advances, many products for automatically recording conference content are continually being introduced, from the earliest recorders to automated voice-to-text equipment. The content captured by these recording methods is voluminous, because a conference often continues for several hours, so reviewing or retrieving meeting records is time-consuming and labor-intensive. Some advanced products tag meeting participants with biometric features such as voiceprints or fingerprints and then quickly locate meeting recordings by these tags, but they are still not efficient.
The difficulty of conference audio analysis is that multiple people may speak at the same time, and current speech recognition can only separate the human voice from sounds that differ greatly from it; it cannot separate and classify the voices of different speakers, so text cannot be generated by classified recognition.
Disclosure of Invention
In order to solve the above technical problems, the present technical scheme provides a time sequence-based conference audio analysis processing method and system, which solves the problems presented in the background art: multiple people may speak at the same time, and current speech recognition can only separate the human voice from sounds that differ greatly from it, cannot separate and classify different voices, and therefore cannot generate text by classified recognition.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a conference audio analysis processing method based on time sequence comprises the following steps:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
the method comprises the steps of obtaining a voice signal, preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasis of the voice signal, framing of the voice signal, windowing of the voice signal, fast Fourier transform of the voice signal, decomposing of the voice signal into at least one basic sinusoidal signal, frequency domain information forming of the at least one basic sinusoidal signal, and determination of the basic sinusoidal signal by amplitude, phase and frequency, wherein the pre-emphasis compensates for a damaged signal, high-frequency components of the signal are enhanced at the beginning end of a transmission line, and excessive attenuation of the high-frequency components in the transmission process is compensated;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing an inverse Fourier transform on at least one modified sinusoidal signal to obtain the separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved, disordered text produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
Preferably, the framing the voice signal and windowing the voice signal includes the following steps:
determining the frame length, wherein the frame length is 15 ms to 25 ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing the voice signal into at least one voice segment according to the frame length to obtain an original voice sequence v (n) of the voice segment in the time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window.
Preferably, performing the fast Fourier transform on the voice signal includes the following steps:
obtaining at least one windowed speech segment f(x) of the voice signal, and processing the at least one speech segment f(x) using a Fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^{-ixt} dt
where F(x) is the basic sinusoidal signal, i is the imaginary unit, and f(t) is the function of the windowed speech segment.
Preferably, the step of obtaining the first characteristic range of the human voice in the frequency domain and removing the part of the at least one basic sinusoidal signal outside the first characteristic range includes the following steps:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
defining a transition signal function G(x), the transition signal function G(x) being initially identically 0; for any point c within the definition domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the definition domain of the basic sinusoidal signal F(x), the value resetting of the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal.
Preferably, performing an inverse Fourier transform on the at least one modified sinusoidal signal to obtain the separated human voice comprises the following steps:
acquiring at least one modified sinusoidal signal, and performing an inverse Fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = (1/2π) ∫_{-∞}^{+∞} G(t) e^{ixt} dt
where G(t) is the modified sinusoidal signal, i is the imaginary unit, and g(x) is the function of the modified speech segment.
Preferably, the voice recognition of the human voice includes the following steps:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
and regarding the recognized characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the recognized characters.
Preferably, the text rearrangement model building by deep learning comprises the following steps:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
and obtaining all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model.
Preferably, the text rearrangement model classifies the text generated by different people, including the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
the characters with adjacent time points but different time points are used as test groups, the sequence of the characters in the test groups is arranged according to time sequence, whether daily phrases consistent with the test groups exist or not is searched in a character rearrangement model, if not, the test groups are deleted, and if so, the test groups are changed into prepared phrases;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judging time, deleting the first prepared phrase pairing group in the pairing set to obtain a second character time sequence set;
the above steps are repeated until the paired set becomes an empty set.
The conference audio analysis processing system based on time sequence is used for realizing the conference audio analysis processing method based on time sequence, and comprises the following modules:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
Compared with the prior art, the invention has the beneficial effects that:
By arranging the voice signal separation module, the voice signal recognition module and the word processing module, the human voice is separated out and recognized to produce the text spoken by multiple people; the text generated by different people is distinguished according to the word rearrangement model and summarized according to time sequence to obtain at least one character time sequence set, thereby solving the problems that current speech recognition can only separate the human voice from sounds that differ greatly from it, cannot separate and classify different voices, and cannot generate text by classified recognition.
Drawings
FIG. 1 is a schematic flow chart of a method for analyzing and processing conference audio based on time sequence;
FIG. 2 is a schematic diagram of a process of framing and windowing a speech signal according to the present invention;
FIG. 3 is a schematic flow chart of a part of the method for obtaining the first characteristic range of the human voice in the frequency domain and rejecting at least one basic sinusoidal signal outside the first characteristic range;
FIG. 4 is a schematic diagram of a voice recognition process for human voice according to the present invention;
FIG. 5 is a schematic diagram of a process for creating a text rearrangement model using deep learning according to the present invention;
FIG. 6 is a schematic diagram of a text rearrangement model for classifying text generated by different persons according to the present invention.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to fig. 1, a method for analyzing and processing conference audio based on time sequence includes:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
the method comprises the steps of obtaining a voice signal, preprocessing the voice signal, wherein the preprocessing flow sequentially comprises pre-emphasis of the voice signal, framing of the voice signal, windowing of the voice signal, fast Fourier transform of the voice signal, decomposing of the voice signal into at least one basic sinusoidal signal, frequency domain information forming of the at least one basic sinusoidal signal, and determination of the basic sinusoidal signal by amplitude, phase and frequency, wherein the pre-emphasis compensates for a damaged signal, high-frequency components of the signal are enhanced at the beginning end of a transmission line, and excessive attenuation of the high-frequency components in the transmission process is compensated;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing an inverse Fourier transform on at least one modified sinusoidal signal to obtain the separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved, disordered text produced by multiple speakers;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
Referring to fig. 2, framing a voice signal, and windowing the voice signal includes the steps of:
determining the frame length, wherein the frame length is 15 ms to 25 ms;
determining a frame shift, wherein the time difference of the initial positions of two adjacent frames is called frame shift, the frame shift ensures that the two adjacent frames have an overlapped part, and the frame shift is one half to three quarters of the frame length;
dividing a voice signal into at least one voice segment according to a frame length to obtain an original voice sequence v (n) of the voice segment in a time domain, multiplying the original voice sequence v (n) by a moving window function w (n), and finishing windowing, wherein the window function can select a rectangular window or a triangular window;
the speech signal is framed and windowed in order to make the speech signal smoother.
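The framing-and-windowing procedure above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the 8 kHz toy signal, and the choice of a triangular window are assumptions, while the frame length and shift are taken from the ranges given in the text (15-25 ms frames, shift of one half to three quarters of the frame length).

```python
def frame_signal(samples, sample_rate, frame_ms=20, shift_ratio=0.5):
    """Split samples into overlapping frames; the frame shift is smaller
    than the frame length, so adjacent frames share an overlapped part."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(frame_len * shift_ratio)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

def triangular_window(n):
    """Triangular window w(n) of length n, peaking at the centre."""
    half = (n - 1) / 2
    return [1.0 - abs((k - half) / half) for k in range(n)]

def window_frames(frames):
    """Multiply each original sequence v(n) by the window function w(n)."""
    if not frames:
        return []
    w = triangular_window(len(frames[0]))
    return [[v * wk for v, wk in zip(frame, w)] for frame in frames]

# Example: 100 ms of a dummy signal at 8 kHz -> 20 ms frames, 10 ms shift.
signal = [1.0] * 800
frames = frame_signal(signal, sample_rate=8000)
windowed = window_frames(frames)
```

The overlap between adjacent frames is what makes the windowed signal smoother, as the text notes.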
Performing the fast Fourier transform on the speech signal comprises the following steps:
obtaining at least one windowed speech segment f(x) of the speech signal, and processing the at least one speech segment f(x) using a Fourier transform;
the Fourier transform is as follows:
F(x) = ∫_{-∞}^{+∞} f(t) e^{-ixt} dt
where F(x) is the basic sinusoidal signal, i is the imaginary unit, and f(t) is the function of the windowed speech segment;
the fast fourier transform is performed on the speech signal to generate a plurality of basic sinusoidal signals, the basic sinusoidal signals are sinusoidal functions, and the data processing of the sinusoidal functions is a known means, so that the difficulty in processing the speech signal can be reduced.
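As a concrete illustration of the decomposition into basic sinusoidal signals, the following is a minimal discrete sketch in Python. The patent specifies no implementation; the function name and the toy tone are invented, and a naive O(n²) DFT stands in for the fast Fourier transform.

```python
import cmath
import math

def dft(frame):
    """Discrete analogue of F(x) = integral of f(t)*e^(-ixt) dt:
    decompose the windowed frame into components, each determined by
    amplitude, phase and frequency (one complex value per frequency bin)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * x * t / n)
                for t in range(n))
            for x in range(n)]

# A pure 5-cycle sinusoid concentrates its energy in bins 5 and n-5,
# illustrating that each bin corresponds to one basic sinusoidal signal.
n = 64
tone = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
spectrum = dft(tone)
peak_bin = max(range(n), key=lambda x: abs(spectrum[x]))
```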
Referring to fig. 3, obtaining a first characteristic range of a human voice in a frequency domain, and removing a portion of at least one basic sinusoidal signal outside the first characteristic range includes the following steps:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
defining a transition signal function G(x), the transition signal function G(x) being initially identically 0; for any point c within the definition domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
when c has traversed the definition domain of the basic sinusoidal signal F(x), the value resetting of the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal;
the human voice generally has a predetermined range in the frequency domain, and sound waves not in the range are other sound waves and need to be removed, so that the part of the basic sinusoidal signal out of the range is removed.
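A minimal sketch of this removal step, reading the text as a frequency-domain test: a component survives only if its frequency lies inside a human-voice range (a, b). The concrete range and all names below are illustrative assumptions, not values from the patent.

```python
def build_transition(freqs, components, a, b):
    """Transition-signal sketch: G starts as all zeros, and a component is
    copied into G only when its frequency lies inside (a, b); everything
    outside the first characteristic range stays 0 and is thus removed."""
    g = [0.0] * len(components)
    for k, (f, amp) in enumerate(zip(freqs, components)):
        if a < f < b:
            g[k] = amp
    return g

# Hypothetical components at 50, 300, 1000 and 9000 Hz; an assumed voice
# range of (85, 3400) Hz keeps only the middle two.
freqs = [50, 300, 1000, 9000]
components = [0.2, 0.9, 0.7, 0.4]
g = build_transition(freqs, components, 85, 3400)
```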
Performing an inverse Fourier transform on the at least one modified sinusoidal signal to obtain the separated human voice comprises the following steps:
acquiring at least one modified sinusoidal signal, and performing an inverse Fourier transform on the at least one modified sinusoidal signal;
obtaining a function g (x) of at least one corrected voice small section, and combining the g (x) of the at least one corrected voice small section to obtain human voice;
the inverse Fourier transform is as follows:
g(x) = (1/2π) ∫_{-∞}^{+∞} G(t) e^{ixt} dt
where G(t) is the modified sinusoidal signal, i is the imaginary unit, and g(x) is the function of the modified speech segment;
the processed corrected sinusoidal signal is recombined into a new signal, namely, the part of the non-human voice is removed, so that the separated human voice is obtained.
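The recombination step can be sketched with a discrete inverse transform; the round trip below shows that inverting the transform recovers the time-domain segment, so applying the inverse after zeroing non-voice components yields the filtered (separated) segment. Names and the toy frame are illustrative.

```python
import cmath

def dft(frame):
    """Forward transform, repeated here so the sketch is self-contained."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * x * t / n)
                for t in range(n)) for x in range(n)]

def idft(spectrum):
    """Discrete analogue of g(x) = (1/2*pi) * integral of G(t)*e^(ixt) dt:
    recombine the (possibly corrected) sinusoidal components into a
    time-domain speech segment."""
    n = len(spectrum)
    return [sum(spectrum[t] * cmath.exp(2j * cmath.pi * x * t / n)
                for t in range(n)).real / n
            for x in range(n)]

# Round trip on a small frame: transforming and inverting recovers it.
frame = [0.0, 1.0, 0.0, -1.0]
recovered = idft(dft(frame))
```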
Referring to fig. 4, the voice recognition of the human voice includes the steps of:
establishing a voice recognition model, wherein the voice recognition model comprises a single word and a first acoustic function corresponding to the single word and a second acoustic function corresponding to at least one word;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
comparing a third acoustic wave function with the minimum difference with the segmentation function in the voice recognition model, if the third acoustic wave function is a first acoustic wave function, taking characters corresponding to the first acoustic wave function as characters recognized by the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking at least one character corresponding to the second acoustic wave function as at least one character recognized by the segmentation function;
regarding the identified characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the identified characters;
when a person makes a sound, each word has very small pauses, and the sound wave is small and can be used as a word spacing point, so that the sound wave between adjacent word spacing points corresponds to one word, and therefore, recognition can be performed.
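The pause-based segmentation described here can be sketched as splitting the sampled waveform H(x) at its zero-valued points. This is a simplified model of the text's word spacing points (real recognizers use more robust endpointing); the sample values are invented.

```python
def split_at_zero_points(samples):
    """Split a voice waveform H(x) at the points where its value is 0,
    i.e. the small pauses between words, so that each non-zero run
    corresponds to the sound wave of one word."""
    segments, current = [], []
    for s in samples:
        if s == 0:
            if current:           # a zero ends the current word, if any
                segments.append(current)
                current = []
        else:
            current.append(s)
    if current:
        segments.append(current)
    return segments

# Two "words" separated by a silent (zero-valued) gap.
wave = [0.3, 0.5, 0.2, 0, 0, 0.4, 0.6]
words = split_at_zero_points(wave)
```

Each returned segment would then be matched against the acoustic wave functions in the recognition model.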
Referring to fig. 5, the creation of a text rearrangement model using deep learning includes the steps of:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
acquiring all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model;
the word rearrangement model mainly establishes the matching relation of the words according to the using habit of the words, the matching is divided into two stages, firstly, the word group is established, and secondly, the word group collocation is used for generating sentences step by step.
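A minimal sketch of the two-stage model store, with whitespace-separated words standing in for "daily phrases" (an assumption made for illustration; the sample sentences are invented):

```python
def build_rearrangement_model(sentences):
    """From a daily-dialogue sample set, cut each sentence into phrases
    and store (a) every phrase and (b) every adjacent phrase pair in the
    order the phrases occur in the sentence."""
    phrases, pairs = set(), set()
    for sentence in sentences:
        parts = sentence.split()
        phrases.update(parts)
        for left, right in zip(parts, parts[1:]):
            pairs.add((left, right))      # kept in sentence order
    return phrases, pairs

phrases, pairs = build_rearrangement_model(
    ["good morning everyone", "morning everyone"])
```

The stored pairs are what later lets the classifier test whether two time-adjacent fragments plausibly came from the same speaker.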
Referring to fig. 6, the text rearrangement model classifies text generated by different persons, comprising the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
the characters with adjacent time points but different time points are used as test groups, the sequence of the characters in the test groups is arranged according to time sequence, whether daily phrases consistent with the test groups exist or not is searched in a character rearrangement model, if not, the test groups are deleted, and if so, the test groups are changed into prepared phrases;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
the preparation phrase with adjacent time sequence is used as a test group, the sequence of the preparation phrase in the test group is arranged according to time sequence, whether a daily phrase pairing group consistent with the test group exists or not is searched in a word rearrangement model, if not, the test group is deleted, and if so, the test group is used as the preparation phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting a first prepared phrase pairing group with the smallest judgment time, deleting the first prepared phrase pairing group in a pairing set, and selecting a second prepared phrase pairing group to meet preset conditions: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is butted at the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judgment time and deleting it from the pairing set, so as to obtain a second text time sequence set;
repeating the previous step until the pairing set becomes an empty set;
the word set is generated by multiple people, so the characters are interleaved; firstly, the characters are paired into phrases in time order, and characters spoken by different people have no logical connection and therefore cannot be paired into a phrase; secondly, the phrases are paired into sentences, and phrases spoken by different people likewise have no logical connection and therefore cannot be paired into a sentence;
here, when the character set generates at least one prepared phrase pairing group, each prepared phrase inside a sentence appears in two prepared phrase pairing groups: the prepared phrase that precedes it in the sentence pairs with it once, and the prepared phrase that follows it pairs with it once. The prepared phrase pairing groups therefore need to be fused into sentences, so that the sentences spoken by different people are obtained.
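The pairing-and-fusion procedure described above can be sketched in Python. This is a hypothetical illustration only, not the patented implementation: representing a prepared phrase pairing group as a (judgment time, phrase list) tuple and the function name `fuse_pairing_groups` are assumptions.

```python
def fuse_pairing_groups(pairing_set):
    """Fuse prepared phrase pairing groups into sentences.

    pairing_set: list of (judgment_time, [head_phrase, tail_phrase]) tuples.
    Returns one fused phrase chain (sentence) per utterance.
    """
    sentences = []
    # process groups in order of increasing judgment time
    remaining = sorted(pairing_set, key=lambda group: group[0])
    while remaining:
        # the group with the smallest judgment time becomes the first group
        _, first = remaining.pop(0)
        chain = list(first)
        merged = True
        while merged:
            merged = False
            for i, (_, second) in enumerate(remaining):
                # preset condition: the second group's head phrase is
                # consistent with the first group's tail phrase
                if second[0] == chain[-1]:
                    chain.extend(second[1:])  # drop duplicated head, dock at tail
                    remaining.pop(i)
                    merged = True
                    break
        sentences.append(chain)
    return sentences
```

Because a phrase in the middle of a sentence tails one pairing group and heads the next, chaining on matching head/tail phrases reconstructs each speaker's sentence in time order.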
The conference audio analysis processing system based on time sequence is used for realizing the conference audio analysis processing method based on time sequence, and comprises the following modules:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
The conference audio analysis processing system based on time sequence has the following working procedures:
step one: the voice signal processing module acquires conference audio which is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
step two: the voice signal noise reduction module performs noise reduction processing on the voice signal;
step three: the voice signal preprocessing module acquires a voice signal, performs preprocessing on the voice signal, sequentially performs pre-emphasis on the voice signal, frames the voice signal, windows the voice signal, performs fast Fourier transform on the voice signal, and decomposes the voice signal into at least one basic sinusoidal signal which forms frequency domain information;
step four: the voice signal correction module obtains a first characteristic range of the voice on a frequency domain, and eliminates the part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
step five: the voice signal separation module performs inverse Fourier transform on at least one corrected sinusoidal signal to obtain separated human voice;
step six: the voice signal recognition module carries out voice recognition on the human voice to generate characters, wherein the characters are interleaved and disordered characters generated by multiple people;
step seven: the word processing module uses deep learning to establish a word rearrangement model, and the word rearrangement model classifies words generated by different people to obtain at least one word time sequence set;
step eight: and the information management module is used for carrying out information management on the text time sequence set to generate a meeting summary.
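Steps three to five can be sketched end to end with NumPy. This is a minimal illustration under stated assumptions: the band limits (85-3400 Hz, a common human-voice range) and the pre-emphasis coefficient 0.97 are not values given in the patent.

```python
import numpy as np

def separate_voice(signal, sr, band=(85.0, 3400.0), alpha=0.97):
    """Pre-emphasize, transform, band-limit, and inverse-transform a signal."""
    # step three: pre-emphasis boosts high frequencies, y[n] = x[n] - alpha*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # step three: fast Fourier transform into basic sinusoidal components
    spectrum = np.fft.rfft(emphasized)
    freqs = np.fft.rfftfreq(len(emphasized), d=1.0 / sr)
    # step four: zero every component outside the first characteristic range (a, b)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    corrected = np.where(mask, spectrum, 0.0)
    # step five: inverse transform recovers the separated human voice
    return np.fft.irfft(corrected, n=len(emphasized))
```

The corrected spectrum keeps only in-band sinusoidal components, so the inverse transform returns a signal dominated by the human-voice range.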
Still further, the present disclosure provides a storage medium having a computer readable program stored thereon; when invoked, the computer readable program performs the above-described time-sequence-based conference audio analysis processing method.
It is understood that the storage medium may be a magnetic medium, e.g., a floppy disk, hard disk, or magnetic tape; an optical medium, such as a DVD; or a semiconductor medium, such as a solid state disk (Solid State Disk, SSD), etc.
In summary, the invention has the following advantages: by providing the voice signal separation module, the voice signal recognition module and the word processing module, the human voice is separated out and recognized to produce the characters spoken by multiple people; the characters produced by different people are distinguished according to the word rearrangement model and collected in time order to obtain at least one text time sequence set. This solves the problem that current speech recognition can only separate the human voice from sounds that differ greatly from it, cannot distinguish and classify different human voices, and therefore cannot classify the recognized characters by speaker.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (9)
1. A method for analyzing and processing conference audio based on time sequence, which is characterized by comprising the following steps:
acquiring conference audio, wherein the conference audio is restored into a voice signal through digital receiving, decompression, information processing and digital-to-analog conversion;
noise reduction processing is carried out on the voice signals;
acquiring the voice signal and preprocessing the voice signal, wherein the preprocessing flow sequentially comprises: pre-emphasizing the voice signal, framing the voice signal, windowing the voice signal, and performing fast Fourier transform on the voice signal to decompose the voice signal into at least one basic sinusoidal signal; the at least one basic sinusoidal signal forms frequency domain information, and a basic sinusoidal signal is determined by amplitude, phase and frequency; the pre-emphasis compensates for a damaged signal by enhancing the high-frequency components of the signal at the transmitting end of the transmission line, compensating for the excessive attenuation of the high-frequency components during transmission;
acquiring a first characteristic range of the human voice on a frequency domain, and removing a part of at least one basic sinusoidal signal outside the first characteristic range to obtain at least one corrected sinusoidal signal;
performing inverse fourier transform on at least one modified sinusoidal signal to obtain separated human voice;
voice recognition is carried out on the human voice to generate characters, wherein the characters are interleaved and disordered characters generated by multiple persons;
establishing a word rearrangement model by using deep learning, classifying words generated by different people by using the word rearrangement model to obtain at least one word time sequence set, wherein the words generated by the same person belong to the same word time sequence set, the words in the word time sequence set are arranged according to the time sequence of generation, and the words in the word time sequence set are marked with the time points when the words appear;
and managing information of the text time sequence set to generate a meeting summary.
2. The method for analyzing and processing conference audio based on time sequence according to claim 1, wherein framing the voice signal and windowing the voice signal comprises the steps of:
determining the frame length, wherein the frame length is 15 ms-25 ms;
determining the frame shift, wherein the time difference between the starting positions of two adjacent frames is called the frame shift; the frame shift ensures that two adjacent frames have an overlapping part, and the frame shift is one half to three quarters of the frame length;
dividing the voice signal into at least one speech segment according to the frame length to obtain the original voice sequence v(n) of a speech segment in the time domain, and multiplying the original voice sequence v(n) by a moving window function w(n) to complete windowing, wherein the window function can be a rectangular window or a triangular window.
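A minimal sketch of the framing and windowing steps, assuming a 20 ms frame (within the claimed 15-25 ms range), a frame shift of one half of the frame length, and a triangular window (one of the window shapes the claim allows); the function name and parameters are illustrative.

```python
import numpy as np

def frame_and_window(v, sr, frame_ms=20, shift_ratio=0.5):
    """Split the original voice sequence v(n) into overlapping frames and
    multiply each frame by a moving window function w(n)."""
    frame_len = int(sr * frame_ms / 1000)        # frame length in samples
    frame_shift = int(frame_len * shift_ratio)   # one half of the frame length
    window = np.bartlett(frame_len)              # triangular window w(n)
    frames = [v[start:start + frame_len] * window
              for start in range(0, len(v) - frame_len + 1, frame_shift)]
    return np.array(frames)
```

Because the frame shift is half the frame length, adjacent frames overlap, so no part of the signal is lost at frame boundaries.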
3. A method of time-series based conference audio analysis processing according to claim 2, wherein said performing a fast fourier transform on the speech signal comprises the steps of:
obtaining at least one windowed speech segment f (x) of the speech signal, processing the at least one speech segment f (x) using a fourier transform;
the fourier transform is as follows:

F(x) = ∫_{-∞}^{+∞} f(t) e^{-ixt} dt ,

where F(x) is the basic sinusoidal signal, i is the imaginary unit, and f(t) is the time-domain function of the speech segment.
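The discrete analogue of this transform can be demonstrated with NumPy's FFT; the 440 Hz test tone, sample rate and segment length are arbitrary illustrative choices, not values from the claim.

```python
import numpy as np

sr = 8000                                     # assumed sample rate
t = np.arange(400) / sr                       # one 50 ms speech segment
segment = 0.7 * np.sin(2 * np.pi * 440 * t)   # f(x): a single 440 Hz component

spectrum = np.fft.rfft(segment)               # discrete Fourier transform
freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)

# each frequency bin is one basic sinusoidal signal, determined by
# amplitude, phase and frequency
amplitudes = np.abs(spectrum) * 2 / len(segment)
phases = np.angle(spectrum)
peak_freq = freqs[np.argmax(amplitudes)]      # recovers the 440 Hz component
```

Since 440 Hz falls exactly on a frequency bin here (bin width 20 Hz), the component's amplitude and frequency are recovered exactly.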
4. A method of analyzing and processing conference audio based on time sequence according to claim 3, wherein the step of obtaining a first characteristic range of the voice in the frequency domain and removing the portion of the at least one basic sinusoidal signal outside the first characteristic range comprises the steps of:
acquiring a first characteristic range (a, b) of human voice on a frequency domain, and acquiring a basic sinusoidal signal F (x);
setting a transition signal function G(x) that is initially identically 0; for any point c within the definition domain of the basic sinusoidal signal F(x), if F(c) does not belong to (a, b), G(c) takes 0, and if F(c) belongs to (a, b), G(c) takes F(c);
after c has traversed the definition domain of the basic sinusoidal signal F(x), the value resetting of the transition signal function G(x) is complete, and the transition signal function G(x) is taken as the corrected sinusoidal signal.
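The point-by-point value resetting of claim 4 can be written directly. In this sketch, F is any callable, and the sampled definition domain and the bounds a, b are illustrative assumptions.

```python
import numpy as np

def transition_signal(F, domain, a, b):
    """Build the transition signal function G over a sampled definition domain.

    G is initially identically 0; for each point c, G(c) stays 0 when F(c)
    is outside (a, b) and takes the value F(c) when F(c) lies inside (a, b).
    """
    G = np.zeros(len(domain))      # initially constant 0
    for idx, c in enumerate(domain):
        value = F(c)
        if a < value < b:          # F(c) belongs to (a, b)
            G[idx] = value
    return G
```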
5. The method of time-series-based conference audio analysis processing according to claim 4, wherein said performing inverse fourier transform on at least one modified sinusoidal signal to obtain a separated voice comprises the steps of:
acquiring at least one modified sinusoidal signal, and performing inverse fourier transform on the at least one modified sinusoidal signal;
obtaining a function g(x) of at least one corrected speech segment, and combining the g(x) of the at least one corrected speech segment to obtain the human voice;
the inverse fourier transform is as follows:

g(x) = (1/2π) ∫_{-∞}^{+∞} G(t) e^{ixt} dt ,

where G(t) is the corrected sinusoidal signal, i is the imaginary unit, and g(x) is the time-domain function of the corrected speech segment.
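A round-trip with NumPy illustrates the inverse transform: a corrected spectrum is transformed back into a time-domain segment g(x). The no-op "correction" keeps the example short and is an assumption; a real correction would zero out-of-band bins as in claim 4.

```python
import numpy as np

sr, n = 8000, 400
t = np.arange(n) / sr
original = np.sin(2 * np.pi * 440 * t)   # a time-domain speech segment

spectrum = np.fft.rfft(original)         # forward transform (claim 3)
corrected = spectrum.copy()              # stand-in for the corrected sinusoidal signal
g = np.fft.irfft(corrected, n=n)         # inverse transform recovers g(x)
```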
6. The method of time-series based conference audio analysis processing according to claim 5, wherein said voice recognition of human voice comprises the steps of:
establishing a voice recognition model, wherein the voice recognition model comprises single characters with their corresponding first acoustic wave functions, and words of at least one character with their corresponding second acoustic wave functions;
acquiring a voice function H (x), and acquiring character spacing points of the voice function H (x), wherein the character spacing points are points at which the value of the voice function H (x) is 0;
dividing the voice function H (x) into at least one dividing function along a time axis according to the character spacing points;
searching the voice recognition model for a third acoustic wave function with the minimum difference from the segmentation function; if the third acoustic wave function is a first acoustic wave function, taking the character corresponding to that first acoustic wave function as the character recognized from the segmentation function, and if the third acoustic wave function is a second acoustic wave function, taking the at least one character corresponding to that second acoustic wave function as the at least one character recognized from the segmentation function;
and regarding the recognized characters, taking the middle point of the definition domain of the corresponding third acoustic wave function as the time point of the recognized characters.
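The segmentation at character spacing points (samples where H(x) is 0) can be sketched as follows; the tolerance `eps` for detecting a zero in sampled data is an assumption not stated in the claim.

```python
import numpy as np

def split_at_spacing_points(H, eps=1e-9):
    """Split the sampled voice function H at character spacing points
    (samples where H is 0), yielding one segmentation function per unit."""
    segments, current = [], []
    for sample in H:
        if abs(sample) <= eps:        # a spacing point ends the current segment
            if current:
                segments.append(np.array(current))
                current = []
        else:
            current.append(sample)
    if current:                       # flush the trailing segment
        segments.append(np.array(current))
    return segments
```

Each returned segment would then be compared against the model's acoustic wave functions to pick the closest match.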
7. The method of time-series based conference audio analysis processing according to claim 6, wherein said using deep learning to build a text rearrangement model comprises the steps of:
acquiring a daily dialogue sample data set, cutting sentences in the daily dialogue sample data set into at least one daily phrase, and taking adjacent daily phrases as daily phrase pairing groups;
in the daily phrase pairing group, daily phrases are arranged according to the sequence of sentences in the daily dialogue sample data set;
and obtaining all daily phrase and daily phrase pairing groups, and storing the daily phrase and the daily phrase pairing groups into a word rearrangement model.
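A minimal sketch of building the text rearrangement model from a daily dialogue sample set: store every daily phrase and every adjacent-phrase pairing group, ordered as in the source sentence. The set-based storage is an assumption; the claim only says the phrases and pairing groups are stored in the model, and the deep-learning aspect is omitted here.

```python
def build_rearrangement_model(sentences):
    """sentences: list of sentences, each pre-cut into a list of daily phrases.

    Returns (phrase_set, pair_set): all daily phrases, and all daily phrase
    pairing groups of adjacent phrases, ordered as they occur in the sentence.
    """
    phrases, pairs = set(), set()
    for sentence in sentences:
        phrases.update(sentence)
        # adjacent daily phrases form an ordered pairing group
        for left, right in zip(sentence, sentence[1:]):
            pairs.add((left, right))
    return phrases, pairs
```

Because the pairing groups are ordered, the model distinguishes "A then B" from "B then A", which is what lets the later lookup reject out-of-order test groups.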
8. The method of claim 7, wherein the word rearrangement model classifies words generated by different persons, comprising the steps of:
acquiring a character set of voice recognition, and calling the characters according to the sequence of time points of the characters in the character set;
characters whose time points are adjacent but not identical are taken as a test group, the characters in the test group are arranged in time order, the character rearrangement model is searched for a daily phrase consistent with the test group, if none exists, the test group is deleted, and if one exists, the test group becomes a prepared phrase;
the character set generates at least one prepared phrase, and the average value of the time points of characters in the prepared phrase is used as the time sequence time of the prepared phrase;
prepared phrases with adjacent time sequence times are taken as a test group, the prepared phrases in the test group are arranged in time order, the word rearrangement model is searched for a daily phrase pairing group consistent with the test group, if none exists, the test group is deleted, and if one exists, the test group is taken as a prepared phrase pairing group;
the character set generates at least one prepared phrase pairing group, the at least one prepared phrase pairing group forms a pairing set, and the time sequence time average value of the prepared phrases in the prepared phrase pairing group is used as the judgment time of the prepared phrase pairing group;
selecting the first prepared phrase pairing group with the smallest judgment time, deleting it from the pairing set, and selecting a second prepared phrase pairing group that meets a preset condition: the prepared phrase at the head end of the second prepared phrase pairing group is consistent with the prepared phrase at the tail end of the first prepared phrase pairing group;
after the head end of the second prepared phrase pairing group is removed, the second prepared phrase pairing group is docked to the tail end of the first prepared phrase pairing group, a new first prepared phrase pairing group is formed through fusion, and the second prepared phrase pairing group is deleted;
re-selecting a second prepared phrase pairing group, and repeating the previous step for fusion;
repeating the previous step until a second prepared phrase pairing group meeting the preset condition is not formed;
taking the first prepared phrase pairing group obtained by fusion as a first text time sequence set;
reselecting a first prepared phrase pairing group with the smallest judgment time and deleting it from the pairing set, so as to obtain a second text time sequence set;
the above steps are repeated until the paired set becomes an empty set.
9. A time-series-based conference audio analysis processing system for implementing the time-series-based conference audio analysis processing method according to any one of claims 1 to 8, comprising:
the voice signal processing module is used for acquiring conference audio;
the voice signal noise reduction module is used for carrying out noise reduction processing on the voice signal;
the voice signal preprocessing module is used for acquiring a voice signal and preprocessing the voice signal;
the voice signal correction module is used for removing the part of at least one basic sinusoidal signal outside the first characteristic range;
the voice signal separation module is used for separating out human voice;
the voice signal recognition module is used for carrying out voice recognition on human voice and generating characters;
the word processing module is used for classifying words generated by different people by the word rearrangement model to obtain at least one word time sequence set;
and the information management module is used for managing the information of the text time sequence set and generating a meeting summary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311586467.4A CN117316165B (en) | 2023-11-27 | 2023-11-27 | Conference audio analysis processing method and system based on time sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311586467.4A CN117316165B (en) | 2023-11-27 | 2023-11-27 | Conference audio analysis processing method and system based on time sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117316165A true CN117316165A (en) | 2023-12-29 |
CN117316165B CN117316165B (en) | 2024-02-20 |
Family
ID=89281394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311586467.4A Active CN117316165B (en) | 2023-11-27 | 2023-11-27 | Conference audio analysis processing method and system based on time sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117316165B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009156773A1 (en) * | 2008-06-27 | 2009-12-30 | Monting-I D.O.O. | Device and procedure for recognizing words or phrases and their meaning from digital free text content |
US20190392837A1 (en) * | 2018-06-22 | 2019-12-26 | Microsoft Technology Licensing, Llc | Use of voice recognition to generate a transcript of conversation(s) |
CN111326160A (en) * | 2020-03-11 | 2020-06-23 | 南京奥拓电子科技有限公司 | Speech recognition method, system and storage medium for correcting noise text |
CN113380234A (en) * | 2021-08-12 | 2021-09-10 | 明品云(北京)数据科技有限公司 | Method, device, equipment and medium for generating form based on voice recognition |
CN113505609A (en) * | 2021-05-28 | 2021-10-15 | 引智科技(深圳)有限公司 | One-key auxiliary translation method for multi-language conference and equipment with same |
US20220237379A1 (en) * | 2019-05-20 | 2022-07-28 | Samsung Electronics Co., Ltd. | Text reconstruction system and method thereof |
CN115273821A (en) * | 2022-07-13 | 2022-11-01 | 平顶山学院 | Speech recognition system based on semantic understanding of computer application scene |
Also Published As
Publication number | Publication date |
---|---|
CN117316165B (en) | 2024-02-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109065031B (en) | Voice labeling method, device and equipment | |
CN111524527B (en) | Speaker separation method, speaker separation device, electronic device and storage medium | |
US10515292B2 (en) | Joint acoustic and visual processing | |
CN108305632A (en) | A kind of the voice abstract forming method and system of meeting | |
CN111128223A (en) | Text information-based auxiliary speaker separation method and related device | |
WO2022100692A1 (en) | Human voice audio recording method and apparatus | |
JP2005049859A (en) | Method and device for automatically recognizing audio data | |
CN112633241B (en) | News story segmentation method based on multi-feature fusion and random forest model | |
CN111462758A (en) | Method, device and equipment for intelligent conference role classification and storage medium | |
CN112712824A (en) | Crowd information fused speech emotion recognition method and system | |
US7349477B2 (en) | Audio-assisted video segmentation and summarization | |
CN111488813B (en) | Video emotion marking method and device, electronic equipment and storage medium | |
EP4392972A1 (en) | Speaker-turn-based online speaker diarization with constrained spectral clustering | |
JP6208794B2 (en) | Conversation analyzer, method and computer program | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
CN112259085A (en) | Two-stage voice awakening algorithm based on model fusion framework | |
Venkatesan et al. | Automatic language identification using machine learning techniques | |
CN115150660A (en) | Video editing method based on subtitles and related equipment | |
CN117316165B (en) | Conference audio analysis processing method and system based on time sequence | |
CN114022923A (en) | Intelligent collecting and editing system | |
CN112927723A (en) | High-performance anti-noise speech emotion recognition method based on deep neural network | |
CN117151047A (en) | Conference summary generation method based on AI identification | |
Chou et al. | Bird species recognition by wavelet transformation of a section of birdsong | |
KR101369270B1 (en) | Method for analyzing video stream data using multi-channel analysis | |
CN115831124A (en) | Conference record role separation system and method based on voiceprint recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||