CN116597856A - Voice quality enhancement method based on frogman intercom - Google Patents
Voice quality enhancement method based on frogman intercom
- Publication number
- CN116597856A (application CN202310876048.8A)
- Authority
- CN
- China
- Prior art keywords
- voice
- noise
- frogman
- signal
- definition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q5/00—Selecting arrangements wherein two or more subscriber stations are connected by the same line to the exchange
- H04Q5/24—Selecting arrangements wherein two or more subscriber stations are connected by the same line to the exchange for two-party-line systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
- G10L2021/03643—Diver speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
A voice quality enhancement method based on a frogman intercom relates to the technical field of voice communication. The method acquires the voice characteristics of a frogman and records them into a voice library, establishes a bidirectional information transmission channel according to the frogman's voice recognition information, and determines the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel. An evaluation index of voice clarity is set accordingly. A desired voice clarity evaluation index for the frogman's communication target is preset, and the average window function length is dynamically adjusted according to a comparison of the voice clarity during communication with the desired evaluation index. A frequency threshold is set for the noisy voice based on the frequency masking characteristics of the human ear, the voice characteristics of the noisy voice filtered by the frequency threshold are acquired and input to a voice broadcasting terminal, and the terminal generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics, thereby remarkably improving voice quality in the frogman intercom scenario.
Description
Technical Field
The application relates to the technical field of voice communication, and in particular to a voice quality enhancement method based on frogman intercom.
Background
Frogman intercom refers to voice communication carried out under water. Because of the high density of water and the characteristics of sound propagation, underwater communication is often subject to many interferences, so the voice quality is poor and the communication effect is affected. Voice enhancement is an effective method for addressing this noise pollution: it extracts original voice that is as pure as possible from a noisy voice signal. In general, voice enhancement has two main aims: improving voice quality by eliminating background noise, so that the listener accepts the voice willingly and does not feel tired; and improving speech intelligibility, which helps the listener to understand.
In the prior art, when the noisy voice transmitted by a frogman is enhanced, the voice is converted into voice characteristics according to a fixed window function length. This ignores the complexity of the underwater communication process and the fact that noise signals become increasingly complex as the communication distance between frogmen grows. If the conversion still uses a fixed window function length, the converted voice characteristics cannot fully express the characteristics of the noisy voice. How to select different window function lengths so that the characteristics of the noisy voice are fully expressed as the communication distance between frogmen gradually increases is a problem that needs to be solved.
Disclosure of Invention
In order to solve the above technical problems, the application aims to provide a voice quality enhancement method based on frogman intercom, which comprises the following steps:
Step S1: acquiring pure voice sample data of multiple groups of frogmen, and the noisy voice sample data transmitted by those groups in a frogman intercom scenario, and recording the data into a voice library;
Step S2: generating the voice characteristics of each frogman from the pure voice sample data and the noisy voice sample data, acquiring the frogman's identity information, and binding the voice characteristics with the identity information to generate voice recognition information;
Step S3: when a frogman makes a voice communication request, establishing a bidirectional information transmission channel according to the frogman's voice recognition information, and determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel; setting an evaluation index of voice clarity according to the proportion of the noise signal in the noisy voice signal;
Step S4: presetting a desired voice clarity evaluation index for the frogman's communication target; dynamically adjusting the average window function length used for windowing when the noisy voice is converted into voice characteristics, according to a comparison of the voice clarity during communication with the desired evaluation index; setting a frequency threshold for the noisy voice based on the frequency masking characteristics of the human ear and filtering the frequency signals of the noisy voice; and acquiring the voice characteristics of the noisy voice filtered by the frequency threshold and inputting them to a voice broadcasting terminal, which generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics.
Further, the process of generating the voice characteristics of a frogman from the pure voice sample data and the noisy voice sample data includes:
converting the sample data into digital voice signals; obtaining, by means of big data, the average framing parameters, window function type, and average window function length of sample data in the frogman intercom scenario; dividing the digital voice signals into frames according to the average framing parameters and windowing each frame with the window function; converting each frame of the digital voice signal into a frequency signal through Fourier transformation, converting the frequency signal of each frame back into a time-domain waveform through inverse Fourier transformation, and marking the time-domain waveforms of the sample data as voice characteristics.
Further, when a frogman makes a voice communication request, the process of establishing the bidirectional information transmission channel according to the frogman's voice recognition information includes:
when the frogman communication platform receives a voice communication request from a frogman, determining the frogman's communication target according to the request, and establishing a bidirectional information transmission channel between the frogman and the communication target according to the frogman's voice recognition information, the frogman and the communication target transmitting voice through the bidirectional information transmission channel.
Further, the process of obtaining the voice transmission distance of the bidirectional information transmission channel includes:
each frogman carries a position signal generating device; real-time distance values between frogmen are determined according to the position information generated by the devices; the voice transmission distance of the bidirectional information transmission channel between a frogman and the frogman's communication target is determined according to the real-time distance value between them, and is recorded into the voice library in real time.
Further, the process of determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel includes:
acquiring, from historical intercom scenarios recorded in the voice library, the noisy voice sample data of multiple groups of frogmen transmitting voice through bidirectional information transmission channels, together with the voice transmission distances of those channels; the noisy voice comprises a pure voice signal and a noise signal;
performing data mining on the noise signals of noisy voice at different channel voice transmission distances, and constructing a noisy-voice probability distribution function model that takes the noise signal and the channel voice transmission distance as input features and the distribution probability of the noise signal at different channel voice transmission distances as the output label; the proportion of the noise signal in the noisy voice at different channel voice transmission distances is then obtained from the noisy-voice probability distribution function model.
Further, the process of setting the evaluation index of voice clarity according to the proportion of the noise signal in the noisy voice signal includes:
using a big data method to obtain the voice clarity evaluation grades corresponding to the proportion of the noise signal in the noisy voice, the grades comprising pure, clear, average, and poor; setting index weights for the evaluation indexes, establishing an evaluation index matrix for voice clarity according to the grades corresponding to the noise proportion, establishing a noise proportion matrix according to the proportion of the noise signal in the noisy voice, and obtaining, through fuzzy comprehensive evaluation, a membership matrix of the noise proportion with respect to voice clarity;
the voice clarity corresponding to different proportions of the noise signal in the noisy voice is then obtained from the membership matrix and the index weights.
Further, the quantitative process of dynamically adjusting the average window function length used for windowing when the noisy voice is converted into voice characteristics, according to the comparison of the voice clarity during communication with the desired voice clarity evaluation index, includes:
presetting a desired voice clarity evaluation index for the frogman's communication target; acquiring the noisy frogman voice signal received by the communication target over the bidirectional information transmission channel between the frogman and the communication target, together with the channel's voice transmission distance; inputting the channel's voice transmission distance and the received noisy voice signal as input features into the noisy-voice probability distribution function model to obtain the proportion of the noise signal in the noisy voice signal at the current transmission distance, and obtaining the voice clarity evaluation index of the received signal from that proportion;
comparing the desired voice clarity evaluation index of the communication target with the voice clarity evaluation index of the noisy voice signal it received; when the two are inconsistent, acquiring the noise proportion corresponding to the desired evaluation index and the noise proportion of the received signal, and determining from their deviation (the noise proportion deviation value) the average window function length used when windowing the received noisy voice signal in the process of converting it into voice characteristics; the larger the proportion of the noise signal in the noisy voice signal, the shorter the average window function length.
Further, the process of setting the frequency threshold for the noisy voice based on the frequency masking characteristics of the human ear includes:
using big data to obtain the frequency masking characteristics of the human ear, comprising the highest and lowest sound frequencies perceivable by human ears; when the frequency signals are generated in step S2, filtering them with the highest and lowest sound frequencies as the upper and lower thresholds, removing frequency components above the upper threshold and below the lower threshold; converting the filtered frequency signals into time-domain waveforms through inverse Fourier transformation and marking them as voice characteristics; and inputting the voice characteristics into the voice broadcasting terminal, which generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics.
Compared with the prior art, the application has the following beneficial effects. In the prior art, noisy voice is converted according to a fixed window function length, which ignores the complexity of the underwater communication process and the fact that noise signals become increasingly complex as the communication distance between frogmen grows; a fixed window function length therefore yields voice characteristics that cannot fully express the characteristics of the noisy voice. The application performs data mining on the noise signals of noisy voice at different channel voice transmission distances in the voice library, constructs a noisy-voice probability distribution function model, and obtains the proportion of the noise signal in the noisy voice at different transmission distances. The average window function length used when converting the noisy voice transmitted by a frogman into voice characteristics is then dynamically adjusted according to that proportion, so that the converted voice characteristics fully express the characteristics of the noisy voice, remarkably improving the quality of frogman intercom.
Drawings
Fig. 1 is a schematic diagram of a voice quality enhancement method based on frogman intercom according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As shown in fig. 1, the voice quality enhancement method based on frogman intercom comprises the following steps:
Step S1: acquiring pure voice sample data of multiple groups of frogmen, and the noisy voice sample data transmitted by those groups in a frogman intercom scenario, and recording the data into a voice library;
Step S2: generating the voice characteristics of each frogman from the pure voice sample data and the noisy voice sample data, acquiring the frogman's identity information, and binding the voice characteristics with the identity information to generate voice recognition information;
Step S3: when a frogman makes a voice communication request, establishing a bidirectional information transmission channel according to the frogman's voice recognition information, and determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel; setting an evaluation index of voice clarity according to the proportion of the noise signal in the noisy voice signal;
Step S4: presetting a desired voice clarity evaluation index for the frogman's communication target; dynamically adjusting the average window function length used for windowing when the noisy voice is converted into voice characteristics, according to a comparison of the voice clarity during communication with the desired evaluation index; setting a frequency threshold for the noisy voice based on the frequency masking characteristics of the human ear and filtering the frequency signals of the noisy voice; and acquiring the voice characteristics of the noisy voice filtered by the frequency threshold and inputting them to a voice broadcasting terminal, which generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics.
It should be further noted that, in the implementation process, the process of generating the voice characteristics of a frogman from the pure voice sample data and the noisy voice sample data includes:
converting the sample data into digital voice signals; obtaining, by means of big data, the average framing parameters, window function type, and average window function length of sample data in the frogman intercom scenario; dividing the digital voice signals into frames according to the average framing parameters and windowing each frame with the window function; converting each frame of the digital voice signal into a frequency signal through Fourier transformation, converting the frequency signal of each frame back into a time-domain waveform through inverse Fourier transformation, and marking the time-domain waveforms of the sample data as voice characteristics.
It should be further noted that, in the implementation process, the process of constructing a voice enhancement model representing the mapping relationship between noisy voice and pure voice based on a deep neural network includes:
constructing a voice enhancement model based on an RBF neural network, and using the voice characteristics of the pure voice sample data of multiple groups of frogmen in the voice library, together with the voice characteristics of the noisy voice sample data of those groups in the frogman intercom scenario, as training and test sets for learning and training the model in real time. Each audio file comprises single-segment and multi-segment voice; data were collected at a sampling rate of 16000 Hz during recording, with CoolEdit Pro used as an auxiliary tool to manually mark the start and end points of the pure voice samples as the voice detection standard. To obtain noisy voice, the noisy voice of 50 target persons during frogman operations and the voice transmission distances of their bidirectional information transmission channels were acquired, giving 1000 groups of voice sample data in total; 950 groups were used as the training set and the remaining 50 groups as the test set, and the voice enhancement model was trained until its loss function was stable, after which the model parameters were saved.
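The patent names an RBF neural network for the voice enhancement model but does not specify its architecture or training procedure. A minimal sketch of RBF-network regression, with Gaussian kernels and linear output weights solved by least squares (the kernel width, centers, and toy data are all illustrative), might look like:

```python
import numpy as np

def train_rbf(X, y, centers, sigma=0.05):
    """Solve the linear output weights of a Gaussian-kernel RBF network
    by least squares; sigma and the centers are illustrative choices."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * sigma ** 2))  # hidden-layer activations
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def rbf_predict(X, centers, w, sigma=0.05):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)) @ w

# toy 1-D regression; with real data the 950/50 train/test split
# described above would be applied to the voice-feature vectors
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
w = train_rbf(X, y, centers=X)
```

Using every training point as a center, as here, makes the network interpolate the training data; a real model would select fewer centers (e.g. by clustering) to generalize.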
It should be further noted that, in the implementation process, a frogman communication platform is established, and when a frogman makes a voice communication request, the process of establishing the bidirectional information transmission channel according to the frogman's voice recognition information includes:
when the frogman communication platform receives a voice communication request from a frogman, determining the frogman's communication target according to the request, and establishing a bidirectional information transmission channel between the frogman and the communication target according to the frogman's voice recognition information, the frogman and the communication target transmitting voice through the bidirectional information transmission channel.
It should be further noted that, in the implementation process, the process of obtaining the voice transmission distance of the bidirectional information transmission channel includes:
each frogman carries a position signal generating device; real-time distance values between frogmen are determined according to the position information generated by the devices; the voice transmission distance of the bidirectional information transmission channel between a frogman and the frogman's communication target is determined according to the real-time distance value between them, and is recorded into the voice library in real time.
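The real-time distance value between two frogmen follows directly from the position fixes produced by the position signal generating devices. The 3-D coordinate format (x, y, depth, in metres) is an assumption; the patent does not specify one.

```python
import math

def transmission_distance(pos_a, pos_b):
    """Straight-line voice transmission distance of the bidirectional
    channel between two divers' position fixes (x, y, depth), in metres."""
    return math.dist(pos_a, pos_b)

# two divers at the same depth, 30 m apart east-west and 40 m north-south
d = transmission_distance((0.0, 0.0, 5.0), (30.0, 40.0, 5.0))  # -> 50.0
```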
It should be further noted that, in the implementation process, the process of determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel includes:
acquiring, from the voice library, the noisy voice sample data transmitted by multiple groups of frogmen through bidirectional information transmission channels in the frogman intercom scenario, together with the voice transmission distances of those channels; the noisy voice comprises a pure voice signal and a noise signal;
performing data mining on the noise signals of noisy voice at different channel voice transmission distances, and constructing a noisy-voice probability distribution function model that takes the noise signal and the channel voice transmission distance as input features and the distribution probability of the noise signal at different channel voice transmission distances as the output label; the proportion of the noise signal in the noisy voice at different channel voice transmission distances is then obtained from the noisy-voice probability distribution function model.
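The patent does not give the functional form of the noisy-voice probability distribution function model. As a hedged stand-in, a simple polynomial fit over mined (distance, noise proportion) pairs illustrates the mapping from channel transmission distance to expected noise proportion; the sample values and the polynomial degree below are invented for illustration only.

```python
import numpy as np

def fit_noise_proportion_model(distances, noise_props, deg=2):
    """Fit a polynomial stand-in for the model mapping channel
    transmission distance to the expected noise proportion."""
    coeffs = np.polyfit(distances, noise_props, deg)
    def model(d):
        # clip so the output stays a valid proportion in [0, 1]
        return float(np.clip(np.polyval(coeffs, d), 0.0, 1.0))
    return model

# invented historical samples: noise proportion grows with distance
dist = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
prop = np.array([0.05, 0.10, 0.22, 0.45, 0.80])
model = fit_noise_proportion_model(dist, prop)
```

The fitted `model` plays the role of the probability distribution function model in the later steps: given the current channel distance, it returns the expected proportion of noise in the received signal.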
further, the process of setting the evaluation index of the speech intelligibility according to the proportion of the noise signal in the noise-containing speech signal includes:
acquiring an evaluation index of voice definition corresponding to the proportion of the noise signal in the noise-containing voice by using a big data method, wherein the evaluation index comprises purity, definition, general and poor; setting index weight of an evaluation index, establishing an evaluation index matrix about voice definition according to an evaluation index of voice definition corresponding to the proportion of a noise signal in noise-containing voice, establishing a noise proportion matrix according to the proportion of the noise signal in the noise-containing voice, and acquiring a membership matrix of the proportion of the noise signal in the noise-containing voice to the voice definition through fuzzy comprehensive evaluation;
and obtaining the voice definition corresponding to different proportions of the noise signals in the noise-containing voice according to the membership degree matrix and the index weight.
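The fuzzy comprehensive evaluation step combines the membership matrix (indicators × grades) with the index weights and selects the grade with the highest composite membership. A minimal sketch, with invented membership values and weights:

```python
import numpy as np

GRADES = ["pure", "clear", "average", "poor"]

def fuzzy_evaluate(membership, weights):
    """Composite membership B = w . R over the four clarity grades;
    returns the winning grade and the score vector."""
    scores = np.asarray(weights) @ np.asarray(membership)
    return GRADES[int(np.argmax(scores))], scores

# invented example: two indicators (noise proportion, transmission
# distance), each with membership in the four grades (rows sum to 1)
R = [[0.1, 0.6, 0.2, 0.1],
     [0.0, 0.5, 0.3, 0.2]]
w = [0.7, 0.3]
grade, scores = fuzzy_evaluate(R, w)  # grade -> "clear"
```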
It should be further noted that, in the implementation process: the quantization process for dynamically adjusting the length of the average window function in the process of windowing when the noisy speech is converted into the speech characteristics according to the comparison result of the speech definition in the communication process and the expected speech definition evaluation index comprises the following steps:
presetting an expected voice definition evaluation index of a frogman communication target, and acquiring a frogman noisy voice signal received by the frogman communication target in the two-way information transmission channel between the frogman and the frogman communication target and a voice transmission distance of the two-way information transmission channel; inputting the voice transmission distance of the two-way information transmission channel and the received frogman noisy voice signal as input characteristics into a noisy voice probability distribution function model to obtain the proportion of the noise signal of the current two-way information transmission channel voice transmission distance in the noisy voice signal, and obtaining a voice definition evaluation index of the frogman noisy voice signal according to the proportion of the noise signal in the frogman noisy voice signal in the noisy voice;
comparing the expected voice definition evaluation index of the frogman communication target with the voice definition evaluation index of the noise-containing voice signal received by the frogman communication target; when the two are inconsistent, acquiring the proportion of the noise signal corresponding to the expected voice definition evaluation index and the proportion of the noise signal in the received noise-containing voice signal, calculating the noise proportion deviation value between the two proportions, and determining, according to the noise proportion deviation value, the average window function length used for windowing when the received noise-containing voice signal is converted into voice features; the larger the proportion of the noise signal in the noise-containing voice signal, the shorter the average window function length.
By dynamically adjusting, according to the proportion of the noise signal in the noise-containing voice at different voice transmission distances of the two-way information transmission channels, the average window function length used when converting the noise-containing voice transmitted by the frogman into voice features, the characteristics of the noise-containing voice can be expressed completely, thereby obviously improving voice quality in the frogman intercom scene.
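The monotonic rule above (the larger the noise proportion, the shorter the window) can be sketched as follows. The linear mapping and the 128-to-1024-sample bounds are illustrative assumptions; the patent states only the qualitative relationship.

```python
import math

def window_length(noise_ratio: float,
                  min_len: int = 128, max_len: int = 1024) -> int:
    """Map the noise proportion (0..1) to an average window length.
    More noise -> shorter window, so non-stationary noise is tracked
    more finely. The linear mapping is an illustrative assumption."""
    noise_ratio = min(max(noise_ratio, 0.0), 1.0)
    return int(max_len - (max_len - min_len) * noise_ratio)

def hann_window(n: int) -> list:
    """Hann window of length n, applied frame-by-frame before the FFT."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

# A noisier channel yields a shorter analysis window.
assert window_length(0.8) < window_length(0.2)
frame_window = hann_window(window_length(0.5))
```

Each incoming frame would then be multiplied by `frame_window` before the Fourier transformation of step S2.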
It should be further noted that, in the implementation process, the process of setting the frequency threshold for the noise-containing voice based on the human ear sound frequency masking characteristic comprises:
acquiring the human ear sound frequency masking characteristic by using big data, wherein the characteristic comprises the highest and lowest sound frequencies perceivable by human ears; when the frequency signals are generated in step S2, filtering them with the highest and lowest sound frequencies as the upper and lower thresholds, removing frequency signals above the upper threshold and below the lower threshold, converting the filtered frequency signals into time-domain waveforms through inverse Fourier transform, marking the time-domain waveforms as voice features, inputting the voice features into the voice enhancement model, and generating, by the voice broadcasting terminal, voice information that meets the voice definition of the communication target expected by the frogman according to the voice features.
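The frequency-threshold filtering step can be sketched as follows, assuming the commonly cited 20 Hz to 20 kHz limits of human hearing (the patent leaves the thresholds to be mined from big data): spectral components outside the band are zeroed and the time-domain waveform is recovered with the inverse FFT.

```python
import numpy as np

def band_limit(signal: np.ndarray, fs: float,
               f_low: float = 20.0, f_high: float = 20000.0) -> np.ndarray:
    """Zero out spectral components outside [f_low, f_high] and return
    the filtered time-domain waveform via the inverse FFT. The 20 Hz /
    20 kHz defaults are the usual human-hearing assumption."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= f_low) & (freqs <= f_high)
    return np.fft.irfft(spectrum * mask, n=len(signal))

# Usage: a 1 kHz tone survives; a 30 kHz (inaudible) component is removed.
fs = 96000
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 30000 * t)
y = band_limit(x, fs)
```

In the described pipeline this band-limited waveform is what gets marked as the voice feature and passed on to the voice broadcasting terminal.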
The above embodiments are intended only to illustrate the technical method of the present application, not to limit it; those skilled in the art should understand that the technical method of the present application may be modified or equivalently substituted without departing from its spirit and scope.
Claims (8)
1. A voice quality enhancement method based on frogman intercom, characterized by comprising the following steps:
step S1: acquiring pure voice sample data of a plurality of groups of frogmen and noise-containing voice sample data transmitted by the groups of frogmen in a frogman intercom scene, and recording the data into a voice library;
step S2: generating voice characteristics of the frogman according to the pure voice sample data and the noise-containing voice sample data, acquiring identity information of the frogman, and binding the voice characteristics of the frogman with the identity information to generate voice recognition information;
step S3: when a frogman puts forward a voice communication request, establishing a two-way information transmission channel according to the voice recognition information of the frogman, determining, according to the distance of the two-way information transmission channel, the proportion of the noise signal in the noise-containing voice signal at different voice transmission distances, and setting the evaluation index of voice definition according to the proportion of the noise signal in the noise-containing voice signal;
step S4: presetting an expected voice definition evaluation index of the frogman communication target, dynamically adjusting the average window function length used for windowing when the noise-containing voice is converted into voice features according to the comparison result between the voice definition in the communication process and the expected voice definition evaluation index, setting a frequency threshold for the noise-containing voice based on the human ear sound frequency masking characteristic, filtering the frequency signal of the noise-containing voice, obtaining the voice features of the noise-containing voice filtered by the frequency threshold, inputting the voice features into the voice broadcasting terminal, and generating, by the voice broadcasting terminal according to the voice features, voice information that accords with the voice definition expected by the frogman for the communication target.
2. The method of claim 1, wherein generating voice features of the frogman based on the clean voice sample data and the noisy voice sample data comprises:
the method comprises the steps of converting the sample data into digital voice signals, obtaining by using big data the average framing parameters, window function type and average window function length of sample data in the frogman intercom scene, dividing the digital voice signals into a plurality of frames according to the average framing parameters, windowing each frame according to the window function, converting the digital voice signal of each frame into a frequency signal through Fourier transformation, converting the frequency signal of each frame into a time-domain waveform through inverse Fourier transformation, and marking the time-domain waveforms of the sample data as voice features.
3. The method of claim 2, wherein the step of establishing a two-way information transmission channel based on the voice recognition information of the frogman when the frogman makes a voice communication request comprises the steps of:
when the frogman communication platform receives a frogman voice communication request, determining a frogman communication target according to the frogman voice communication request, and establishing a bidirectional information transmission channel between the frogman and the frogman communication target according to the frogman voice recognition information, wherein the frogman and the frogman communication target transmit voice through the bidirectional information transmission channel.
4. The method of claim 3, wherein the process of obtaining the voice transmission distance of the two-way information transmission channel comprises:
each frogman carries a position signal generating device; determining real-time distance values among the frogmen according to the position information generated by the position signal generating devices; and determining the voice transmission distance of the two-way information transmission channel between the frogman and the frogman communication target according to the real-time distance values between the frogmen, and recording the voice transmission distance of the two-way information transmission channel into the voice library in real time.
5. The method of claim 4, wherein determining the proportion of the noise signals with different voice transmission distances in the noise-containing voice signals according to the distance between the two-way information transmission channels comprises:
acquiring, from the voice library, noise-containing voice sample data of the voice transmission processes of a plurality of groups of frogmen through the two-way information transmission channel, together with the voice transmission distances of the two-way information transmission channel, under historical intercom scenes of a plurality of frogmen, wherein the noise-containing voice comprises a pure voice signal and a noise signal;
carrying out data mining on the noise signals of the noise-containing voice at the voice transmission distances of different two-way information transmission channels, and constructing a noise-containing voice probability distribution function model by taking the noise signals and the voice transmission distances of the two-way information transmission channels as input features and taking the distribution probability of the noise signals at the voice transmission distances of different two-way information transmission channels as output labels; and acquiring, according to the noise-containing voice probability distribution function model, the proportion of the noise signal in the noise-containing voice at different voice transmission distances of the two-way information transmission channels.
6. The method of claim 5, wherein setting the evaluation index of voice definition according to the proportion of the noise signal in the noise-containing voice signal comprises:
acquiring, by using a big data method, the evaluation index of voice definition corresponding to the proportion of the noise signal in the noise-containing voice, wherein the evaluation indexes comprise pure, clear, average and poor; setting an index weight for each evaluation index, establishing an evaluation index matrix of voice definition according to the evaluation index of voice definition corresponding to the proportion of the noise signal in the noise-containing voice, establishing a noise proportion matrix according to the proportion of the noise signal in the noise-containing voice, and acquiring, through fuzzy comprehensive evaluation, the membership degree matrix of the proportion of the noise signal in the noise-containing voice with respect to voice definition;
and obtaining the voice definitions corresponding to different proportions of the noise signal in the noise-containing voice according to the membership degree matrix and the index weights.
7. The method of claim 6, wherein dynamically adjusting the average window function length used for windowing when the noise-containing voice is converted into voice features, according to the comparison result between the voice definition in the communication process and the expected voice definition evaluation index, comprises:
presetting an expected voice definition evaluation index of the frogman communication target, and acquiring the noise-containing voice signal received by the frogman communication target over the two-way information transmission channel between the frogman and the frogman communication target, together with the voice transmission distance of that channel; inputting the voice transmission distance of the two-way information transmission channel and the received noise-containing voice signal as input features into the noise-containing voice probability distribution function model to obtain the proportion of the noise signal in the noise-containing voice signal at the current voice transmission distance, and obtaining the voice definition evaluation index of the noise-containing voice signal according to that proportion;
comparing the expected voice definition evaluation index of the frogman communication target with the voice definition evaluation index of the noise-containing voice signal received by the frogman communication target; when the two are inconsistent, acquiring the proportion of the noise signal corresponding to the expected voice definition evaluation index and the proportion of the noise signal in the received noise-containing voice signal, calculating the noise proportion deviation value between the two proportions, and determining, according to the noise proportion deviation value, the average window function length used for windowing when the received noise-containing voice signal is converted into voice features; the larger the proportion of the noise signal in the noise-containing voice signal, the shorter the average window function length.
8. The method of claim 7, wherein setting the frequency threshold for the noise-containing voice based on the human ear sound frequency masking characteristic comprises:
acquiring the human ear sound frequency masking characteristic by using big data, wherein the characteristic comprises the highest and lowest sound frequencies perceivable by human ears; when the frequency signals are generated in step S2, filtering them with the highest and lowest sound frequencies as the upper and lower thresholds, removing frequency signals above the upper threshold and below the lower threshold, converting the filtered frequency signals into time-domain waveforms through inverse Fourier transform, and marking the time-domain waveforms as voice features to be input to the voice broadcasting terminal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310876048.8A CN116597856B (en) | 2023-07-18 | 2023-07-18 | Voice quality enhancement method based on frogman intercom |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116597856A true CN116597856A (en) | 2023-08-15 |
CN116597856B CN116597856B (en) | 2023-09-22 |
Family
ID=87599531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310876048.8A Active CN116597856B (en) | 2023-07-18 | 2023-07-18 | Voice quality enhancement method based on frogman intercom |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597856B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117496953A (en) * | 2023-12-29 | 2024-02-02 | 山东贝宁电子科技开发有限公司 | Frog voice processing method based on voice enhancement technology |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050971A (en) * | 2013-03-15 | 2014-09-17 | 杜比实验室特许公司 | Acoustic echo mitigating apparatus and method, audio processing apparatus, and voice communication terminal |
WO2015005914A1 (en) * | 2013-07-10 | 2015-01-15 | Nuance Communications, Inc. | Methods and apparatus for dynamic low frequency noise suppression |
CN104704560A (en) * | 2012-09-04 | 2015-06-10 | 纽昂斯通讯公司 | Formant dependent speech signal enhancement |
CN111968630A (en) * | 2019-05-20 | 2020-11-20 | 北京字节跳动网络技术有限公司 | Information processing method and device and electronic equipment |
CN112102846A (en) * | 2020-09-04 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Audio processing method and device, electronic equipment and storage medium |
WO2021258832A1 (en) * | 2020-06-23 | 2021-12-30 | 青岛科技大学 | Method for denoising underwater acoustic signal on the basis of adaptive window filtering and wavelet threshold optimization |
CN114822584A (en) * | 2022-04-25 | 2022-07-29 | 东北大学 | Transmission device signal separation method based on integral improved generalized cross-correlation |
Non-Patent Citations (2)
Title |
---|
YUMA KOIZUMI ET AL.: "Trainable Adaptive Window Switching for Speech Enhancement", ICASSP 2019 *
WU Yifei; LI Yuwei: "Research on the Performance of M-Sequence Signals in Active Sonar", Ship Electronic Engineering, no. 10 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107393542B (en) | Bird species identification method based on two-channel neural network | |
CN116597856B (en) | Voice quality enhancement method based on frogman intercom | |
DE602004003443T2 (en) | Speech period detection based on electromyography | |
CN105469785A (en) | Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof | |
CN108346434B (en) | Voice quality assessment method and device | |
DE112017007005B4 (en) | ACOUSTIC SIGNAL PROCESSING DEVICE, ACOUSTIC SIGNAL PROCESSING METHOD AND HANDS-FREE COMMUNICATION DEVICE | |
CN1312938A (en) | System and method for reducing noise | |
CN112767963A (en) | Voice enhancement method, device and system and computer readable storage medium | |
DE60127550T2 (en) | METHOD AND SYSTEM FOR ADAPTIVE DISTRIBUTED LANGUAGE RECOGNITION | |
CN114338623B (en) | Audio processing method, device, equipment and medium | |
CN111710344A (en) | Signal processing method, device, equipment and computer readable storage medium | |
CN1049062C (en) | Method of converting speech | |
DE60300267T2 (en) | Method and device for multi-reference correction of the spectral speech distortions caused by a communication network | |
EP0644674A2 (en) | Method of transmission quality estimation of a speech transmission link | |
CN111341331B (en) | Voice enhancement method, device and medium based on local attention mechanism | |
DE102012102882A1 (en) | An electrical device and method for receiving voiced voice signals therefor | |
CN105635453A (en) | Conversation volume automatic adjusting method and system, vehicle-mounted device, and automobile | |
CN103474067A (en) | Voice signal transmission method and system | |
US20230186943A1 (en) | Voice activity detection method and apparatus, and storage medium | |
CN111341351A (en) | Voice activity detection method and device based on self-attention mechanism and storage medium | |
CN116798434A (en) | Communication enhancement method, system and storage medium based on voice characteristics | |
CN113411456B (en) | Voice quality assessment method and device based on voice recognition | |
DE3875894T2 (en) | ADAPTIVE MULTIVARIABLE ANALYSIS DEVICE. | |
CN114827363A (en) | Method, device and readable storage medium for eliminating echo in call process | |
CN113709625A (en) | Self-adaptive volume adjusting method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||