CN116597856A - Voice quality enhancement method based on frogman intercom - Google Patents
Voice quality enhancement method based on frogman intercom
- Publication number
- CN116597856A (application CN202310876048.8A)
- Authority
- CN
- China
- Prior art keywords
- voice
- noise
- frogman
- signal
- definition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q5/00—Selecting arrangements wherein two or more subscriber stations are connected by the same line to the exchange
- H04Q5/24—Selecting arrangements wherein two or more subscriber stations are connected by the same line to the exchange for two-party-line systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
- G10L2021/03643—Diver speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
A voice quality enhancement method based on a frogman intercom relates to the technical field of voice communication. The method acquires the voice characteristics of a frogman and records them into a voice library, establishes a bidirectional information transmission channel according to the frogman's voice recognition information, and determines the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel. An evaluation index of voice clarity is set accordingly. A desired voice clarity evaluation index for the frogman's communication target is preset, and the average window function length is dynamically adjusted according to a comparison of the voice clarity during communication with the desired evaluation index. A frequency threshold is set for the noisy voice based on the frequency masking characteristics of the human ear, the voice characteristics of the noisy voice filtered by the frequency threshold are acquired and input to a voice broadcasting terminal, and the terminal generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics, thereby remarkably improving voice quality in the frogman intercom scenario.
Description
Technical Field
The application relates to the technical field of voice communication, and in particular to a voice quality enhancement method based on frogman intercom.
Background
Frogman intercom refers to voice communication carried out under water. Because of the high density of water and the characteristics of sound propagation, underwater communication is often subject to many interferences, so the voice quality is poor and the communication effect is affected. Voice enhancement is an effective method for addressing this noise pollution: it extracts original voice that is as pure as possible from a noisy voice signal. In general, voice enhancement has two main aims: improving voice quality by eliminating background noise, so that the listener accepts the voice willingly and does not feel tired; and improving speech intelligibility, which helps the listener to understand.
In the prior art, when the noisy voice transmitted by a frogman is enhanced, the voice is converted into voice characteristics according to a fixed window function length. This ignores the complexity of the underwater communication process and the fact that noise signals become increasingly complex as the communication distance between frogmen grows. If the conversion still uses a fixed window function length, the converted voice characteristics cannot fully express the characteristics of the noisy voice. How to select different window function lengths so that the characteristics of the noisy voice are fully expressed as the communication distance between frogmen gradually increases is a problem that needs to be solved.
Disclosure of Invention
In order to solve the above technical problems, the application aims to provide a voice quality enhancement method based on frogman intercom, which comprises the following steps:
Step S1: acquiring pure voice sample data of multiple groups of frogmen, and the noisy voice sample data transmitted by those groups in a frogman intercom scenario, and recording the data into a voice library;
Step S2: generating the voice characteristics of each frogman from the pure voice sample data and the noisy voice sample data, acquiring the frogman's identity information, and binding the voice characteristics with the identity information to generate voice recognition information;
Step S3: when a frogman makes a voice communication request, establishing a bidirectional information transmission channel according to the frogman's voice recognition information, and determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel; setting an evaluation index of voice clarity according to the proportion of the noise signal in the noisy voice signal;
Step S4: presetting a desired voice clarity evaluation index for the frogman's communication target; dynamically adjusting the average window function length used for windowing when the noisy voice is converted into voice characteristics, according to a comparison of the voice clarity during communication with the desired evaluation index; setting a frequency threshold for the noisy voice based on the frequency masking characteristics of the human ear and filtering the frequency signals of the noisy voice; and acquiring the voice characteristics of the noisy voice filtered by the frequency threshold and inputting them to a voice broadcasting terminal, which generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics.
Further, the process of generating the voice characteristics of a frogman from the pure voice sample data and the noisy voice sample data includes:
converting the sample data into digital voice signals; obtaining, by means of big data, the average framing parameters, window function type, and average window function length of sample data in the frogman intercom scenario; dividing the digital voice signals into frames according to the average framing parameters and windowing each frame with the window function; converting each frame of the digital voice signal into a frequency signal through Fourier transformation, converting the frequency signal of each frame back into a time-domain waveform through inverse Fourier transformation, and marking the time-domain waveforms of the sample data as voice characteristics.
Further, when a frogman makes a voice communication request, the process of establishing the bidirectional information transmission channel according to the frogman's voice recognition information includes:
when the frogman communication platform receives a voice communication request from a frogman, determining the frogman's communication target according to the request, and establishing a bidirectional information transmission channel between the frogman and the communication target according to the frogman's voice recognition information, the frogman and the communication target transmitting voice through the bidirectional information transmission channel.
Further, the process of obtaining the voice transmission distance of the bidirectional information transmission channel includes:
each frogman carries a position signal generating device; real-time distance values between frogmen are determined according to the position information generated by the devices; the voice transmission distance of the bidirectional information transmission channel between a frogman and the frogman's communication target is determined according to the real-time distance value between them, and is recorded into the voice library in real time.
Further, the process of determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel includes:
acquiring, from historical intercom scenarios recorded in the voice library, the noisy voice sample data of multiple groups of frogmen transmitting voice through bidirectional information transmission channels, together with the voice transmission distances of those channels; the noisy voice comprises a pure voice signal and a noise signal;
performing data mining on the noise signals of noisy voice at different channel voice transmission distances, and constructing a noisy-voice probability distribution function model that takes the noise signal and the channel voice transmission distance as input features and the distribution probability of the noise signal at different channel voice transmission distances as the output label; the proportion of the noise signal in the noisy voice at different channel voice transmission distances is then obtained from the noisy-voice probability distribution function model.
Further, the process of setting the evaluation index of voice clarity according to the proportion of the noise signal in the noisy voice signal includes:
using a big data method to obtain the voice clarity evaluation grades corresponding to the proportion of the noise signal in the noisy voice, the grades comprising pure, clear, average, and poor; setting index weights for the evaluation indexes, establishing an evaluation index matrix for voice clarity according to the grades corresponding to the noise proportion, establishing a noise proportion matrix according to the proportion of the noise signal in the noisy voice, and obtaining, through fuzzy comprehensive evaluation, a membership matrix of the noise proportion with respect to voice clarity;
the voice clarity corresponding to different proportions of the noise signal in the noisy voice is then obtained from the membership matrix and the index weights.
Further, the quantitative process of dynamically adjusting the average window function length used for windowing when the noisy voice is converted into voice characteristics, according to the comparison of the voice clarity during communication with the desired voice clarity evaluation index, includes:
presetting a desired voice clarity evaluation index for the frogman's communication target; acquiring the noisy frogman voice signal received by the communication target over the bidirectional information transmission channel between the frogman and the communication target, together with the channel's voice transmission distance; inputting the channel's voice transmission distance and the received noisy voice signal as input features into the noisy-voice probability distribution function model to obtain the proportion of the noise signal in the noisy voice signal at the current transmission distance, and obtaining the voice clarity evaluation index of the received signal from that proportion;
comparing the desired voice clarity evaluation index of the communication target with the voice clarity evaluation index of the noisy voice signal it received; when the two are inconsistent, acquiring the noise proportion corresponding to the desired evaluation index and the noise proportion of the received signal, and determining from their deviation (the noise proportion deviation value) the average window function length used when windowing the received noisy voice signal in the process of converting it into voice characteristics; the larger the proportion of the noise signal in the noisy voice signal, the shorter the average window function length.
Further, the process of setting the frequency threshold for the noisy voice based on the frequency masking characteristics of the human ear includes:
using big data to obtain the frequency masking characteristics of the human ear, comprising the highest and lowest sound frequencies perceivable by human ears; when the frequency signals are generated in step S2, filtering them with the highest and lowest sound frequencies as the upper and lower thresholds, removing frequency components above the upper threshold and below the lower threshold; converting the filtered frequency signals into time-domain waveforms through inverse Fourier transformation and marking them as voice characteristics; and inputting the voice characteristics into the voice broadcasting terminal, which generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics.
Compared with the prior art, the application has the following beneficial effects. In the prior art, noisy voice is converted according to a fixed window function length, which ignores the complexity of the underwater communication process and the fact that noise signals become increasingly complex as the communication distance between frogmen grows; a fixed window function length therefore yields voice characteristics that cannot fully express the characteristics of the noisy voice. The application performs data mining on the noise signals of noisy voice at different channel voice transmission distances in the voice library, constructs a noisy-voice probability distribution function model, and obtains the proportion of the noise signal in the noisy voice at different transmission distances. The average window function length used when converting the noisy voice transmitted by a frogman into voice characteristics is then dynamically adjusted according to that proportion, so that the converted voice characteristics fully express the characteristics of the noisy voice, remarkably improving the quality of frogman intercom.
Drawings
Fig. 1 is a schematic diagram of a voice quality enhancement method based on frogman intercom according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As shown in fig. 1, the voice quality enhancement method based on frogman intercom comprises the following steps:
Step S1: acquiring pure voice sample data of multiple groups of frogmen, and the noisy voice sample data transmitted by those groups in a frogman intercom scenario, and recording the data into a voice library;
Step S2: generating the voice characteristics of each frogman from the pure voice sample data and the noisy voice sample data, acquiring the frogman's identity information, and binding the voice characteristics with the identity information to generate voice recognition information;
Step S3: when a frogman makes a voice communication request, establishing a bidirectional information transmission channel according to the frogman's voice recognition information, and determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel; setting an evaluation index of voice clarity according to the proportion of the noise signal in the noisy voice signal;
Step S4: presetting a desired voice clarity evaluation index for the frogman's communication target; dynamically adjusting the average window function length used for windowing when the noisy voice is converted into voice characteristics, according to a comparison of the voice clarity during communication with the desired evaluation index; setting a frequency threshold for the noisy voice based on the frequency masking characteristics of the human ear and filtering the frequency signals of the noisy voice; and acquiring the voice characteristics of the noisy voice filtered by the frequency threshold and inputting them to a voice broadcasting terminal, which generates voice information meeting the desired voice clarity of the frogman's communication target according to those characteristics.
It should be further noted that, in the implementation process, the process of generating the voice characteristics of a frogman from the pure voice sample data and the noisy voice sample data includes:
converting the sample data into digital voice signals; obtaining, by means of big data, the average framing parameters, window function type, and average window function length of sample data in the frogman intercom scenario; dividing the digital voice signals into frames according to the average framing parameters and windowing each frame with the window function; converting each frame of the digital voice signal into a frequency signal through Fourier transformation, converting the frequency signal of each frame back into a time-domain waveform through inverse Fourier transformation, and marking the time-domain waveforms of the sample data as voice characteristics.
It should be further noted that, in the implementation process, the process of constructing a voice enhancement model representing the mapping relationship between noisy voice and pure voice based on a deep neural network includes:
constructing a voice enhancement model based on an RBF neural network, and using the voice characteristics of the pure voice sample data of multiple groups of frogmen in the voice library, together with the voice characteristics of the noisy voice sample data of those groups in the frogman intercom scenario, as training and test sets for learning and training the model in real time. Each audio file comprises single-segment and multi-segment voice; data were collected at a sampling rate of 16000 Hz during recording, with CoolEdit Pro used as an auxiliary tool to manually mark the start and end points of the pure voice samples as the voice detection standard. To obtain noisy voice, the noisy voice of 50 target persons during frogman operations and the voice transmission distances of their bidirectional information transmission channels were acquired, giving 1000 groups of voice sample data in total; 950 groups were used as the training set and the remaining 50 groups as the test set, and the voice enhancement model was trained until its loss function was stable, after which the model parameters were saved.
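The patent names an RBF neural network for the voice enhancement model but does not specify its architecture or training procedure. A minimal sketch of RBF-network regression, with Gaussian kernels and linear output weights solved by least squares (the kernel width, centers, and toy data are all illustrative), might look like:

```python
import numpy as np

def train_rbf(X, y, centers, sigma=0.05):
    """Solve the linear output weights of a Gaussian-kernel RBF network
    by least squares; sigma and the centers are illustrative choices."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * sigma ** 2))  # hidden-layer activations
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def rbf_predict(X, centers, w, sigma=0.05):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)) @ w

# toy 1-D regression; with real data the 950/50 train/test split
# described above would be applied to the voice-feature vectors
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
w = train_rbf(X, y, centers=X)
```

Using every training point as a center, as here, makes the network interpolate the training data; a real model would select fewer centers (e.g. by clustering) to generalize.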
It should be further noted that, in the implementation process, a frogman communication platform is established, and when a frogman makes a voice communication request, the process of establishing the bidirectional information transmission channel according to the frogman's voice recognition information includes:
when the frogman communication platform receives a voice communication request from a frogman, determining the frogman's communication target according to the request, and establishing a bidirectional information transmission channel between the frogman and the communication target according to the frogman's voice recognition information, the frogman and the communication target transmitting voice through the bidirectional information transmission channel.
It should be further noted that, in the implementation process, the process of obtaining the voice transmission distance of the bidirectional information transmission channel includes:
each frogman carries a position signal generating device; real-time distance values between frogmen are determined according to the position information generated by the devices; the voice transmission distance of the bidirectional information transmission channel between a frogman and the frogman's communication target is determined according to the real-time distance value between them, and is recorded into the voice library in real time.
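The real-time distance value between two frogmen follows directly from the position fixes produced by the position signal generating devices. The 3-D coordinate format (x, y, depth, in metres) is an assumption; the patent does not specify one.

```python
import math

def transmission_distance(pos_a, pos_b):
    """Straight-line voice transmission distance of the bidirectional
    channel between two divers' position fixes (x, y, depth), in metres."""
    return math.dist(pos_a, pos_b)

# two divers at the same depth, 30 m apart east-west and 40 m north-south
d = transmission_distance((0.0, 0.0, 5.0), (30.0, 40.0, 5.0))  # -> 50.0
```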
It should be further noted that, in the implementation process, the process of determining the proportion of noise signals in the noisy voice signal at different voice transmission distances according to the distance of the bidirectional information transmission channel includes:
acquiring, from the voice library, the noisy voice sample data transmitted by multiple groups of frogmen through bidirectional information transmission channels in the frogman intercom scenario, together with the voice transmission distances of those channels; the noisy voice comprises a pure voice signal and a noise signal;
performing data mining on the noise signals of noisy voice at different channel voice transmission distances, and constructing a noisy-voice probability distribution function model that takes the noise signal and the channel voice transmission distance as input features and the distribution probability of the noise signal at different channel voice transmission distances as the output label; the proportion of the noise signal in the noisy voice at different channel voice transmission distances is then obtained from the noisy-voice probability distribution function model.
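The patent does not give the functional form of the noisy-voice probability distribution function model. As a hedged stand-in, a simple polynomial fit over mined (distance, noise proportion) pairs illustrates the mapping from channel transmission distance to expected noise proportion; the sample values and the polynomial degree below are invented for illustration only.

```python
import numpy as np

def fit_noise_proportion_model(distances, noise_props, deg=2):
    """Fit a polynomial stand-in for the model mapping channel
    transmission distance to the expected noise proportion."""
    coeffs = np.polyfit(distances, noise_props, deg)
    def model(d):
        # clip so the output stays a valid proportion in [0, 1]
        return float(np.clip(np.polyval(coeffs, d), 0.0, 1.0))
    return model

# invented historical samples: noise proportion grows with distance
dist = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
prop = np.array([0.05, 0.10, 0.22, 0.45, 0.80])
model = fit_noise_proportion_model(dist, prop)
```

The fitted `model` plays the role of the probability distribution function model in the later steps: given the current channel distance, it returns the expected proportion of noise in the received signal.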
further, the process of setting the evaluation index of the speech intelligibility according to the proportion of the noise signal in the noise-containing speech signal includes:
acquiring an evaluation index of voice definition corresponding to the proportion of the noise signal in the noise-containing voice by using a big data method, wherein the evaluation index comprises purity, definition, general and poor; setting index weight of an evaluation index, establishing an evaluation index matrix about voice definition according to an evaluation index of voice definition corresponding to the proportion of a noise signal in noise-containing voice, establishing a noise proportion matrix according to the proportion of the noise signal in the noise-containing voice, and acquiring a membership matrix of the proportion of the noise signal in the noise-containing voice to the voice definition through fuzzy comprehensive evaluation;
and obtaining the voice definition corresponding to different proportions of the noise signals in the noise-containing voice according to the membership degree matrix and the index weight.
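The fuzzy comprehensive evaluation step combines the membership matrix (indicators × grades) with the index weights and selects the grade with the highest composite membership. A minimal sketch, with invented membership values and weights:

```python
import numpy as np

GRADES = ["pure", "clear", "average", "poor"]

def fuzzy_evaluate(membership, weights):
    """Composite membership B = w . R over the four clarity grades;
    returns the winning grade and the score vector."""
    scores = np.asarray(weights) @ np.asarray(membership)
    return GRADES[int(np.argmax(scores))], scores

# invented example: two indicators (noise proportion, transmission
# distance), each with membership in the four grades (rows sum to 1)
R = [[0.1, 0.6, 0.2, 0.1],
     [0.0, 0.5, 0.3, 0.2]]
w = [0.7, 0.3]
grade, scores = fuzzy_evaluate(R, w)  # grade -> "clear"
```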
It should be further noted that, in the implementation process: the quantization process for dynamically adjusting the length of the average window function in the process of windowing when the noisy speech is converted into the speech characteristics according to the comparison result of the speech definition in the communication process and the expected speech definition evaluation index comprises the following steps:
presetting an expected voice definition evaluation index of a frogman communication target, and acquiring a frogman noisy voice signal received by the frogman communication target in the two-way information transmission channel between the frogman and the frogman communication target and a voice transmission distance of the two-way information transmission channel; inputting the voice transmission distance of the two-way information transmission channel and the received frogman noisy voice signal as input characteristics into a noisy voice probability distribution function model to obtain the proportion of the noise signal of the current two-way information transmission channel voice transmission distance in the noisy voice signal, and obtaining a voice definition evaluation index of the frogman noisy voice signal according to the proportion of the noise signal in the frogman noisy voice signal in the noisy voice;
comparing the expected voice definition evaluation index of the frogman communication target with the voice definition evaluation index of the noise-containing voice signal received by the frogman communication target; when the two are inconsistent, acquiring the proportion of the noise signal corresponding to the expected voice definition evaluation index and the proportion of the noise signal in the received noise-containing voice signal, calculating the noise proportion deviation value between the two proportions, and determining, according to the noise proportion deviation value, the average window function length used for windowing when the received noise-containing voice signal is converted into voice features; the larger the proportion of the noise signal in the noise-containing voice signal, the shorter the average window function length.
By dynamically adjusting, according to the proportion of the noise signal in the noise-containing voice at different voice transmission distances of the two-way information transmission channels, the average window function length used when converting the noise-containing voice transmitted by the frogman into voice features, the characteristics of the noise-containing voice can be expressed completely, thereby obviously improving voice quality in the frogman intercom scene.
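The monotonic rule above (the larger the noise proportion, the shorter the window) can be sketched as follows. The linear mapping and the 128-to-1024-sample bounds are illustrative assumptions; the patent states only the qualitative relationship.

```python
import math

def window_length(noise_ratio: float,
                  min_len: int = 128, max_len: int = 1024) -> int:
    """Map the noise proportion (0..1) to an average window length.
    More noise -> shorter window, so non-stationary noise is tracked
    more finely. The linear mapping is an illustrative assumption."""
    noise_ratio = min(max(noise_ratio, 0.0), 1.0)
    return int(max_len - (max_len - min_len) * noise_ratio)

def hann_window(n: int) -> list:
    """Hann window of length n, applied frame-by-frame before the FFT."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

# A noisier channel yields a shorter analysis window.
assert window_length(0.8) < window_length(0.2)
frame_window = hann_window(window_length(0.5))
```

Each incoming frame would then be multiplied by `frame_window` before the Fourier transformation of step S2.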
It should be further noted that, in the implementation process, the process of setting the frequency threshold for the noise-containing voice based on the human ear sound frequency masking characteristic comprises:
acquiring the human ear sound frequency masking characteristic by using big data, wherein the characteristic comprises the highest and lowest sound frequencies perceivable by human ears; when the frequency signals are generated in step S2, filtering them with the highest and lowest sound frequencies as the upper and lower thresholds, removing frequency signals above the upper threshold and below the lower threshold, converting the filtered frequency signals into time-domain waveforms through inverse Fourier transform, marking the time-domain waveforms as voice features, inputting the voice features into the voice enhancement model, and generating, by the voice broadcasting terminal, voice information that meets the voice definition of the communication target expected by the frogman according to the voice features.
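The frequency-threshold filtering step can be sketched as follows, assuming the commonly cited 20 Hz to 20 kHz limits of human hearing (the patent leaves the thresholds to be mined from big data): spectral components outside the band are zeroed and the time-domain waveform is recovered with the inverse FFT.

```python
import numpy as np

def band_limit(signal: np.ndarray, fs: float,
               f_low: float = 20.0, f_high: float = 20000.0) -> np.ndarray:
    """Zero out spectral components outside [f_low, f_high] and return
    the filtered time-domain waveform via the inverse FFT. The 20 Hz /
    20 kHz defaults are the usual human-hearing assumption."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= f_low) & (freqs <= f_high)
    return np.fft.irfft(spectrum * mask, n=len(signal))

# Usage: a 1 kHz tone survives; a 30 kHz (inaudible) component is removed.
fs = 96000
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 30000 * t)
y = band_limit(x, fs)
```

In the described pipeline this band-limited waveform is what gets marked as the voice feature and passed on to the voice broadcasting terminal.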
The above embodiments are intended only to illustrate the technical method of the present application, not to limit it; those skilled in the art should understand that the technical method of the present application may be modified or equivalently substituted without departing from its spirit and scope.
Claims (8)
1. A voice quality enhancement method based on frogman intercom, characterized by comprising the following steps:
step S1: acquiring pure voice sample data of a plurality of groups of frogmen and noise-containing voice sample data transmitted by the groups of frogmen in a frogman intercom scene, and recording the data into a voice library;
step S2: generating voice characteristics of the frogman according to the pure voice sample data and the noise-containing voice sample data, acquiring identity information of the frogman, and binding the voice characteristics of the frogman with the identity information to generate voice recognition information;
step S3: when a frogman puts forward a voice communication request, establishing a two-way information transmission channel according to the voice recognition information of the frogman, determining, according to the distance of the two-way information transmission channel, the proportion of the noise signal in the noise-containing voice signal at different voice transmission distances, and setting the evaluation index of voice definition according to the proportion of the noise signal in the noise-containing voice signal;
step S4: presetting an expected voice definition evaluation index of the frogman communication target, dynamically adjusting the average window function length used for windowing when the noise-containing voice is converted into voice features according to the comparison result between the voice definition in the communication process and the expected voice definition evaluation index, setting a frequency threshold for the noise-containing voice based on the human ear sound frequency masking characteristic, filtering the frequency signal of the noise-containing voice, obtaining the voice features of the noise-containing voice filtered by the frequency threshold, inputting the voice features into the voice broadcasting terminal, and generating, by the voice broadcasting terminal according to the voice features, voice information that accords with the voice definition expected by the frogman for the communication target.
2. The method of claim 1, wherein generating voice features of the frogman based on the clean voice sample data and the noisy voice sample data comprises:
the method comprises the steps of converting the sample data into digital voice signals, obtaining by using big data the average framing parameters, window function type and average window function length of sample data in the frogman intercom scene, dividing the digital voice signals into a plurality of frames according to the average framing parameters, windowing each frame according to the window function, converting the digital voice signal of each frame into a frequency signal through Fourier transformation, converting the frequency signal of each frame into a time-domain waveform through inverse Fourier transformation, and marking the time-domain waveforms of the sample data as voice features.
3. The method of claim 2, wherein the step of establishing a two-way information transmission channel based on the voice recognition information of the frogman when the frogman makes a voice communication request comprises the steps of:
when the frogman communication platform receives a frogman voice communication request, determining a frogman communication target according to the frogman voice communication request, and establishing a bidirectional information transmission channel between the frogman and the frogman communication target according to the frogman voice recognition information, wherein the frogman and the frogman communication target transmit voice through the bidirectional information transmission channel.
4. The method of claim 3, wherein the process of obtaining the voice transmission distance of the two-way information transmission channel comprises:
each frogman carries a position signal generating device; determining real-time distance values among the frogmen according to the position information generated by the position signal generating devices; and determining the voice transmission distance of the two-way information transmission channel between the frogman and the frogman communication target according to the real-time distance values between the frogmen, and recording the voice transmission distance of the two-way information transmission channel into the voice library in real time.
5. The method of claim 4, wherein determining the proportion of the noise signals with different voice transmission distances in the noise-containing voice signals according to the distance between the two-way information transmission channels comprises:
acquiring, from the voice library, noise-containing voice sample data of the voice transmission processes of a plurality of groups of frogmen through the two-way information transmission channel, together with the voice transmission distances of the two-way information transmission channel, under historical intercom scenes of a plurality of frogmen, wherein the noise-containing voice comprises a pure voice signal and a noise signal;
carrying out data mining on the noise signals of the noise-containing voice at the voice transmission distances of different two-way information transmission channels, and constructing a noise-containing voice probability distribution function model by taking the noise signals and the voice transmission distances of the two-way information transmission channels as input features and taking the distribution probability of the noise signals at the voice transmission distances of different two-way information transmission channels as output labels; and acquiring, according to the noise-containing voice probability distribution function model, the proportion of the noise signal in the noise-containing voice at different voice transmission distances of the two-way information transmission channels.
6. The method of claim 5, wherein setting the evaluation index of voice definition according to the proportion of the noise signal in the noise-containing voice signal comprises:
acquiring, by using a big data method, the evaluation index of voice definition corresponding to the proportion of the noise signal in the noise-containing voice, wherein the evaluation indexes comprise pure, clear, average and poor; setting an index weight for each evaluation index, establishing an evaluation index matrix of voice definition according to the evaluation index of voice definition corresponding to the proportion of the noise signal in the noise-containing voice, establishing a noise proportion matrix according to the proportion of the noise signal in the noise-containing voice, and acquiring, through fuzzy comprehensive evaluation, the membership degree matrix of the proportion of the noise signal in the noise-containing voice with respect to voice definition;
and obtaining the voice definitions corresponding to different proportions of the noise signal in the noise-containing voice according to the membership degree matrix and the index weights.
7. The method of claim 6, wherein dynamically adjusting the average window function length used for windowing when the noise-containing voice is converted into voice features, according to the comparison result between the voice definition in the communication process and the expected voice definition evaluation index, comprises:
presetting an expected voice definition evaluation index of the frogman communication target, and acquiring the noise-containing voice signal received by the frogman communication target over the two-way information transmission channel between the frogman and the frogman communication target, together with the voice transmission distance of that channel; inputting the voice transmission distance of the two-way information transmission channel and the received noise-containing voice signal as input features into the noise-containing voice probability distribution function model to obtain the proportion of the noise signal in the noise-containing voice signal at the current voice transmission distance, and obtaining the voice definition evaluation index of the noise-containing voice signal according to that proportion;
comparing the expected voice definition evaluation index of the frogman communication target with the voice definition evaluation index of the noise-containing voice signal received by the frogman communication target; when the two are inconsistent, acquiring the proportion of the noise signal corresponding to the expected voice definition evaluation index and the proportion of the noise signal in the received noise-containing voice signal, calculating the noise proportion deviation value between the two proportions, and determining, according to the noise proportion deviation value, the average window function length used for windowing when the received noise-containing voice signal is converted into voice features; the larger the proportion of the noise signal in the noise-containing voice signal, the shorter the average window function length.
8. The method of claim 7, wherein setting the frequency threshold for the noise-containing voice based on the human ear sound frequency masking characteristic comprises:
acquiring the human ear sound frequency masking characteristic by using big data, wherein the characteristic comprises the highest and lowest sound frequencies perceivable by human ears; when the frequency signals are generated in step S2, filtering them with the highest and lowest sound frequencies as the upper and lower thresholds, removing frequency signals above the upper threshold and below the lower threshold, converting the filtered frequency signals into time-domain waveforms through inverse Fourier transform, and marking the time-domain waveforms as voice features to be input to the voice broadcasting terminal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310876048.8A CN116597856B (en) | 2023-07-18 | 2023-07-18 | Voice quality enhancement method based on frogman intercom |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116597856A true CN116597856A (en) | 2023-08-15 |
CN116597856B CN116597856B (en) | 2023-09-22 |
Family
ID=87599531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310876048.8A Active CN116597856B (en) | 2023-07-18 | 2023-07-18 | Voice quality enhancement method based on frogman intercom |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597856B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117496953A (en) * | 2023-12-29 | 2024-02-02 | 山东贝宁电子科技开发有限公司 | Frog voice processing method based on voice enhancement technology |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050971A (en) * | 2013-03-15 | 2014-09-17 | 杜比实验室特许公司 | Acoustic echo mitigating apparatus and method, audio processing apparatus, and voice communication terminal |
WO2015005914A1 (en) * | 2013-07-10 | 2015-01-15 | Nuance Communications, Inc. | Methods and apparatus for dynamic low frequency noise suppression |
CN104704560A (en) * | 2012-09-04 | 2015-06-10 | 纽昂斯通讯公司 | Formant dependent speech signal enhancement |
CN111968630A (en) * | 2019-05-20 | 2020-11-20 | 北京字节跳动网络技术有限公司 | Information processing method and device and electronic equipment |
CN112102846A (en) * | 2020-09-04 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Audio processing method and device, electronic equipment and storage medium |
WO2021258832A1 (en) * | 2020-06-23 | 2021-12-30 | 青岛科技大学 | Method for denoising underwater acoustic signal on the basis of adaptive window filtering and wavelet threshold optimization |
CN114822584A (en) * | 2022-04-25 | 2022-07-29 | 东北大学 | Transmission device signal separation method based on integral improved generalized cross-correlation |
Non-Patent Citations (2)
Title |
---|
YUMA KOIZUMI ET AL.: "Trainable Adaptive Window Switching for Speech Enhancement", ICASSP 2019 *
WU Yifei; LI Yuwei: "Research on the Performance of M-Sequence Signals in Active Sonar", Ship Electronic Engineering, no. 10 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107393542B (en) | Bird species identification method based on two-channel neural network | |
CN116597856B (en) | Voice quality enhancement method based on frogman intercom | |
DE602004003443T2 (en) | Speech period detection based on electromyography | |
CN105469785A (en) | Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof | |
CN108346434B (en) | Voice quality assessment method and device | |
DE112017007005B4 (en) | ACOUSTIC SIGNAL PROCESSING DEVICE, ACOUSTIC SIGNAL PROCESSING METHOD AND HANDS-FREE COMMUNICATION DEVICE | |
CN1312938A (en) | System and method for reducing noise | |
CN112767963A (en) | Voice enhancement method, device and system and computer readable storage medium | |
DE60127550T2 (en) | METHOD AND SYSTEM FOR ADAPTIVE DISTRIBUTED LANGUAGE RECOGNITION | |
CN114338623B (en) | Audio processing method, device, equipment and medium | |
CN111710344A (en) | Signal processing method, device, equipment and computer readable storage medium | |
CN1049062C (en) | Method of converting speech | |
DE60300267T2 (en) | Method and device for multi-reference correction of the spectral speech distortions caused by a communication network | |
EP0644674A2 (en) | Method of transmission quality estimation of a speech transmission link | |
CN111341331B (en) | Voice enhancement method, device and medium based on local attention mechanism | |
DE102012102882A1 (en) | An electrical device and method for receiving voiced voice signals therefor | |
CN105635453A (en) | Conversation volume automatic adjusting method and system, vehicle-mounted device, and automobile | |
CN103474067A (en) | Voice signal transmission method and system | |
US20230186943A1 (en) | Voice activity detection method and apparatus, and storage medium | |
CN111341351A (en) | Voice activity detection method and device based on self-attention mechanism and storage medium | |
CN116798434A (en) | Communication enhancement method, system and storage medium based on voice characteristics | |
CN113411456B (en) | Voice quality assessment method and device based on voice recognition | |
DE3875894T2 (en) | ADAPTIVE MULTIVARIABLE ANALYSIS DEVICE. | |
CN114827363A (en) | Method, device and readable storage medium for eliminating echo in call process | |
CN113709625A (en) | Self-adaptive volume adjusting method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||