CN111370028A

CN111370028A - Voice distortion detection method and system

Info

Publication number: CN111370028A
Application number: CN202010097544.XA
Authority: CN
Inventors: 王恒洲; 肖龙源; 李稀敏; 蔡振华; 刘晓葳
Original assignee: Xiamen Kuaishangtong Technology Co Ltd
Current assignee: Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2020-07-03

Abstract

The invention discloses a voice distortion detection method comprising the following steps: S11, dividing the audio to be detected into a plurality of unit audios and storing the audio data of each unit audio in one array; S12, detecting the array with a convolutional neural network detection model, wherein the output layer of the model outputs a predicted value, the predicted value being the fidelity degree of the audio supplied to the input layer of the model; and S13, averaging the predicted values of all the unit audios to obtain the fidelity degree of the audio to be detected. The invention also discloses a voice distortion detection system adopting the method. The invention works well in a variety of distortion environments and evaluates the degree of distortion of speech audio.

Description

Voice distortion detection method and system

Technical Field

The invention relates to the technical field of audio recognition, in particular to a voice distortion detection method and system.

Background

High-quality speech audio is key to speech recognition and voiceprint recognition. In practice, however, recording conditions are limited, and the speech processing actually applied can distort the speech.

In the prior art, the degree of audio distortion is mainly measured by PESQ, LPC spectral distance, and similar measures. These measures work only under specific conditions.

Disclosure of Invention

To solve the above problems, the invention provides a voice distortion detection method and system that work well in a variety of distortion environments and evaluate the degree of distortion of speech audio.

To achieve this purpose, the invention adopts the following technical scheme:

A voice distortion detection method, comprising the following steps:

S11, dividing the audio to be detected into a plurality of unit audios, and storing the audio data of each unit audio in one array;

S12, detecting the array with a convolutional neural network detection model, wherein the output layer of the model outputs a predicted value, and the predicted value is the fidelity degree of the audio supplied to the input layer of the model;

S13, averaging the predicted values of all the unit audios to obtain the fidelity degree of the audio to be detected.

Preferably, the method for constructing the convolutional neural network detection model includes the following steps:

S21, acquiring a training set, wherein the training set comprises audio data with different fidelity degrees;

S22, training the convolutional neural network detection model with the training set as the input to the input layer, the output layer of the model outputting the predicted value;

S23, taking the difference between the predicted value for each training audio and the actual value of its fidelity degree as the loss value, training iteratively with cross-entropy loss as the loss function, and finishing the training once the loss value has been optimized to a stable level.

Preferably, the convolutional neural network detection model includes 3 CNN convolutional layers, 3 pooling layers, and 1 fully-connected layer, where each CNN convolutional layer is followed by 1 pooling layer, and the last pooling layer is followed by the fully-connected layer.

Preferably, the audio to be detected is divided as follows: a window of A seconds is moved in steps of B seconds (the displacement time), and each A-second span of audio forms 1 unit audio; where less than A seconds of audio remain, the actual duration is used; and B is greater than 0 and less than or equal to A.
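The division rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_units`, the sample-based representation, and the choice to keep starting new windows until the end of the audio (so that trailing units are shorter than A seconds) are not taken from the patent.

```python
def split_into_units(samples, sample_rate, window_s, shift_s):
    """Split audio samples into unit audios (window A seconds, shift B seconds).

    Assumption: a new unit starts every B seconds until the audio ends, so
    units near the end may be shorter than A seconds ("actual duration").
    """
    if not 0 < shift_s <= window_s:
        raise ValueError("shift B must satisfy 0 < B <= A")
    window = int(round(window_s * sample_rate))
    shift = max(1, int(round(shift_s * sample_rate)))  # never step by 0 samples
    units = []
    start = 0
    while start < len(samples):
        units.append(samples[start:start + window])  # slice is clipped at the end
        start += shift
    return units


# Toy example: 10 seconds of audio at a hypothetical 8 samples per second.
units = split_into_units(list(range(80)), sample_rate=8, window_s=2, shift_s=1)
print(len(units), len(units[0]), len(units[-1]))
```

Under this reading, a 10-second clip with A = 2 and B = 1 yields 10 units, the last of which is 1 second long. The embodiment's figure of 20 units does not follow from this reading, so the sketch should not be taken as the patent's exact counting rule.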

Preferably, the array and the training set are subjected to a short-time Fourier transform (STFT) before being input into the convolutional neural network detection model.

The invention also provides a voice distortion detection system adopting the method, comprising:

an audio input terminal, used for inputting the audio to be detected and dividing it into the unit audios;

and a detection module, used for detecting the audio data of each unit audio separately, outputting the predicted values, and averaging all the predicted values to obtain an evaluation result.

The invention has the beneficial effects that:

(1) the degree of distortion is detected by a convolutional neural network, so the method is suitable for a variety of distortion environments;

(2) the end-to-end network structure yields the evaluation result directly once the audio is input, so detection is fast;

(3) the audio to be detected is divided into unit audios that are detected separately, and the result is determined from the average value, which improves detection accuracy.

Drawings

Fig. 1 is a detection flow chart of a speech distortion detection system according to an embodiment of the present invention;

Fig. 2 is a network framework diagram of a convolutional neural network detection model according to an embodiment of the present invention.

Detailed Description

To make the technical problems to be solved, the technical solutions, and the advantageous effects of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.

The present embodiment provides a speech distortion detection system for evaluating the degree of distortion of an input speech audio.

The system comprises an audio input terminal and a detection module.

As shown in Fig. 1, a user inputs one 10-second segment of audio to be detected into the audio input terminal. The audio input terminal divides the audio into 20 unit audios, taking 2 seconds as the window and 1 second as the displacement time, applies a short-time Fourier transform (STFT) to the audio data of the unit audios, and inputs the result into the detection module for detection.

Using STFT features as the input of the convolutional neural network provides more spectral information.
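As an illustration of what such an input looks like, the following sketch computes a magnitude spectrogram with a naive STFT written against the Python standard library. The frame length, hop size, and Hann window are assumptions, since the patent does not give STFT parameters, and a practical system would use an FFT library rather than this direct DFT.

```python
import cmath
import math


def stft_magnitude(samples, frame_len=8, hop=4):
    """Return a magnitude spectrogram: one list of frame_len // 2 + 1 bins per frame.

    frame_len, hop and the Hann window are illustrative assumptions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# A sine that completes 2 cycles per 8-sample frame peaks in bin 2.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spectrogram = stft_magnitude(tone)
print(len(spectrogram), len(spectrogram[0]))
```

The result is a two-dimensional time-frequency array, which is the kind of input a convolutional network can treat like an image.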

The audio data of adjacent unit audios partially overlap, so the same audio data is detected more than once, which improves the reliability of the detection result.

The detection module is loaded with a convolutional neural network detection model. As shown in Fig. 2, the model includes 3 CNN convolutional layers, 3 pooling layers, and 1 fully-connected layer, with 1 pooling layer following each CNN convolutional layer and the fully-connected layer following the last pooling layer.

Using CNN layers as the basic building blocks allows the STFT features to be processed more effectively.
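To make the layer ordering concrete, the sketch below traces how the shape of a spectrogram changes through three convolution and pooling stages before it reaches the fully-connected layer. The kernel size, padding, 2x2 pooling, and channel counts are hypothetical, as the patent does not specify them; only the 3 + 3 + 1 layer ordering comes from the text.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of one convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Output length of non-overlapping pooling along one axis."""
    return size // kernel


def trace_shapes(freq_bins, time_frames, channels=(16, 32, 64)):
    """List (layer name, shape) pairs for 3 conv+pool stages and the FC input.

    Channel counts and kernel settings are illustrative assumptions.
    """
    height, width = freq_bins, time_frames
    shapes = [("input", (1, height, width))]
    for i, out_channels in enumerate(channels, start=1):
        height, width = conv_out(height), conv_out(width)
        shapes.append((f"conv{i}", (out_channels, height, width)))
        height, width = pool_out(height), pool_out(width)
        shapes.append((f"pool{i}", (out_channels, height, width)))
    # The last pooling output is flattened before the fully-connected layer.
    shapes.append(("fc_input", (channels[-1] * height * width,)))
    return shapes


for name, shape in trace_shapes(128, 64):
    print(name, shape)
```

With these assumed settings each convolution preserves the spatial size and each pooling layer halves it, so a 128 x 64 spectrogram reaches the fully-connected layer as a vector of 8192 values.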

The model converts the matrix output by the fully-connected layer into 1 number between 0 and 1 through the SoftMax algorithm, and this number serves as the predicted value.
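The patent does not say how the SoftMax output is reduced to a single number. One plausible reading, shown here purely as an assumption, is that the fully-connected layer emits two scores (distorted, faithful) and the SoftMax probability of the "faithful" class is used as the predicted value. The names `fidelity_from_logits` and `faithful_index` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fidelity_from_logits(logits, faithful_index=1):
    """Assumed reduction: probability of the 'faithful' class, in (0, 1)."""
    return softmax(logits)[faithful_index]


print(fidelity_from_logits([0.0, 0.0]))
print(fidelity_from_logits([-2.0, 3.0]))
```

Equal scores give a value of 0.5, and the value approaches 1 as the "faithful" score dominates.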

The construction method of the model comprises the following steps:

S1, collecting a training set, wherein the training set comprises audio data with different fidelity degrees. The fidelity degrees are obtained through other measures and then labeled; they are normalized and expressed as numbers between 0 and 1. The larger the number, the higher the fidelity of the corresponding training audio. The training audio is cut to a length of 1 to 2 seconds.

S2, training the model with the training set as the input to the input layer, the output layer of the model outputting a predicted value.

S3, taking the difference between the predicted value for each training audio and the actual value of its fidelity degree as the loss value, training iteratively with cross-entropy loss as the loss function, and finishing the training once the loss value has been optimized to a stable level.
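The labeling and loss steps can be sketched as follows. Both functions are assumptions made for illustration: the patent does not state how the labels are normalized, so min-max scaling is used here, and the cross-entropy is written in its two-class form so that it accepts a fidelity label anywhere between 0 and 1.

```python
import math


def normalize_score(score, low, high):
    """Min-max map a raw quality score onto [0, 1] (assumed normalization)."""
    return min(1.0, max(0.0, (score - low) / (high - low)))


def cross_entropy(predicted, target, eps=1e-7):
    """Two-class cross-entropy between a predicted and a labeled fidelity."""
    p = min(1.0 - eps, max(eps, predicted))  # keep log() finite
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Hypothetical raw score of 2.0 on a hypothetical scale from -0.5 to 4.5.
label = normalize_score(2.0, low=-0.5, high=4.5)
print(label)
print(cross_entropy(0.5, label), cross_entropy(0.8, label))
```

For a fixed label, this loss is smallest when the prediction equals the label, which is the property iterative training relies on when it drives the loss value toward a stable minimum.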

The model detects the audio data of each unit audio separately. The detection module then smooths the 20 predicted values output by the output layer of the model (by averaging, as in step S13), and the result serves as the evaluation of the fidelity degree of the audio to be detected that was input into the audio input terminal.
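A minimal sketch of this final step, assuming the smoothing is a plain arithmetic mean. The `predict` argument is a hypothetical stand-in for the trained model and maps one unit audio to a value between 0 and 1.

```python
def average_fidelity(units, predict):
    """Average the per-unit predicted values into one fidelity score.

    predict is a hypothetical callable standing in for the trained model.
    """
    predictions = [predict(unit) for unit in units]
    if not predictions:
        raise ValueError("at least one unit audio is required")
    return sum(predictions) / len(predictions)


# Toy stand-in model: scores a unit by how full its 4-sample window is.
toy_units = [[0, 0, 0, 0], [0, 0]]
print(average_fidelity(toy_units, predict=lambda unit: len(unit) / 4))
```

Because adjacent units overlap, each stretch of audio contributes to several predictions, so a single outlier has limited influence on the averaged score.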

The invention can quantitatively evaluate the fidelity of a recording device when it records voice.

Those skilled in the art will understand that all or part of the steps of the audio data detection method in the above embodiments may be carried out by a program instructing the relevant hardware. The program is stored in a storage medium and includes a number of instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the method described in the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The above description shows and describes the preferred embodiments of the present invention. It is to be understood that the invention is not limited to the forms disclosed herein, and that these forms are not to be construed as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be changed within the scope of the inventive concept expressed herein, in line with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for detecting speech distortion, comprising the steps of:

S13, averaging all the predicted values to obtain the fidelity degree of the audio to be detected.

2. The method for detecting speech distortion according to claim 1, wherein the method for constructing the convolutional neural network detection model comprises the following steps:

3. The method of claim 1, wherein the convolutional neural network detection model comprises 3 CNN convolutional layers, 3 pooling layers, and 1 fully-connected layer, wherein 1 pooling layer is connected to each CNN convolutional layer, and the last pooling layer is connected to the fully-connected layer.

4. The method according to claim 1, wherein the audio to be detected is divided as follows: a window of A seconds is moved in steps of B seconds (the displacement time), and each A-second span of audio forms 1 unit audio; where less than A seconds of audio remain, the actual duration is used; and B is greater than 0 and less than or equal to A.

5. The method of claim 1, wherein the array and the training set are subjected to a short-time Fourier transform before being input into the convolutional neural network detection model.

6. A speech distortion detection system using the method of any of claims 1 to 5, comprising: