CN111370028A - Voice distortion detection method and system - Google Patents

Publication number
CN111370028A
Authority
CN
China
Prior art keywords
audio
neural network
convolutional neural
detection model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010097544.XA
Other languages
Chinese (zh)
Inventor
王恒洲
肖龙源
李稀敏
蔡振华
刘晓葳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010097544.XA
Publication of CN111370028A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a voice distortion detection method, which comprises the following steps: S11, dividing the audio to be detected into a plurality of unit audios, and storing the audio data of each unit audio in one array; S12, detecting each array with a convolutional neural network detection model, wherein the output layer of the model outputs a predicted value, and the predicted value is the fidelity degree of the audio fed to the input layer of the model; and S13, averaging the predicted values of all the unit audios to obtain the fidelity degree of the audio to be detected. The invention also discloses a voice distortion detection system adopting the method. The invention performs well in a variety of distortion environments and evaluates the degree of distortion of voice audio.

Description

Voice distortion detection method and system
Technical Field
The invention relates to the technical field of audio recognition, in particular to a voice distortion detection method and system.
Background
High-quality speech audio is essential to speech recognition and voiceprint recognition. In practice, however, recording conditions are limited, and speech processing itself can introduce distortion.
In the prior art, the degree of audio distortion is mainly measured by metrics such as PESQ and the LPC spectral distance. These metrics work only under specific conditions.
Disclosure of Invention
To solve the above problems, the invention provides a voice distortion detection method and system that perform well in a variety of distortion environments and evaluate the degree of distortion of voice audio.
To achieve this purpose, the invention adopts the following technical solution:
A voice distortion detection method, comprising the following steps:
S11, dividing the audio to be detected into a plurality of unit audios, and storing the audio data of each unit audio in one array;
S12, detecting each array with a convolutional neural network detection model, wherein the output layer of the model outputs a predicted value, and the predicted value is the fidelity degree of the audio fed to the input layer of the model;
S13, averaging the predicted values of all the unit audios to obtain the fidelity degree of the audio to be detected.
Preferably, the method for constructing the convolutional neural network detection model includes the following steps:
S21, acquiring a training set, wherein the training set comprises audio data with different fidelity degrees;
S22, training the convolutional neural network detection model by feeding the training set to its input layer, the output layer of the model outputting the predicted value;
S23, taking the difference between the predicted value for each training audio and its actual fidelity degree as the loss value, iterating the training repeatedly with cross-entropy loss as the loss function, and completing the training once the loss value has stabilized.
Preferably, the convolutional neural network detection model includes 3 convolutional layers, 3 pooling layers, and 1 fully-connected layer, where each convolutional layer is followed by 1 pooling layer, and the last pooling layer is followed by the fully-connected layer.
Preferably, the audio to be detected is divided as follows: with a window of A seconds and a shift of B seconds, each A-second stretch of audio forms 1 unit audio; a stretch shorter than A seconds keeps its actual duration, where 0 < B ≤ A.
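The windowing rule above can be sketched in a few lines of Python. This is only an illustration, not the applicant's implementation: the function name `split_into_units` and its parameters are hypothetical, and the sketch assumes that splitting stops as soon as a window reaches the end of the audio, so only the final unit can be shorter than A seconds.

```python
def split_into_units(samples, sample_rate, window_s, shift_s):
    """Split audio samples into unit audios, one list (array) per unit.

    window_s is A and shift_s is B from the text, both in seconds.
    Assumption: splitting stops once a window reaches the end of the audio,
    so only the final unit can be shorter than A seconds.
    """
    if not 0 < shift_s <= window_s:
        raise ValueError("expected 0 < B <= A")
    win = max(1, round(window_s * sample_rate))  # window length in samples
    hop = max(1, round(shift_s * sample_rate))   # shift in samples
    units = []
    start = 0
    while start < len(samples):
        # A short tail keeps its actual length instead of being padded.
        units.append(list(samples[start:start + win]))
        if start + win >= len(samples):
            break
        start += hop
    return units


audio = list(range(50))  # 5 seconds at a toy rate of 10 samples per second
units = split_into_units(audio, sample_rate=10, window_s=2, shift_s=2)
print([len(u) for u in units])  # [20, 20, 10]
```

With B smaller than A, consecutive units overlap; with B equal to A, they tile the audio without overlap.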
Preferably, the array and the training set undergo a short-time Fourier transform before being input into the convolutional neural network detection model.
The invention also provides a voice distortion detection system adopting the above method, comprising:
an audio input terminal for inputting the audio to be detected and dividing it into the unit audios;
and a detection module for detecting the audio data of each unit audio, outputting the predicted values, and averaging all the predicted values to obtain an evaluation result.
The invention has the beneficial effects that:
(1) the degree of distortion is detected by a convolutional neural network, which makes the method suitable for a variety of distortion environments;
(2) the end-to-end network structure yields the evaluation result directly from the input audio, so detection is fast;
(3) the audio to be detected is divided into unit audios that are detected separately, and the result is determined from their average, which improves detection accuracy.
Drawings
Fig. 1 is a detection flow chart of a voice distortion detection system according to an embodiment of the present invention;
Fig. 2 is a network framework diagram of a convolutional neural network detection model according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention clearer and more obvious, the present invention is further described in detail with reference to specific embodiments below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present embodiment provides a speech distortion detection system for evaluating the degree of distortion of an input speech audio.
The system comprises an audio input terminal and a detection module.
As shown in Fig. 1, a user inputs one 10-second segment of audio to be detected into the audio input terminal. The terminal divides the audio into 20 unit audios, using a 2-second window and a 1-second shift, applies a short-time Fourier transform (STFT) to the audio data of each unit audio, and feeds the result to the detection module for detection.
Using STFT features as the input of the convolutional neural network gives the network more spectral information to work with.
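As a rough illustration of the STFT preprocessing step, the sketch below computes a magnitude spectrogram for one unit audio with NumPy. The patent does not state the frame length, hop size, or window function, so `n_fft=512`, `hop=256`, and the Hann window are assumptions, as is the function name `stft_magnitude`.

```python
import numpy as np


def stft_magnitude(unit, n_fft=512, hop=256):
    """Return the magnitude spectrogram of one unit audio.

    The result has shape (n_fft // 2 + 1, n_frames): frequency bins by frames.
    n_fft, hop and the Hann window are illustrative choices, not patent values.
    """
    x = np.asarray(unit, dtype=np.float64)
    if len(x) < n_fft:
        x = np.pad(x, (0, n_fft - len(x)))  # zero-pad very short units
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    # One real FFT per frame; transpose so that rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)  # 1 second of a 1 kHz tone
spec = stft_magnitude(tone)
print(spec.shape)  # (257, 30)
```

The two-dimensional spectrogram is what lets a convolutional network treat the unit audio like an image, with time on one axis and frequency on the other.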
Adjacent unit audios partially overlap, so the same audio data is detected multiple times, which improves the reliability of the detection result.
The detection module is loaded with a convolutional neural network detection model. As shown in Fig. 2, the model includes 3 convolutional layers, 3 pooling layers, and 1 fully-connected layer; each convolutional layer is followed by 1 pooling layer, and the last pooling layer is followed by the fully-connected layer.
Using convolutional layers as the basic units allows the STFT features to be processed more effectively.
Through a SoftMax function, the model converts the matrix output by the fully-connected layer into a single number between 0 and 1, which serves as the predicted value.
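The layer layout described above (three convolution and pooling pairs, one fully-connected layer, then SoftMax) can be sketched as a NumPy forward pass. Everything beyond that layout is an assumption made for illustration: the 3x3 kernels, the channel counts, the ReLU activation, the 2x2 max pooling, the two-class output whose second probability is read as the fidelity, and the random untrained weights. The function names are hypothetical.

```python
import numpy as np


def conv2d(x, kernels):
    """Valid cross-correlation: x is (C_in, H, W), kernels is (C_out, C_in, kH, kW)."""
    c_out, c_in, kh, kw = kernels.shape
    h, w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, h, w))
    for i in range(kh):
        for j in range(kw):
            # Add the contribution of kernel tap (i, j) across all input channels.
            out += np.einsum("oc,chw->ohw", kernels[:, :, i, j], x[:, i:i + h, j:j + w])
    return out


def max_pool2x2(x):
    c, h, w = x.shape
    x = x[:, :h // 2 * 2, :w // 2 * 2]  # drop an odd trailing row or column
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def init_params(input_shape, channels=(4, 8, 16), seed=0):
    """Random, untrained weights sized for a (H, W) spectrogram input."""
    rng = np.random.default_rng(seed)
    convs, c_in = [], 1
    h, w = input_shape
    for c_out in channels:
        convs.append(rng.normal(0.0, 0.1, size=(c_out, c_in, 3, 3)))
        h, w = (h - 2) // 2, (w - 2) // 2  # 3x3 valid conv, then 2x2 pooling
        c_in = c_out
    return {"conv": convs,
            "fc_w": rng.normal(0.0, 0.1, size=(2, c_in * h * w)),
            "fc_b": np.zeros(2)}


def predict_fidelity(spec, params):
    x = spec[np.newaxis, :, :]  # single input channel
    for k in params["conv"]:    # three conv layers, each followed by pooling
        x = max_pool2x2(np.maximum(conv2d(x, k), 0.0))  # ReLU is an assumption
    logits = params["fc_w"] @ x.ravel() + params["fc_b"]  # fully-connected layer
    return float(softmax(logits)[1])  # index 1 read as "faithful" (assumed)


spec = np.random.default_rng(1).random((64, 32))
print(0.0 < predict_fidelity(spec, init_params(spec.shape)) < 1.0)  # True
```

Because the weights here are random, the returned number is meaningless as a quality score; the sketch only shows how the shapes flow from a spectrogram to a single value between 0 and 1.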
The construction method of the model comprises the following steps:
S1, collecting a training set, wherein the training set comprises audio data with different fidelity degrees. The fidelity degrees are obtained through other measures and then labeled; they are normalized and expressed as numbers between 0 and 1. The larger the number, the higher the fidelity of the corresponding training audio. Each training audio is cut to a length of 1 to 2 seconds.
S2, training the model by feeding the training set to its input layer, the output layer of the model outputting a predicted value.
S3, taking the difference between the predicted value for each training audio and its actual fidelity degree as the loss value, iterating the training repeatedly with cross-entropy loss as the loss function, and completing the training once the loss value has stabilized.
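The labeling and loss steps S1 to S3 can be illustrated with a small sketch. It assumes min-max normalization of the externally measured score and the binary form of cross-entropy with a label between 0 and 1; the patent names cross-entropy but gives no formula. The stability test (spread of the last few loss values below a tolerance) is likewise only one possible reading of "until the loss value has stabilized". All function names and default values are hypothetical.

```python
import math


def normalize_label(score, lowest, highest):
    """Map a quality score from another measure onto [0, 1] (min-max scaling)."""
    return (score - lowest) / (highest - lowest)


def cross_entropy(predicted, actual, eps=1e-7):
    """Binary cross-entropy between a predicted and a labeled fidelity in [0, 1]."""
    p = min(max(predicted, eps), 1.0 - eps)  # avoid log(0)
    return -(actual * math.log(p) + (1.0 - actual) * math.log(1.0 - p))


def loss_is_stable(history, window=5, tol=1e-3):
    """True when the last `window` loss values differ by less than `tol`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol


label = normalize_label(3.0, lowest=1.0, highest=5.0)   # 0.5
print(round(cross_entropy(0.5, label), 4))              # 0.6931
```

In a training loop, the mean of `cross_entropy` over a batch would be appended to `history` after each iteration, and training would stop once `loss_is_stable(history)` returns true.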
The model detects the audio data of each unit audio separately, and the detection module smooths (averages) the 20 predicted values output by the model's output layer to obtain the evaluation result for the fidelity degree of the audio to be detected that was input into the audio input terminal.
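The final aggregation is a plain arithmetic mean over the per-unit predictions, as in step S13. A minimal sketch, with the hypothetical name `overall_fidelity`:

```python
def overall_fidelity(predictions):
    """Average the per-unit predicted values into one fidelity score."""
    if not predictions:
        raise ValueError("at least one predicted value is required")
    return sum(predictions) / len(predictions)


scores = [0.8, 0.6, 1.0]  # made-up per-unit predictions
print(round(overall_fidelity(scores), 3))  # 0.8
```

Averaging means a single badly distorted unit lowers the overall score without dominating it; whether that is the desired behavior depends on the application.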
The invention can quantitatively evaluate how faithfully a recording device records voice.
Those skilled in the art will understand that all or part of the steps of the audio data detection method in the above embodiments may be carried out by a program that instructs the relevant hardware. The program is stored in a storage medium and includes several instructions that enable a device (for example, a single-chip microcomputer or a chip) or a processor to execute all or part of the steps of the method described in the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and any other medium capable of storing program code.
The above description shows and describes preferred embodiments of the present invention. It is to be understood that the invention is not limited to the forms disclosed herein, that these forms are not to be construed as excluding other embodiments, and that the invention can be used in various other combinations, modifications, and environments and can be changed within the scope of the inventive concept described herein, in line with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A voice distortion detection method, comprising the following steps:
S11, dividing the audio to be detected into a plurality of unit audios, and storing the audio data of each unit audio in one array;
S12, detecting each array with a convolutional neural network detection model, wherein the output layer of the model outputs a predicted value, and the predicted value is the fidelity degree of the audio fed to the input layer of the model;
S13, averaging all the predicted values to obtain the fidelity degree of the audio to be detected.
2. The voice distortion detection method according to claim 1, wherein the method for constructing the convolutional neural network detection model comprises the following steps:
S21, acquiring a training set, wherein the training set comprises audio data with different fidelity degrees;
S22, training the convolutional neural network detection model by feeding the training set to its input layer, the output layer of the model outputting the predicted value;
S23, taking the difference between the predicted value for each training audio and its actual fidelity degree as the loss value, iterating the training repeatedly with cross-entropy loss as the loss function, and completing the training once the loss value has stabilized.
3. The method of claim 1, wherein the convolutional neural network detection model comprises 3 convolutional layers, 3 pooling layers, and 1 fully-connected layer, wherein each convolutional layer is followed by 1 pooling layer, and the last pooling layer is followed by the fully-connected layer.
4. The method according to claim 1, wherein the audio to be detected is divided as follows: with a window of A seconds and a shift of B seconds, each A-second stretch of audio forms 1 unit audio; a stretch shorter than A seconds keeps its actual duration, where 0 < B ≤ A.
5. The method of claim 1, wherein the array and the training set undergo a short-time Fourier transform before being input into the convolutional neural network detection model.
6. A voice distortion detection system using the method of any one of claims 1 to 5, comprising:
an audio input terminal for inputting the audio to be detected and dividing it into the unit audios;
and a detection module for detecting the audio data of each unit audio, outputting the predicted values, and averaging all the predicted values to obtain an evaluation result.
CN202010097544.XA 2020-02-17 2020-02-17 Voice distortion detection method and system Pending CN111370028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010097544.XA CN111370028A (en) 2020-02-17 2020-02-17 Voice distortion detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010097544.XA CN111370028A (en) 2020-02-17 2020-02-17 Voice distortion detection method and system

Publications (1)

Publication Number Publication Date
CN111370028A true CN111370028A (en) 2020-07-03

Family

ID=71208041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010097544.XA Pending CN111370028A (en) 2020-02-17 2020-02-17 Voice distortion detection method and system

Country Status (1)

Country Link
CN (1) CN111370028A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997005730A1 (en) * 1995-07-27 1997-02-13 British Telecommunications Public Limited Company Assessment of signal quality
WO2008016531A2 (en) * 2006-08-01 2008-02-07 Dts, Inc. Neural network filtering techniques for compensating linear and non-linear distortion of an audio transducer
CN107358966A (en) * 2017-06-27 2017-11-17 北京理工大学 No-reference objective speech quality evaluation method based on deep-learning speech enhancement
CN107886968A (en) * 2017-12-28 2018-04-06 广州讯飞易听说网络科技有限公司 Speech evaluation method and system
CN109065072A (en) * 2018-09-30 2018-12-21 中国科学院声学研究所 Objective speech quality assessment method based on a deep neural network
CN109872720A (en) * 2019-01-29 2019-06-11 广东技术师范学院 Re-recorded speech detection algorithm robust to different scenes, based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁民 等: "声音信号质量评价技术", 《数字技术与应用》, no. 6, 31 December 2011 (2011-12-31), pages 141 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448514A (en) * 2021-06-02 2021-09-28 合肥群音信息服务有限公司 Automatic processing system of multisource voice data
CN113448514B (en) * 2021-06-02 2022-03-15 合肥群音信息服务有限公司 Automatic processing system of multisource voice data

Similar Documents

Publication Publication Date Title
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
CN108900725B (en) Voiceprint recognition method and device, terminal equipment and storage medium
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
CN109074822A (en) Specific sound recognition methods, equipment and storage medium
CN111080109B (en) Customer service quality evaluation method and device and electronic equipment
CN110797031A (en) Voice change detection method, system, mobile terminal and storage medium
CN116153330B (en) Intelligent telephone voice robot control method
CN109300470B (en) Mixing separation method and mixing separation device
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
CN115062678A (en) Training method of equipment fault detection model, fault detection method and device
CN107545898B (en) Processing method and device for distinguishing speaker voice
CN113192536A (en) Training method of voice quality detection model, voice quality detection method and device
Heo et al. Automated recovery of damaged audio files using deep neural networks
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN111370028A (en) Voice distortion detection method and system
US20230245674A1 (en) Method for learning an audio quality metric combining labeled and unlabeled data
Sawata et al. Improving character error rate is not equal to having clean speech: Speech enhancement for asr systems with black-box acoustic models
CN111640451B (en) Maturity evaluation method and device, and storage medium
CN111261192A (en) Audio detection method based on LSTM network, electronic equipment and storage medium
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN115659248A (en) Power equipment defect identification method, device, equipment and storage medium
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Oostermeijer et al. Frequency gating: Improved convolutional neural networks for speech enhancement in the time-frequency domain
CN111477248B (en) Audio noise detection method and device
CN113593604A (en) Method, device and storage medium for detecting audio quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200703)