CN111370028A - Voice distortion detection method and system - Google Patents
- Publication number: CN111370028A
- Application number: CN202010097544.XA
- Authority
- CN
- China
- Prior art keywords: audio, neural network, convolutional neural, detection model, layer
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a voice distortion detection method comprising the following steps: S11, dividing the audio to be detected into a plurality of unit audios and storing the audio data of each unit audio in an array; S12, detecting each array with a convolutional neural network detection model, whose output layer outputs a predicted value representing the fidelity degree of the audio presented at its input layer; and S13, averaging the predicted values of all the unit audios to obtain the fidelity degree of the audio to be detected. The invention also discloses a voice distortion detection system adopting the method. The invention performs well in a variety of distortion environments and can evaluate the degree of distortion of speech audio.
Description
Technical Field
The invention relates to the technical field of audio recognition, in particular to a voice distortion detection method and system.
Background
High-quality speech audio is key to speech recognition and voiceprint recognition, but recording conditions are often limited, and real-world speech processing can introduce distortion.
In the prior art, the degree of audio distortion is mainly measured with metrics such as PESQ and the LPC spectral distance. These metrics are effective only in certain environments.
Disclosure of Invention
To solve these problems, the invention provides a voice distortion detection method and system that perform well in a variety of distortion environments and can evaluate the degree of distortion of speech audio.
To this end, the invention adopts the following technical scheme:
a method of speech distortion detection, comprising the steps of:
S11, dividing the audio to be detected into a plurality of unit audios, and storing the audio data of each unit audio in an array;
S12, detecting each array with a convolutional neural network detection model, wherein the output layer of the model outputs a predicted value representing the fidelity degree of the audio presented at the input layer;
S13, averaging the predicted values of all the unit audios to obtain the fidelity degree of the audio to be detected.
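Steps S11 through S13 can be sketched as follows. The window length, displacement time, and sample rate are illustrative choices (the patent leaves A and B as parameters), and `predict` is a placeholder for the convolutional neural network detection model:

```python
def split_into_units(samples, sample_rate, window_s=2.0, hop_s=1.0):
    """S11: cut audio into overlapping unit audios. A window of A seconds
    advances by a displacement of B seconds (0 < B <= A); a final unit
    shorter than A seconds keeps its actual length."""
    assert 0 < hop_s <= window_s
    win, hop = int(window_s * sample_rate), int(hop_s * sample_rate)
    return [samples[start:start + win]
            for start in range(0, len(samples), hop)]

def detect_fidelity(samples, sample_rate, predict):
    """S12 + S13: score every unit audio with the model (`predict` stands
    in for the CNN and returns a value in [0, 1]), then average."""
    units = split_into_units(samples, sample_rate)
    scores = [predict(unit) for unit in units]
    return sum(scores) / len(scores)
```

With a 2-second window and a 1-second displacement, a 10-second signal yields overlapping units whose scores are averaged into one fidelity value.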
Preferably, the method for constructing the convolutional neural network detection model includes the following steps:
S21, acquiring a training set comprising audio data with different fidelity degrees;
S22, training the convolutional neural network detection model with the training set as the input layer, the output layer of the model outputting the predicted value;
S23, taking the difference between the predicted value for each training audio and the actual value of its fidelity degree as the loss value, iterating the training with cross-entropy as the loss function, and finishing training once the loss value has stabilized.
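The cross-entropy loss named in S23 can be sketched as below. Treating the model output as a probability and the fidelity label as a (possibly soft) target in [0, 1] is an assumption; the patent does not fix the number of output classes:

```python
import math

def cross_entropy(predicted, actual, eps=1e-12):
    """Cross-entropy between a predicted fidelity in (0, 1) and a target
    fidelity in [0, 1]; clamping with eps avoids log(0)."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(actual * math.log(p) + (1.0 - actual) * math.log(1.0 - p))
```

The loss shrinks as the prediction approaches the label, which is the stabilization criterion S23 optimizes toward.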
Preferably, the convolutional neural network detection model comprises 3 CNN convolutional layers, 3 pooling layers, and 1 fully-connected layer, where each convolutional layer is followed by a pooling layer and the last pooling layer is followed by the fully-connected layer.
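To illustrate this layer arrangement, the sketch below traces a feature-map size through the three convolution + pooling stages down to the input of the fully-connected layer. The kernel size, padding, pooling factor, and input size are all assumptions for illustration; the patent does not specify them:

```python
def trace_shapes(height, width, n_stages=3, kernel=3, pad=1, pool=2):
    """Each stage: a conv layer (kernel 3, stride 1, padding 1 preserves
    the spatial size) followed by a 2x2 pooling layer that halves each
    dimension. Returns the feature-map shape after every stage."""
    shapes = [(height, width)]
    for _ in range(n_stages):
        height = (height + 2 * pad - kernel) + 1   # convolution, stride 1
        width = (width + 2 * pad - kernel) + 1
        height, width = height // pool, width // pool  # pooling
        shapes.append((height, width))
    return shapes

# The fully-connected layer then consumes the flattened final feature map.
```

For a hypothetical 128x64 spectrogram input, the three stages reduce it to 16x8 before flattening.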
Preferably, the audio to be detected is divided as follows: with A seconds as the window and B seconds as the displacement time, each A-second segment of audio is one unit audio; a final segment shorter than A seconds keeps its actual length, where 0 < B ≤ A.
Preferably, the array and the training set are subjected to short-time fourier transform before being input into the convolutional neural network detection model.
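A minimal short-time Fourier transform sketch for this preprocessing step is shown below. It computes a direct DFT per frame; a real system would use an FFT routine, and the frame and hop sizes are illustrative, not taken from the patent:

```python
import cmath

def stft_magnitudes(samples, n_fft=256, hop=128):
    """Magnitude spectrogram: one DFT per overlapping frame, keeping only
    the non-negative frequency bins (n_fft // 2 + 1 of them)."""
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = samples[start:start + n_fft]
        spectrum = [abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                            for n, x in enumerate(frame)))
                    for k in range(n_fft // 2 + 1)]
        frames.append(spectrum)
    return frames
```

A pure tone at DFT bin k should produce its peak magnitude in bin k, which makes the sketch easy to sanity-check.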
The invention also provides a voice distortion detection system adopting the method, comprising:
an audio input terminal for inputting the audio to be detected and dividing it into unit audios;
and a detection module for detecting the audio data of each unit audio, outputting the predicted values, and averaging all the predicted values to obtain the evaluation result.
The invention has the beneficial effects that:
(1) distortion detection is performed by a convolutional neural network, which suits a variety of distortion environments;
(2) the end-to-end network structure yields the evaluation result directly from the input audio, so detection is fast;
(3) the audio to be detected is divided into unit audios that are detected separately, with the result determined by their average, improving detection accuracy.
Drawings
Fig. 1 is a detection flow chart of a speech distortion detection system according to an embodiment of the present invention;
fig. 2 is a network framework diagram of a convolutional neural network detection model according to an embodiment of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantageous effects of the present invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.
The present embodiment provides a speech distortion detection system for evaluating the degree of distortion of an input speech audio.
The system comprises an audio input terminal and a detection module.
As shown in fig. 1, a user inputs a 10-second segment of audio to be detected into the audio input terminal. The audio input terminal divides the audio into 20 unit audios using a 2-second window and a 1-second displacement time, applies a short-time Fourier transform to the audio data of each unit audio, and feeds the result to the detection module for detection.
Using the STFT features as the input of the convolutional neural network provides richer spectral information.
Because the audio data of adjacent unit audios partially overlap, the same audio data is detected multiple times, which improves the reliability of the detection result.
The detection module is loaded with a convolutional neural network detection model. As shown in fig. 2, the model comprises 3 CNN convolutional layers, 3 pooling layers, and 1 fully-connected layer; each convolutional layer is followed by a pooling layer, and the last pooling layer is followed by the fully-connected layer.
A convolutional network is well suited to processing the STFT features.
Through a SoftMax computation, the model converts the matrix output of the fully-connected layer into a single number between 0 and 1, which serves as the predicted value.
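This mapping can be sketched as follows. A two-class output head (distorted vs. high-fidelity) is an assumption — the patent only says the fully-connected output is mapped to a number between 0 and 1 — and the softmax probability of the hypothetical "high fidelity" class serves as the predicted value:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before
    exponentiating, then normalize to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_fidelity(fc_output):
    """Map the fully-connected layer's two logits to one number in (0, 1):
    the probability assigned to the assumed 'high fidelity' class."""
    return softmax(fc_output)[1]
```

Equal logits give a neutral score of 0.5; the score rises toward 1 as the "high fidelity" logit dominates.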
The construction method of the model comprises the following steps:
S1, collecting a training set comprising audio data with different fidelity degrees. The fidelity labels are obtained with other metrics and then normalized to numbers between 0 and 1; the larger the number, the higher the fidelity of the corresponding training audio. The training audio is cut to lengths of 1-2 seconds.
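The normalization of labels to [0, 1] can be sketched as a min-max rescaling. The choice of min-max (rather than some other normalization) and the idea that the raw scores come from PESQ-style metrics are assumptions; the patent only says the fidelity degrees are obtained through other measures and normalized:

```python
def normalize_labels(raw_scores):
    """Min-max normalize raw fidelity scores (e.g. from a PESQ-style
    metric, an assumption) to [0, 1]: larger means higher fidelity."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [1.0 for _ in raw_scores]  # degenerate case: all equal
    return [(s - lo) / (hi - lo) for s in raw_scores]
```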
S2, training the model with the training set as the input layer, the output layer of the model producing the predicted value.
S3, taking the difference between the predicted value for each training audio and the actual value of its fidelity degree as the loss value, iterating the training with cross-entropy as the loss function, and finishing training once the loss value has stabilized.
The model detects the audio data of each unit audio separately, and the detection module smooths (averages) the 20 predicted values output by the model's output layer to obtain the evaluation result for the fidelity degree of the audio input at the audio input terminal.
The invention can quantitatively evaluate the fidelity achieved by recording equipment when recording voice.
Those skilled in the art will understand that all or part of the steps in the above embodiments of the audio data detection method may be implemented by a program instructing the related hardware. The program is stored in a storage medium and includes instructions that enable a device (such as a single-chip microcomputer or chip) or a processor to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage media include: USB flash drives, removable hard disks, Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
While the foregoing shows and describes preferred embodiments of the present invention, the invention is not limited to the forms disclosed herein; other embodiments, combinations, modifications, and environments are possible, and changes may be made within the scope of the inventive concept described herein, commensurate with the above teachings or with the skill and knowledge of the relevant art. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.
Claims (6)
1. A method for detecting speech distortion, comprising the steps of:
s11, dividing the audio to be detected into a plurality of unit audios, and storing audio data of each unit audio into 1 array;
s12, detecting the array based on a convolutional neural network detection model, wherein an output layer of the convolutional neural network detection model outputs a predicted value, and the predicted value is the fidelity degree of the audio corresponding to an input layer of the convolutional neural network detection model;
s13, averaging all the predicted values to obtain the fidelity degree of the audio to be detected.
2. The method for detecting speech distortion according to claim 1, wherein the method for constructing the convolutional neural network detection model comprises the following steps:
s21, acquiring a training set, wherein the training set comprises audio data with different fidelity degrees;
s22, training the convolutional neural network detection model by taking the training set as an input layer, and outputting the predicted value by an output layer of the convolutional neural network detection model;
S23, taking the difference between the predicted value for each training audio and the actual value of its fidelity degree as the loss value, iterating the training with cross-entropy as the loss function, and finishing training once the loss value has stabilized.
3. The method of claim 1, wherein the convolutional neural network detection model comprises 3 CNN convolutional layers, 3 pooling layers, and 1 fully-connected layer, each CNN convolutional layer being followed by 1 pooling layer and the last pooling layer being connected to the fully-connected layer.
4. The method according to claim 1, wherein the audio to be detected is divided as follows: with A seconds as the window and B seconds as the displacement time, each A-second segment of audio is 1 unit audio; a final segment shorter than A seconds keeps its actual length, where 0 < B ≤ A.
5. The method of claim 1, wherein the array and the training set are subjected to short-time fourier transform before being input into the convolutional neural network detection model.
6. A speech distortion detection system using the method of any of claims 1 to 5, comprising:
the audio input terminal is used for inputting the audio to be detected and dividing the audio into the unit audio;
and the detection module is used for respectively detecting the audio data of the unit audio, outputting the predicted values, and averaging all the predicted values to obtain an evaluation result.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010097544.XA (published as CN111370028A) | 2020-02-17 | 2020-02-17 | Voice distortion detection method and system |
Publications (1)
| Publication Number | Publication Date |
| --- | --- |
| CN111370028A | 2020-07-03 |
Family
ID=71208041
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010097544.XA (CN111370028A, pending) | Voice distortion detection method and system | 2020-02-17 | 2020-02-17 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN | CN111370028A (en) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| WO1997005730A1 * | 1995-07-27 | 1997-02-13 | British Telecommunications Public Limited Company | Assessment of signal quality |
| WO2008016531A2 * | 2006-08-01 | 2008-02-07 | DTS, Inc. | Neural network filtering techniques for compensating linear and non-linear distortion of an audio transducer |
| CN107358966A * | 2017-06-27 | 2017-11-17 | 北京理工大学 | No-reference objective speech quality evaluation method based on deep-learning speech enhancement |
| CN107886968A * | 2017-12-28 | 2018-04-06 | 广州讯飞易听说网络科技有限公司 | Speech evaluation method and system |
| CN109065072A * | 2018-09-30 | 2018-12-21 | 中国科学院声学研究所 | Speech quality objective assessment method based on a deep neural network |
| CN109872720A * | 2019-01-29 | 2019-06-11 | 广东技术师范学院 | Convolutional-neural-network-based rerecorded-speech detection algorithm robust to different scenes |
Non-Patent Citations (1)
| Title |
| --- |
| 梁民 et al., "声音信号质量评价技术" [Sound signal quality evaluation technology], 《数字技术与应用》 (Digital Technology and Application), no. 6, 31 December 2011, p. 141 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN113448514A * | 2021-06-02 | 2021-09-28 | 合肥群音信息服务有限公司 | Automatic processing system for multi-source voice data |
| CN113448514B * | 2021-06-02 | 2022-03-15 | 合肥群音信息服务有限公司 | Automatic processing system for multi-source voice data |
Similar Documents
| Publication | Title |
| --- | --- |
| CN108369813B | Specific voice recognition method, apparatus and storage medium |
| CN108900725B | Voiceprint recognition method and device, terminal equipment and storage medium |
| CN108346436B | Voice emotion detection method and device, computer equipment and storage medium |
| CN109074822A | Specific sound recognition method, equipment and storage medium |
| CN111080109B | Customer service quality evaluation method and device and electronic equipment |
| CN110797031A | Voice change detection method, system, mobile terminal and storage medium |
| CN116153330B | Intelligent telephone voice robot control method |
| CN109300470B | Mixing separation method and mixing separation device |
| KR102026226B1 | Method for extracting signal unit features using variational inference model based deep learning and system thereof |
| CN115062678A | Training method of equipment fault detection model, fault detection method and device |
| CN107545898B | Processing method and device for distinguishing speaker voice |
| CN113192536A | Training method of voice quality detection model, voice quality detection method and device |
| Heo et al. | Automated recovery of damaged audio files using deep neural networks |
| CN110570871A | TristouNet-based voiceprint recognition method, device and equipment |
| CN111370028A | Voice distortion detection method and system |
| US20230245674A1 | Method for learning an audio quality metric combining labeled and unlabeled data |
| Sawata et al. | Improving character error rate is not equal to having clean speech: Speech enhancement for ASR systems with black-box acoustic models |
| CN111640451B | Maturity evaluation method and device, and storage medium |
| CN111261192A | Audio detection method based on LSTM network, electronic equipment and storage medium |
| US20230116052A1 | Array geometry agnostic multi-channel personalized speech enhancement |
| CN115659248A | Power equipment defect identification method, device, equipment and storage medium |
| CN113707172B | Single-channel voice separation method, system and computer equipment of sparse orthogonal network |
| Oostermeijer et al. | Frequency gating: Improved convolutional neural networks for speech enhancement in the time-frequency domain |
| CN111477248B | Audio noise detection method and device |
| CN113593604A | Method, device and storage medium for detecting audio quality |
Legal Events
| Code | Title | Description |
| --- | --- | --- |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-07-03 |