CN108346434B - Voice quality assessment method and device - Google Patents

Voice quality assessment method and device

Info

Publication number
CN108346434B
CN108346434B (application CN201710055497.0A)
Authority
CN
China
Prior art keywords
layer
network
voice
output
weight
Prior art date
Legal status
Active
Application number
CN201710055497.0A
Other languages
Chinese (zh)
Other versions
CN108346434A (en)
Inventor
祁俊杰
王丽莉
李坤滋
赵艳琼
王君诚
莫一鸣
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Anhui Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Anhui Co Ltd
Priority date: 2017-01-24
Filing date: 2017-01-24
Publication date: 2020-12-22
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Anhui Co Ltd
Priority to CN201710055497.0A
Publication of CN108346434A
Application granted
Publication of CN108346434B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/08: Testing, supervising or monitoring using real traffic

Abstract

A method and apparatus for speech quality assessment, comprising: according to original voice information and the corresponding degraded voice information obtained by testing with the original voice information, calculating a POLQA-MOS value corresponding to each effective voice segment in the degraded voice information according to the POLQA algorithm; extracting acoustic features from the effective voice segments; establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm; and inputting the acoustic features of voice data into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data. The accuracy of voice quality evaluation can be improved, and the call quality of network users can be truly reflected.

Description

Voice quality assessment method and device
Technical Field
The present invention relates to the field of communications, and in particular, to a method and an apparatus for speech quality assessment.
Background
Voice quality in mobile communication networks has two aspects: intelligibility, i.e. how well the words and sentences in the speech can be understood, and naturalness, i.e. how well the speaker can be recognized. With the continuous development of communication, multimedia, computer, office-automation and artificial intelligence technologies, speech signal processing has gradually penetrated every aspect of social life and has become one of the core technologies affecting the development of modern society. The performance of speech coding and decoding, speech communication, speech processing and similar systems has likewise become an important factor in the exchange of information in modern society.
Speech quality assessment methods fall into two main categories: subjective evaluation and objective evaluation. Subjective assessment is the evaluation of speech quality by human listeners; since humans are the ultimate recipients of speech, this evaluation is a true reflection of speech quality. The Mean Opinion Score (MOS), proposed by the International Telecommunication Union (ITU) in 1996, is a widely used subjective evaluation method that reflects human perception of speech quality through the mean opinion score of the testers. Speech quality is graded on five levels: excellent (5 points), good (4 points), fair (3 points), poor (2 points) and bad (1 point). The MOS scoring method has two advantages: first, since the quality of coding systems is ranked numerically, coding systems with different distortion types can be compared with each other; second, evaluators need only simple training to participate directly, so the evaluation is easy to carry out.
However, subjective MOS evaluation is time-consuming and labor-intensive, insufficiently flexible, poorly repeatable and strongly influenced by human subjectivity. To overcome these disadvantages, research has turned to objective evaluation methods for speech quality, which aim to predict the subjective evaluation value quickly and accurately.
Currently, in the field of mobile communication, objective evaluation methods have become the mainstream of speech quality assessment. They are classified as active (intrusive) or passive (non-intrusive).
Active evaluation generally refers to double-ended comparison methods such as Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Objective Listening Quality Assessment (POLQA). These methods require investment in drive-test equipment and human resources, are only suitable for evaluating road scenes, are time-consuming and labor-intensive, and cannot provide a comprehensive evaluation of the network.
Passive evaluation is currently based on the E-Model algorithm, which relies on network-side indicator statistics. Its principle is to fit network indicators such as packet loss rate, jitter and delay rather than to analyze the voice itself; the algorithm is simple, its accuracy is low, and it cannot reflect the user's real perception of a call. The P.563 algorithm published by ITU-T in 2004 is an evaluation method based on single-ended output speech, but because it targets narrowband speech and its model is too simple, it cannot accurately and reasonably evaluate the quality of mobile-network wideband speech.
In summary, in the prior art, the accuracy of evaluating mobile-network wideband voice quality is not high, and the call quality of network users cannot be truly reflected.
Disclosure of Invention
The embodiment of the invention provides a voice quality evaluation method, which can improve the accuracy of evaluating voice quality and further truly reflect the call quality of network users.
The embodiment of the invention also provides a device for evaluating the voice quality, which can improve the accuracy of evaluating the voice quality and further truly reflect the call quality of network users.
A method of speech quality assessment, the method comprising:
according to original voice information and the corresponding degraded voice information obtained by testing with the original voice information, calculating a POLQA-MOS value corresponding to an effective voice segment in the degraded voice information according to the POLQA algorithm;
extracting acoustic features in the effective voice segments;
establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
and inputting the acoustic features of voice data into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data.
Optionally, the deep learning algorithm includes: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
Optionally, the establishing, by using a deep learning algorithm, a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features includes:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
and establishing a voice quality evaluation model according to the adjusted weight of each layer of network.
Optionally, the determining the weight of each layer of the network except for the input layer and the output layer includes:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
Optionally, the updating the weights of the networks of the respective layers by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature includes:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
Optionally, N is equal to 5.
Optionally, the valid speech segment includes 50 or more frames of speech data.
An apparatus for speech quality assessment, the apparatus comprising:
the calculation module is used for calculating a POLQA-MOS value corresponding to an effective voice segment in degraded voice information according to the POLQA algorithm, based on the original voice information and the corresponding degraded voice information obtained by testing with the original voice information;
the extraction module is used for extracting acoustic features in the effective voice fragments;
the establishing module is used for establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
and the acquisition module is used for inputting the acoustic features of voice data into the voice quality evaluation model and acquiring the ePOLQA-MOS value of the voice data.
Optionally, the deep learning algorithm includes: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
Optionally, the establishing module is further configured to:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
and establishing a voice quality evaluation model according to the adjusted weight of each layer of network.
Optionally, the establishing module is further configured to:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
Optionally, the establishing module is further configured to:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
Optionally, N is equal to 5.
Optionally, the valid speech segment includes 50 or more frames of speech data.
It can be seen from the above technical solutions that, in the embodiment of the present invention, a POLQA-MOS value corresponding to each effective voice segment in the degraded voice information is calculated according to the POLQA algorithm, from the original voice information and the corresponding degraded voice information obtained by testing with the original voice information; acoustic features are extracted from the effective voice segments; a voice quality evaluation model is established from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features, using a deep learning algorithm; and the acoustic features of voice data are input into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data. Because the voice quality evaluation model is established using a deep learning algorithm, the accuracy of evaluating voice quality can be improved, and the call quality of network users can be truly reflected.
Drawings
The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.
FIG. 1 is a flow chart of a method for speech quality assessment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of establishing a speech quality assessment model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of updating weights of networks in each layer according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of determining weights of networks in layers other than the input layer and the output layer according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for speech quality assessment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
In the embodiment of the invention, the POLQA-MOS value corresponding to each effective voice segment is obtained according to the POLQA algorithm; a deep learning algorithm is then used to establish a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to those features. Because the deep learning algorithm makes full use of the effective voice segments for modeling, the accuracy of voice quality evaluation can be improved. Moreover, since measurement and evaluation are based on acoustic features, the result truly reflects the call quality of network users.
The embodiment of the invention is suitable for evaluating voice quality in mobile 2G/3G networks and VoLTE voice quality in 4G networks.
Fig. 1 is a schematic flow chart of a method for evaluating speech quality in an embodiment, which specifically includes:
101. Calculate, according to the POLQA algorithm, a POLQA-MOS value corresponding to each effective voice segment in the degraded voice information, based on the original voice information and the corresponding degraded voice information obtained by testing with the original voice information.
The POLQA algorithm is the ITU-T P.863 standard; it obtains a MOS score by predicting user perception and is one of the most widely recognized international speech quality assessment standards. The POLQA algorithm first adjusts the levels of the original voice and the degraded voice to a uniform volume; second, it applies IRS filtering to fit the voice transmission of a telephone receiver; the MOS score is then obtained through evaluation by a psychoacoustic model.
The POLQA algorithm obtains the MOS score by comparing a known reference signal with the degraded signal that has passed through the system under test; it cannot calculate a MOS value from single-ended speech. That is, without a known reference signal, the MOS value of an input speech signal cannot be evaluated.
In an embodiment of the invention, at least 2000 hours of data are first collected in a high-fidelity recording studio as clean original speech. Dialing tests are then carried out using the original speech, and the corresponding RTP data packets are collected and restored at the Mb interface of the LTE network to obtain the degraded speech. The original speech information of the original speech thus corresponds to the degraded speech information of the degraded speech.
The degraded speech is divided into a plurality of speech segments. The POLQA value of the degraded speech information is calculated according to the POLQA algorithm, yielding the POLQA-MOS value corresponding to each effective voice segment in the degraded speech information. This gives the correspondence between effective voice segments and POLQA-MOS values.
A single frame of data covers only about 10 ms to 50 ms, and it is difficult to learn the MOS attribute from such a short speech sample. A segment is therefore extended to at least 50 frames: when an effective voice segment contains 50 or more frames of speech data, the MOS attribute can be learned relatively easily.
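To make the segmentation concrete, here is a minimal sketch of carving degraded speech into 50-frame effective segments and pairing each with a POLQA-MOS label. The frame length, sample rate and polqa_mos() scorer are all stand-ins: POLQA itself is the licensed ITU-T P.863 algorithm, so only a dummy stub is shown here.

```python
import numpy as np

FRAME_LEN = 160        # samples per frame: 10 ms at 16 kHz (assumed values)
SEGMENT_FRAMES = 50    # an effective segment spans at least 50 frames

def polqa_mos(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Stub for a licensed ITU-T P.863 (POLQA) scorer.

    A real implementation compares the reference and degraded signals and
    returns a MOS in [1, 5]; a dummy constant keeps the sketch runnable.
    """
    return 3.0

def split_into_segments(samples: np.ndarray) -> list:
    """Carve a signal into consecutive 50-frame effective segments."""
    seg_len = SEGMENT_FRAMES * FRAME_LEN
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]

def label_segments(original: np.ndarray, degraded: np.ndarray) -> list:
    """Pair each effective segment with its POLQA-MOS value."""
    seg_len = SEGMENT_FRAMES * FRAME_LEN
    return [(seg, polqa_mos(original[i * seg_len:(i + 1) * seg_len], seg))
            for i, seg in enumerate(split_into_segments(degraded))]
```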
102. Extract acoustic features from the effective voice segments.
The acoustic features of the voice information are divided into time domain information and frequency domain information, the sound waveform can be decomposed into superposition of simple waveforms, and the structure of the waveforms can be accurately measured, so that the voice quality can be accurately evaluated based on the acoustic features.
Acoustic features are extracted from the degraded speech information. Specifically, the following acoustic features are used:
2 dimensions: signal-to-noise ratio and multiplicative noise;
4 dimensions: first- and second-order differences of the above 2 features;
15 dimensions: accumulated energy parameters of 15 filter banks;
30 dimensions: first- and second-order differences of the 15 filter-bank energy parameters;
36 dimensions: LPC prediction coefficients (12 dimensions) and their corresponding first- and second-order differences;
3 dimensions: fundamental frequency and its corresponding first- and second-order differences;
1 dimension: signal discontinuity.
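The layout above totals 2 + 4 + 15 + 30 + 36 + 3 + 1 = 91 dimensions per frame. The sketch below only shows how such vectors might be assembled once the per-frame statics (signal-to-noise ratio and multiplicative noise, filter-bank energies, LPC coefficients, fundamental frequency, discontinuity flag) have been computed; the low-level extractors themselves are outside the patent text and are taken as given inputs here.

```python
import numpy as np

def add_deltas(static: np.ndarray) -> np.ndarray:
    """Append first- and second-order differences along the time axis.

    static: (T, d) per-frame features -> (T, 3*d) with deltas appended.
    """
    d1 = np.diff(static, axis=0, prepend=static[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.hstack([static, d1, d2])

def segment_features(noise2, fbank15, lpc12, f0, discont):
    """Assemble the 91-dim per-frame vectors for one effective segment.

    Inputs are per-frame statics of shapes (T, 2), (T, 15), (T, 12),
    (T, 1) and (T, 1) respectively, where T is the number of frames.
    """
    feats = np.hstack([
        add_deltas(noise2),   # SNR, multiplicative noise + deltas:  6 dims
        add_deltas(fbank15),  # filter-bank energies + deltas:      45 dims
        add_deltas(lpc12),    # LPC coefficients + deltas:          36 dims
        add_deltas(f0),       # fundamental frequency + deltas:      3 dims
        discont,              # signal discontinuity flag:           1 dim
    ])
    assert feats.shape[1] == 91
    return feats
```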
The voice quality attribute of each frame of data is characterized by its acoustic features. Since the effective voice segments belong to the degraded voice information, their acoustic features can be extracted in the course of extracting the acoustic features of the degraded voice information.
103. Establish a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features, using a deep learning algorithm.
A voice quality evaluation model is established from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to those features by adopting a deep learning algorithm. The deep learning algorithm may be the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
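To make the network that the following steps train concrete, here is a minimal NumPy sketch of an N-layer model with N = 5, the value the embodiment suggests below: an input layer taking the 91-dimensional features, three hidden layers, and a one-unit output layer that regresses the MOS value. The hidden-layer sizes and the sigmoid activation are this sketch's assumptions; the patent fixes only N and the roles of the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 5 layers: 91-dim feature input -> three hidden layers -> 1 MOS output.
# Hidden sizes (128, 64, 32) are illustrative, not taken from the patent.
LAYER_SIZES = [91, 128, 64, 32, 1]

def init_weights(sizes=LAYER_SIZES):
    """Randomly initialize the per-layer weights (cf. step 401 below)."""
    return [rng.normal(0.0, 0.1, size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass; returns the activations of every layer.

    x: (batch, 91) acoustic features; the last activation is the
    network's output, kept linear so it can regress MOS values.
    """
    acts = [x]
    for i, w in enumerate(weights):
        z = acts[-1] @ w
        acts.append(z if i == len(weights) - 1 else sigmoid(z))
    return acts
```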
104. Input the acoustic features of voice data into the voice quality evaluation model and acquire the ePOLQA-MOS value of the voice data.
Once the voice quality evaluation model is established, it encodes the correspondence between the acoustic features of voice data and POLQA-MOS values. The acoustic features of voice data are input into the voice quality evaluation model to obtain the ePOLQA-MOS value of the voice data; that is, even in the absence of a known reference signal, the MOS value of an input speech signal can be evaluated.
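Scoring then needs no reference signal: it is a single forward pass through the trained network using the forward() helper from the sketch above. Averaging the per-row outputs into one score is this sketch's assumption; the description only says that acoustic features are input and an ePOLQA-MOS value is obtained.

```python
def epolqa_mos(feats, weights):
    """Predict the ePOLQA-MOS value from acoustic features.

    feats: (n_frames, 91) feature rows of the voice data; the mean of
    the per-row outputs is taken as the score (an assumption here).
    """
    return float(forward(feats, weights)[-1].mean())
```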
In summary, in the embodiment of the present invention, the POLQA-MOS value corresponding to each effective speech segment is obtained according to the POLQA algorithm; then a deep learning algorithm is used to establish a voice quality evaluation model from the acoustic features of the effective speech segments and the POLQA-MOS values corresponding to those features; finally, the acoustic features of voice data are input into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data.
Because the deep learning algorithm makes full use of the effective voice segments for modeling, the accuracy of voice quality evaluation can be improved. Moreover, since measurement and evaluation are based on acoustic features, the result truly reflects the call quality of network users.
The process of establishing the speech quality assessment model by using the deep learning algorithm is explained in detail below.
Fig. 2 is a schematic flow chart of establishing a speech quality assessment model in the embodiment of the present invention, which specifically includes:
the process of 201-203 is a detailed step of training and establishing a speech quality evaluation model by using the convergence condition that the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold.
201. Establish an N-layer network and determine the weights of each layer except the input layer and the output layer, where N is greater than 2.
First, an N-layer network is established according to practical experience and experimental results. It specifically comprises an input layer, hidden layers and an output layer, with N greater than 2; that is, there is at least one hidden layer. Once the hidden-layer weights are determined, a voice quality evaluation model is initially established.
The process of determining the weights of each layer other than the input layer and the output layer, i.e. determining the hidden-layer weights, is shown in steps 401-403 below.
202. Calculate the output result of the output layer with a deep learning algorithm, according to the acoustic features of the effective voice segments and the per-layer weights; update the per-layer weights using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculate the output result again; repeat until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold.
First, the output result of the output layer is calculated with a deep learning algorithm from the acoustic features of the effective voice segments and the per-layer weights. The voice quality evaluation model is then trained with the output result and the POLQA-MOS value corresponding to the acoustic feature, i.e. the per-layer weights are updated. After each update, the updated per-layer weights are used to recalculate the output result. Updating stops once the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than the preset threshold. A sketch of this loop follows.
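As one concrete reading of step 202, the sketch below runs the convergence loop; it reuses forward() and init_weights() from the model sketch above, while backward() is defined in the backpropagation sketch after step 303. The learning rate, threshold and iteration cap are illustrative assumptions, not values from the patent.

```python
import numpy as np

def train_until_threshold(x, target_mos, weights,
                          alpha=0.01, threshold=0.05, max_iter=10000):
    """Step 202: update weights until |output - POLQA-MOS| < threshold.

    x: (batch, 91) acoustic features of effective segments;
    target_mos: (batch, 1) POLQA-MOS labels for those segments.
    """
    for _ in range(max_iter):
        acts = forward(x, weights)
        diff = acts[-1] - target_mos            # output minus POLQA-MOS
        if np.mean(np.abs(diff)) < threshold:   # preset convergence threshold
            break
        weights = backward(acts, diff, weights, alpha)
    return weights
```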
The following is an analysis and description of a specific process for updating the weights of the networks in each layer, and refer to fig. 3, which is a schematic flow chart of updating the weights of the networks in each layer in the embodiment of the present invention, and specifically includes:
301. Determine the network parameter gradient of the output layer using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature.
The difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is computed, and its quotient with the learning rate α is the network parameter gradient of the output layer. Here α is preset according to practical experience. (The defining equation is rendered only as an image in the original; read from the prose, the output-layer gradient is (output - POLQA-MOS) / α.)
302. Determine the network parameter gradient of layer N-1 according to the network parameter gradient of the output layer and the per-layer weights.
After the per-layer weights are determined in step 201, the output result of layer N-1 can be calculated in sequence from the output result of the output layer and the per-layer weights; the network parameter gradient of layer N-1 is then obtained using equation 1 (rendered only as an image in the original).
303. Update the weight of layer N-1 and the network parameter gradient of layer N-1 according to the network parameter gradient of layer N and the weight of layer N-1, where N is greater than 2 and layer N is the output layer.
The N-layer network comprises an input layer, N-2 hidden layers and an output layer; that is, layer 1 is the input layer, layers 2 through N-1 are hidden layers, and layer N is the output layer.
Equations 2 through 5 are likewise rendered only as images in the original; their roles, read from the prose, are: using equation 2, the updated network parameter gradient of layer N-1 is obtained from the network parameter gradient of the output layer and the network parameter gradient of layer N-1; using equation 3, the updated weight W_{N-1,new} of layer N-1 is obtained from the weight W_{N-1} of layer N-1 and the updated network parameter gradient of layer N-1; using equation 4, the updated network parameter gradient of layer N-2 is obtained from the updated network parameter gradient of layer N-1 and the network parameter gradient of layer N-2; and using equation 5, the updated weight W_{N-2,new} of layer N-2 is obtained from the weight W_{N-2} of layer N-2 and the updated network parameter gradient of layer N-2. The value range of N is greater than 2.
After each update of the per-layer weights, the output result of the output layer is recalculated. When the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than the preset threshold, training of the voice quality evaluation model is complete and updating of the per-layer weights stops.
203. The updated per-layer weights are taken as the adjusted per-layer weights, and a voice quality evaluation model is established according to the adjusted per-layer weights.
When the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than the preset threshold, updating of the per-layer weights stops. The updated weights become the adjusted per-layer weights, from which the voice quality evaluation model is then established.
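Since equations 1-5 survive only as images in the source, the sketch below implements the updates of steps 301-303 as conventional gradient-descent backpropagation, which follows the prose: the output-layer gradient comes from the output/POLQA-MOS difference, gradients are propagated back through layers N-1, N-2, and so on, and each weight is updated from its current value and its layer's gradient. Two details are this sketch's assumptions: the learning rate α multiplies the update (the prose describes a quotient with α), and the sigmoid derivative is used for the hidden layers.

```python
def backward(acts, diff, weights, alpha):
    """One backpropagation pass over all layers (cf. steps 301-303).

    acts: per-layer activations from forward(); diff: output minus the
    POLQA-MOS target. Conventional gradient descent is assumed, since
    the patent's equations 1-5 are only available as images.
    """
    grad = diff                      # output-layer gradient (step 301)
    new_weights = list(weights)
    for i in range(len(weights) - 1, -1, -1):
        a_prev = acts[i]
        # update the weight of layer i from its current value and gradient
        new_weights[i] = weights[i] - alpha * (a_prev.T @ grad) / len(a_prev)
        if i > 0:
            # propagate the gradient one layer back (steps 302-303);
            # acts[i] * (1 - acts[i]) is the sigmoid derivative
            grad = (grad @ weights[i].T) * acts[i] * (1.0 - acts[i])
    return new_weights
```

With init_weights(), forward(), backward() and train_until_threshold() together, establishing the model of fig. 2 reduces to weights = train_until_threshold(x, mos, init_weights()).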
In the embodiment of the invention, an N-layer network is first established; after the hidden-layer weights are determined, they are updated with a deep learning algorithm using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature. When the convergence condition is met, updating stops; the updated hidden-layer weights are taken as the adjusted per-layer weights, and the voice quality evaluation model is established from them. Adjusting the hidden-layer weights against the output result of the output layer and the corresponding POLQA-MOS value in this way improves the accuracy of voice quality evaluation and truly reflects the call quality of network users.
Fig. 4 is a schematic flow chart of determining, in step 201, the weights of each layer other than the input layer and the output layer; the process specifically includes:
Steps 401 to 403 detail how the speech quality evaluation model is trained using, as the convergence condition, that all effective voice segments have been selected.
On the basis of the model obtained by the training of 401-403, the speech quality evaluation model is further refined by steps 201-203.
401. Establish an N-layer network and randomly determine the weights of each layer except the input layer and the output layer, where N is greater than 2.
As in step 201, the N-layer network is established according to practical experience and experimental results and comprises an input layer, hidden layers and an output layer, with at least one hidden layer. The hidden-layer weights are determined randomly; once they are set, a voice quality evaluation model is initially established.
402. Determine the output result of the output layer according to the acoustic features of the effective voice segments and the per-layer weights; update the per-layer weights using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature; and select the acoustic features of a different effective voice segment, together with the updated per-layer weights, to determine the output result again, until all effective voice segments are selected.
The output result of the output layer is determined with a deep learning algorithm from the acoustic features of the effective voice segments and the per-layer weights; here the per-layer weights are the hidden-layer weights.
The first time the output result is determined, the per-layer weights are those randomly chosen in step 401; in each later iteration they are the weights updated in the previous cycle.
The per-layer weights are updated using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature. Since there is more than one effective voice segment, all of them can be fully used for the weight updates: the acoustic features of a different effective voice segment are selected, together with the updated per-layer weights, to determine the output result again, until all effective voice segments have been selected.
That is, the convergence condition for this round of weight updating is that all effective voice segments have been selected, as the sketch below illustrates.
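A sketch of the pass that steps 401-403 describe: weights start random and are updated once per effective voice segment until every segment has been used. It reuses init_weights(), forward() and backward() from the sketches above; the learning rate is again an illustrative assumption.

```python
def pretrain_over_segments(segments, alpha=0.01):
    """Steps 401-403: one weight update per effective voice segment.

    segments: iterable of (features, polqa_mos) pairs, features shaped
    (1, 91). Updating stops once all segments have been selected, which
    is the convergence condition of this phase.
    """
    weights = init_weights()            # step 401: random initialization
    for feats, mos in segments:         # step 402: visit every segment once
        acts = forward(feats, weights)
        weights = backward(acts, acts[-1] - mos, weights, alpha)
    return weights                      # step 403: determined hidden weights
```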
The detailed process of updating the per-layer weights using all effective voice segments follows steps 301 to 303 of fig. 3, exactly as described above: the network parameter gradient of the output layer is determined from the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature; the network parameter gradient of layer N-1 is determined from the output-layer gradient and the per-layer weights; and the weights and network parameter gradients of layers N-1 and N-2 are updated in turn, where N is greater than 2 and layer N is the output layer. After each weight update, the output result of the output layer is recalculated with the acoustic features of the next effective voice segment. The only difference from the procedure of fig. 2 is the stopping condition: here, the per-layer weights are updated using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature until all effective voice segments have been selected, at which point updating stops.
403. When all effective voice segments have been selected, updating stops; the updated per-layer weights are taken as the determined weights of each layer other than the input layer and the output layer.
In the embodiment of the invention, an N-layer network is first established; after the hidden-layer weights are determined, they are updated with a deep learning algorithm using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature. Once all effective voice segments have been selected, updating stops, and the updated hidden-layer weights are taken as the determined weights of each layer other than the input layer and the output layer.
Because the weights of each layer other than the input layer and the output layer are determined from all effective voice segments, the accuracy of voice quality evaluation can be improved, and the call quality of network users can be truly reflected.
Referring to fig. 5, which is a schematic structural diagram of a device for speech quality assessment in an embodiment of the present invention: the device corresponds to the method of speech quality assessment described above and specifically includes:
a calculating module 501, configured to calculate, according to the POLQA algorithm, a POLQA-MOS value corresponding to an effective speech segment in degraded speech information, based on the original speech information and the corresponding degraded speech information obtained by testing with the original speech information;
an extraction module 502, configured to extract acoustic features in the valid speech segments;
the establishing module 503 is configured to establish a voice quality evaluation model by using a deep learning algorithm and using the acoustic features of the valid voice segments and the POLQA-MOS values corresponding to the acoustic features;
an obtaining module 504, configured to input the acoustic features of voice data into the voice quality evaluation model and obtain the ePOLQA-MOS value of the voice data.
Wherein, the deep learning algorithm includes: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
The establishing module 503 is further configured to:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating an output result of an output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weight of each layer network;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
and establishing a voice quality evaluation model according to the adjusted weight of each layer of network.
The detailed process of establishing the speech quality assessment model is shown in 201-203.
Further, the establishing module 503 is further configured to:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining an output result of an output layer according to the acoustic characteristics of the effective voice segments and the weight of each layer network by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
The detailed procedure for determining the weights of each layer other than the input layer and the output layer is shown in steps 401-403.
The establishing module 503 is further configured to:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
The detailed process of updating the weights of the networks of each layer is shown in 301-303.
Wherein N is equal to 5, and a valid speech segment includes 50 or more frames of speech data.
In summary, in the embodiment of the present invention, the POLQA-MOS value corresponding to each effective speech segment is obtained according to the POLQA algorithm; then a deep learning algorithm is used to establish a voice quality evaluation model from the acoustic features of the effective speech segments and the POLQA-MOS values corresponding to those features; finally, the acoustic features of voice data are input into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data.
Because the deep learning algorithm makes full use of the effective voice segments for modeling, the accuracy of voice quality evaluation can be improved. Moreover, since measurement and evaluation are based on acoustic features, the result truly reflects the call quality of network users.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech quality assessment, the method comprising:
according to original voice information and the corresponding degraded voice information obtained by testing with the original voice information, calculating a POLQA-MOS value corresponding to an effective voice segment in the degraded voice information according to the POLQA algorithm;
extracting acoustic features in the effective voice segments;
establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
inputting acoustic features of voice data into a voice quality evaluation model, and acquiring an ePOLQA-MOS value of the voice data;
wherein, the adoption of the deep learning algorithm, the establishment of the speech quality evaluation model by the acoustic features of the effective speech segments and the POLQA-MOS values corresponding to the acoustic features comprises:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
establishing a voice quality evaluation model according to the adjusted weight of each layer of network;
the determining the weight of each layer of network except the input layer and the output layer comprises the following steps:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
2. The method of speech quality assessment according to claim 1, wherein the deep learning algorithm comprises: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
3. The method of claim 1, wherein the updating the weights of the networks of the layers by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature comprises:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
4. The method of speech quality assessment according to claim 1, wherein N is equal to 5.
5. The method of speech quality assessment according to claim 1, wherein the valid speech segments comprise 50 or more frames of speech data.
6. An apparatus for speech quality assessment, the apparatus comprising:
the calculation module is used for calculating a POLQA-MOS value corresponding to an effective voice segment in degraded voice information according to the POLQA algorithm, based on the original voice information and the corresponding degraded voice information obtained by testing with the original voice information;
the extraction module is used for extracting acoustic features in the effective voice fragments;
the establishing module is used for establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
the acquisition module is used for inputting acoustic features of voice data into a voice quality evaluation model and acquiring an ePOLQA-MOS value of the voice data;
wherein the establishing module is further configured to:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
establishing a voice quality evaluation model according to the adjusted weight of each layer of network;
the establishing module is further configured to:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
7. The apparatus for speech quality assessment according to claim 6, wherein the deep learning algorithm comprises: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
8. The apparatus for speech quality assessment according to claim 6, wherein said establishing module is further configured to:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
9. The apparatus for speech quality assessment according to claim 6, wherein N is equal to 5.
10. The apparatus for speech quality assessment according to claim 6, wherein the valid speech segments comprise 50 or more frames of speech data.
CN201710055497.0A (priority date 2017-01-24, filing date 2017-01-24): Voice quality assessment method and device. Granted as CN108346434B (en). Status: Active.

Priority Applications (1)

Application Number: CN201710055497.0A; Priority Date: 2017-01-24; Filing Date: 2017-01-24; Title: Voice quality assessment method and device (granted as CN108346434B)

Applications Claiming Priority (1)

Application Number: CN201710055497.0A; Priority Date: 2017-01-24; Filing Date: 2017-01-24; Title: Voice quality assessment method and device (granted as CN108346434B)

Publications (2)

Publication Number Publication Date
CN108346434A: 2018-07-31
CN108346434B: 2020-12-22

Family

ID=62962957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710055497.0A Active CN108346434B (en) 2017-01-24 2017-01-24 Voice quality assessment method and device

Country Status (1)

Country Link
CN (1) CN108346434B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110797046B (en) * 2018-08-02 2022-05-06 中国移动通信集团广东有限公司 Method and device for establishing prediction model of voice quality MOS value
CN109065072B (en) * 2018-09-30 2019-12-17 中国科学院声学研究所 voice quality objective evaluation method based on deep neural network
US20220060963A1 (en) * 2018-12-11 2022-02-24 Telefonaktiebolaget Lm Ericsson (Publ) Technique for user plane traffic quality analysis
CN111326169B (en) * 2018-12-17 2023-11-10 中国移动通信集团北京有限公司 Voice quality evaluation method and device
CN111383657A (en) * 2018-12-27 2020-07-07 中国移动通信集团辽宁有限公司 Voice quality evaluation method, device, equipment and medium
CN109830247A (en) * 2019-03-22 2019-05-31 北京百度网讯科技有限公司 Method and apparatus for test call quality
WO2022103290A1 (en) 2020-11-12 2022-05-19 "Stc"-Innovations Limited" Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems
CN112530457A (en) * 2020-12-21 2021-03-19 招商局重庆交通科研设计院有限公司 Tunnel emergency broadcast sound effect information evaluation method
CN113411456B (en) * 2021-06-29 2023-05-02 中国人民解放军63892部队 Voice quality assessment method and device based on voice recognition
CN116564351B (en) * 2023-04-03 2024-01-23 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252865A (en) * 2013-06-25 2014-12-31 中国移动通信集团公司 Voice service quality evaluation method and device
CN104517613A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Method and device for evaluating speech quality

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044248B (en) * 2009-10-10 2012-07-04 北京理工大学 Objective evaluating method for audio quality of streaming media
US9679555B2 (en) * 2013-06-26 2017-06-13 Qualcomm Incorporated Systems and methods for measuring speech signal quality
CN103957216B (en) * 2014-05-09 2017-10-03 武汉大学 Based on characteristic audio signal classification without reference audio quality evaluating method and system
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN106531190B (en) * 2016-10-12 2020-05-05 科大讯飞股份有限公司 Voice quality evaluation method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252865A (en) * 2013-06-25 2014-12-31 中国移动通信集团公司 Voice service quality evaluation method and device
CN104517613A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Method and device for evaluating speech quality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kapilan Radhakrishnan et al.; "Evaluating perceived voice quality on packet networks using different random neural network architectures"; Performance Evaluation; vol. 68; pp. 347-360; 2011-01-18 *

Also Published As

Publication number Publication date
CN108346434A (en) 2018-07-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant