CN108346434B - Voice quality assessment method and device - Google Patents

Voice quality assessment method and device

Info

Publication number
CN108346434B
CN108346434B (application CN201710055497.0A)
Authority
CN
China
Prior art keywords
layer
network
voice
output
weight
Prior art date
Legal status
Active
Application number
CN201710055497.0A
Other languages
Chinese (zh)
Other versions
CN108346434A (en)
Inventor
祁俊杰
王丽莉
李坤滋
赵艳琼
王君诚
莫一鸣
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Anhui Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Anhui Co Ltd
Priority date: 2017-01-24
Filing date: 2017-01-24
Publication date: 2020-12-22
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Anhui Co Ltd
Priority to CN201710055497.0A
Publication of CN108346434A
Application granted
Publication of CN108346434B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/08: Testing, supervising or monitoring using real traffic

Abstract

A method and apparatus for speech quality assessment, comprising: according to original voice information and the corresponding degraded voice information obtained by testing with the original voice information, calculating a POLQA-MOS value corresponding to each effective voice segment in the degraded voice information according to the POLQA algorithm; extracting acoustic features from the effective voice segments; establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm; and inputting the acoustic features of voice data into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data. The accuracy of voice quality evaluation can be improved, and the call quality of network users can be truly reflected.

Description

Voice quality assessment method and device
Technical Field
The present invention relates to the field of communications, and in particular, to a method and an apparatus for speech quality assessment.
Background
Voice quality in mobile communication networks has two aspects: intelligibility, i.e. how well the words and sentences in the speech can be understood, and naturalness, i.e. how well the speaker can be recognized. With the continuous development of communication, multimedia, computer, office-automation and artificial intelligence technologies, speech signal processing has gradually penetrated every aspect of social life and has become one of the core technologies affecting the development of modern society. The performance of speech coding and decoding, speech communication, speech processing and similar systems has likewise become an important factor in the exchange of information in modern society.
Speech quality assessment methods fall into two main categories: subjective evaluation and objective evaluation. Subjective assessment is the evaluation of speech quality by human listeners; since humans are the ultimate recipients of speech, this evaluation is a true reflection of speech quality. The Mean Opinion Score (MOS), proposed by the International Telecommunication Union (ITU) in 1996, is a widely used subjective evaluation method that reflects human perception of speech quality through the mean opinion score of the testers. Speech quality is graded on five levels: excellent (5 points), good (4 points), fair (3 points), poor (2 points) and bad (1 point). The MOS scoring method has two advantages: first, since the quality of coding systems is ranked numerically, coding systems with different distortion types can be compared with each other; second, evaluators need only simple training to participate directly, so the evaluation is easy to carry out.
However, subjective MOS evaluation is time-consuming and labor-intensive, insufficiently flexible, poorly repeatable and strongly influenced by human subjectivity. To overcome these disadvantages, research has turned to objective evaluation methods for speech quality, which aim to predict the subjective evaluation value quickly and accurately.
Currently, in the field of mobile communication, objective evaluation methods have become the mainstream of speech quality assessment. They are classified as active (intrusive) or passive (non-intrusive).
Active evaluation generally refers to double-ended comparison methods such as Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Objective Listening Quality Assessment (POLQA). These methods require investment in drive-test equipment and human resources, are only suitable for evaluating road scenes, are time-consuming and labor-intensive, and cannot provide a comprehensive evaluation of the network.
Passive evaluation is currently based on the E-Model algorithm, which relies on network-side indicator statistics. Its principle is to fit network indicators such as packet loss rate, jitter and delay rather than to analyze the voice itself; the algorithm is simple, its accuracy is low, and it cannot reflect the user's real perception of a call. The P.563 algorithm published by ITU-T in 2004 is an evaluation method based on single-ended output speech, but because it targets narrowband speech and its model is too simple, it cannot accurately and reasonably evaluate the quality of mobile-network wideband speech.
In summary, in the prior art, the accuracy of evaluating mobile-network wideband voice quality is not high, and the call quality of network users cannot be truly reflected.
Disclosure of Invention
The embodiment of the invention provides a voice quality evaluation method, which can improve the accuracy of evaluating voice quality and further truly reflect the call quality of network users.
The embodiment of the invention also provides a device for evaluating the voice quality, which can improve the accuracy of evaluating the voice quality and further truly reflect the call quality of network users.
A method of speech quality assessment, the method comprising:
according to original voice information and the corresponding degraded voice information obtained by testing with the original voice information, calculating a POLQA-MOS value corresponding to an effective voice segment in the degraded voice information according to the POLQA algorithm;
extracting acoustic features in the effective voice segments;
establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
and inputting the acoustic features of voice data into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data.
Optionally, the deep learning algorithm includes: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
Optionally, the establishing, by using a deep learning algorithm, a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features includes:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
and establishing a voice quality evaluation model according to the adjusted weight of each layer of network.
Optionally, the determining the weight of each layer of the network except for the input layer and the output layer includes:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
Optionally, the updating the weights of the networks of the respective layers by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature includes:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
Optionally, N is equal to 5.
Optionally, the valid speech segment includes 50 or more frames of speech data.
An apparatus for speech quality assessment, the apparatus comprising:
the calculation module is used for calculating a POLQA-MOS value corresponding to an effective voice segment in degraded voice information according to the POLQA algorithm, based on the original voice information and the corresponding degraded voice information obtained by testing with the original voice information;
the extraction module is used for extracting acoustic features in the effective voice fragments;
the establishing module is used for establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
and the acquisition module is used for inputting the acoustic features of voice data into the voice quality evaluation model and acquiring the ePOLQA-MOS value of the voice data.
Optionally, the deep learning algorithm includes: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
Optionally, the establishing module is further configured to:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
and establishing a voice quality evaluation model according to the adjusted weight of each layer of network.
Optionally, the establishing module is further configured to:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
Optionally, the establishing module is further configured to:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
Optionally, N is equal to 5.
Optionally, the valid speech segment includes 50 or more frames of speech data.
It can be seen from the above technical solutions that, in the embodiment of the present invention, a POLQA-MOS value corresponding to each effective voice segment in the degraded voice information is calculated according to the POLQA algorithm, from the original voice information and the corresponding degraded voice information obtained by testing with the original voice information; acoustic features are extracted from the effective voice segments; a voice quality evaluation model is established from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features, using a deep learning algorithm; and the acoustic features of voice data are input into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data. Because the voice quality evaluation model is established using a deep learning algorithm, the accuracy of evaluating voice quality can be improved, and the call quality of network users can be truly reflected.
Drawings
The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.
FIG. 1 is a flow chart of a method for speech quality assessment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of establishing a speech quality assessment model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of updating weights of networks in each layer according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of determining weights of networks in layers other than the input layer and the output layer according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for speech quality assessment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
In the embodiment of the invention, the POLQA-MOS value corresponding to each effective voice segment is obtained according to the POLQA algorithm; a deep learning algorithm is then used to establish a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to those features. Because the deep learning algorithm makes full use of the effective voice segments for modeling, the accuracy of voice quality evaluation can be improved. Moreover, since measurement and evaluation are based on acoustic features, the result truly reflects the call quality of network users.
The embodiment of the invention is suitable for evaluating voice quality in mobile 2G/3G networks and VoLTE voice quality in 4G networks.
Fig. 1 is a schematic flow chart of a method for evaluating speech quality in an embodiment, which specifically includes:
101. Calculate, according to the POLQA algorithm, a POLQA-MOS value corresponding to each effective voice segment in the degraded voice information, based on the original voice information and the corresponding degraded voice information obtained by testing with the original voice information.
The POLQA algorithm is the ITU-T P.863 standard; it obtains a MOS score by predicting user perception and is one of the most widely recognized international speech quality assessment standards. The POLQA algorithm first adjusts the levels of the original voice and the degraded voice to a uniform volume; second, it applies IRS filtering to fit the voice transmission of a telephone receiver; the MOS score is then obtained through evaluation by a psychoacoustic model.
The POLQA algorithm obtains the MOS score by comparing a known reference signal with the degraded signal that has passed through the system under test; it cannot calculate a MOS value from single-ended speech. That is, without a known reference signal, the MOS value of an input speech signal cannot be evaluated.
In an embodiment of the invention, at least 2000 hours of data are first collected in a high-fidelity recording studio as clean original speech. Dialing tests are then carried out using the original speech, and the corresponding RTP data packets are collected and restored at the Mb interface of the LTE network to obtain the degraded speech. The original speech information of the original speech thus corresponds to the degraded speech information of the degraded speech.
The degraded speech is divided into a plurality of speech segments. The POLQA value of the degraded speech information is calculated according to the POLQA algorithm, yielding the POLQA-MOS value corresponding to each effective voice segment in the degraded speech information. This gives the correspondence between effective voice segments and POLQA-MOS values.
A single frame of data covers only about 10 ms to 50 ms, and it is difficult to learn the MOS attribute from such a short speech sample. A segment is therefore extended to at least 50 frames: when an effective voice segment contains 50 or more frames of speech data, the MOS attribute can be learned relatively easily.
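To make the segmentation concrete, here is a minimal sketch of carving degraded speech into 50-frame effective segments and pairing each with a POLQA-MOS label. The frame length, sample rate and polqa_mos() scorer are all stand-ins: POLQA itself is the licensed ITU-T P.863 algorithm, so only a dummy stub is shown here.

```python
import numpy as np

FRAME_LEN = 160        # samples per frame: 10 ms at 16 kHz (assumed values)
SEGMENT_FRAMES = 50    # an effective segment spans at least 50 frames

def polqa_mos(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Stub for a licensed ITU-T P.863 (POLQA) scorer.

    A real implementation compares the reference and degraded signals and
    returns a MOS in [1, 5]; a dummy constant keeps the sketch runnable.
    """
    return 3.0

def split_into_segments(samples: np.ndarray) -> list:
    """Carve a signal into consecutive 50-frame effective segments."""
    seg_len = SEGMENT_FRAMES * FRAME_LEN
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]

def label_segments(original: np.ndarray, degraded: np.ndarray) -> list:
    """Pair each effective segment with its POLQA-MOS value."""
    seg_len = SEGMENT_FRAMES * FRAME_LEN
    return [(seg, polqa_mos(original[i * seg_len:(i + 1) * seg_len], seg))
            for i, seg in enumerate(split_into_segments(degraded))]
```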
102. Extract acoustic features from the effective voice segments.
The acoustic features of the voice information are divided into time domain information and frequency domain information, the sound waveform can be decomposed into superposition of simple waveforms, and the structure of the waveforms can be accurately measured, so that the voice quality can be accurately evaluated based on the acoustic features.
Acoustic features are extracted from the degraded speech information. Specifically, the following acoustic features are used:
2 dimensions: signal-to-noise ratio and multiplicative noise;
4 dimensions: first- and second-order differences of the above 2 features;
15 dimensions: accumulated energy parameters of 15 filter banks;
30 dimensions: first- and second-order differences of the 15 filter-bank energy parameters;
36 dimensions: LPC prediction coefficients (12 dimensions) and their corresponding first- and second-order differences;
3 dimensions: fundamental frequency and its corresponding first- and second-order differences;
1 dimension: signal discontinuity.
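The layout above totals 2 + 4 + 15 + 30 + 36 + 3 + 1 = 91 dimensions per frame. The sketch below only shows how such vectors might be assembled once the per-frame statics (signal-to-noise ratio and multiplicative noise, filter-bank energies, LPC coefficients, fundamental frequency, discontinuity flag) have been computed; the low-level extractors themselves are outside the patent text and are taken as given inputs here.

```python
import numpy as np

def add_deltas(static: np.ndarray) -> np.ndarray:
    """Append first- and second-order differences along the time axis.

    static: (T, d) per-frame features -> (T, 3*d) with deltas appended.
    """
    d1 = np.diff(static, axis=0, prepend=static[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.hstack([static, d1, d2])

def segment_features(noise2, fbank15, lpc12, f0, discont):
    """Assemble the 91-dim per-frame vectors for one effective segment.

    Inputs are per-frame statics of shapes (T, 2), (T, 15), (T, 12),
    (T, 1) and (T, 1) respectively, where T is the number of frames.
    """
    feats = np.hstack([
        add_deltas(noise2),   # SNR, multiplicative noise + deltas:  6 dims
        add_deltas(fbank15),  # filter-bank energies + deltas:      45 dims
        add_deltas(lpc12),    # LPC coefficients + deltas:          36 dims
        add_deltas(f0),       # fundamental frequency + deltas:      3 dims
        discont,              # signal discontinuity flag:           1 dim
    ])
    assert feats.shape[1] == 91
    return feats
```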
The voice quality attribute of each frame of data is characterized by its acoustic features. Since the effective voice segments belong to the degraded voice information, their acoustic features can be extracted in the course of extracting the acoustic features of the degraded voice information.
103. Establish a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features, using a deep learning algorithm.
A voice quality evaluation model is established from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to those features by adopting a deep learning algorithm. The deep learning algorithm may be the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
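To make the network that the following steps train concrete, here is a minimal NumPy sketch of an N-layer model with N = 5, the value the embodiment suggests below: an input layer taking the 91-dimensional features, three hidden layers, and a one-unit output layer that regresses the MOS value. The hidden-layer sizes and the sigmoid activation are this sketch's assumptions; the patent fixes only N and the roles of the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 5 layers: 91-dim feature input -> three hidden layers -> 1 MOS output.
# Hidden sizes (128, 64, 32) are illustrative, not taken from the patent.
LAYER_SIZES = [91, 128, 64, 32, 1]

def init_weights(sizes=LAYER_SIZES):
    """Randomly initialize the per-layer weights (cf. step 401 below)."""
    return [rng.normal(0.0, 0.1, size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass; returns the activations of every layer.

    x: (batch, 91) acoustic features; the last activation is the
    network's output, kept linear so it can regress MOS values.
    """
    acts = [x]
    for i, w in enumerate(weights):
        z = acts[-1] @ w
        acts.append(z if i == len(weights) - 1 else sigmoid(z))
    return acts
```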
104. Input the acoustic features of voice data into the voice quality evaluation model and acquire the ePOLQA-MOS value of the voice data.
Once the voice quality evaluation model is established, it encodes the correspondence between the acoustic features of voice data and POLQA-MOS values. The acoustic features of voice data are input into the voice quality evaluation model to obtain the ePOLQA-MOS value of the voice data; that is, even in the absence of a known reference signal, the MOS value of an input speech signal can be evaluated.
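Scoring then needs no reference signal: it is a single forward pass through the trained network using the forward() helper from the sketch above. Averaging the per-row outputs into one score is this sketch's assumption; the description only says that acoustic features are input and an ePOLQA-MOS value is obtained.

```python
def epolqa_mos(feats, weights):
    """Predict the ePOLQA-MOS value from acoustic features.

    feats: (n_frames, 91) feature rows of the voice data; the mean of
    the per-row outputs is taken as the score (an assumption here).
    """
    return float(forward(feats, weights)[-1].mean())
```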
In summary, in the embodiment of the present invention, the POLQA-MOS value corresponding to each effective speech segment is obtained according to the POLQA algorithm; then a deep learning algorithm is used to establish a voice quality evaluation model from the acoustic features of the effective speech segments and the POLQA-MOS values corresponding to those features; finally, the acoustic features of voice data are input into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data.
Because the deep learning algorithm makes full use of the effective voice segments for modeling, the accuracy of voice quality evaluation can be improved. Moreover, since measurement and evaluation are based on acoustic features, the result truly reflects the call quality of network users.
The process of establishing the speech quality assessment model by using the deep learning algorithm is explained in detail below.
Fig. 2 is a schematic flow chart of establishing a speech quality assessment model in the embodiment of the present invention, which specifically includes:
the process of 201-203 is a detailed step of training and establishing a speech quality evaluation model by using the convergence condition that the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold.
201. Establish an N-layer network and determine the weights of each layer except the input layer and the output layer, where N is greater than 2.
First, an N-layer network is established according to practical experience and experimental results. It specifically comprises an input layer, hidden layers and an output layer, with N greater than 2; that is, there is at least one hidden layer. Once the hidden-layer weights are determined, a voice quality evaluation model is initially established.
The process of determining the weights of each layer other than the input layer and the output layer, i.e. determining the hidden-layer weights, is shown in steps 401-403 below.
202. Calculate the output result of the output layer with a deep learning algorithm, according to the acoustic features of the effective voice segments and the per-layer weights; update the per-layer weights using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculate the output result again; repeat until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold.
First, the output result of the output layer is calculated with a deep learning algorithm from the acoustic features of the effective voice segments and the per-layer weights. The voice quality evaluation model is then trained with the output result and the POLQA-MOS value corresponding to the acoustic feature, i.e. the per-layer weights are updated. After each update, the updated per-layer weights are used to recalculate the output result. Updating stops once the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than the preset threshold. A sketch of this loop follows.
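As one concrete reading of step 202, the sketch below runs the convergence loop; it reuses forward() and init_weights() from the model sketch above, while backward() is defined in the backpropagation sketch after step 303. The learning rate, threshold and iteration cap are illustrative assumptions, not values from the patent.

```python
import numpy as np

def train_until_threshold(x, target_mos, weights,
                          alpha=0.01, threshold=0.05, max_iter=10000):
    """Step 202: update weights until |output - POLQA-MOS| < threshold.

    x: (batch, 91) acoustic features of effective segments;
    target_mos: (batch, 1) POLQA-MOS labels for those segments.
    """
    for _ in range(max_iter):
        acts = forward(x, weights)
        diff = acts[-1] - target_mos            # output minus POLQA-MOS
        if np.mean(np.abs(diff)) < threshold:   # preset convergence threshold
            break
        weights = backward(acts, diff, weights, alpha)
    return weights
```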
The following is an analysis and description of a specific process for updating the weights of the networks in each layer, and refer to fig. 3, which is a schematic flow chart of updating the weights of the networks in each layer in the embodiment of the present invention, and specifically includes:
301. Determine the network parameter gradient of the output layer using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature.
The difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is computed, and its quotient with the learning rate α is the network parameter gradient of the output layer. Here α is preset according to practical experience. (The defining equation is rendered only as an image in the original; read from the prose, the output-layer gradient is (output - POLQA-MOS) / α.)
302. Determine the network parameter gradient of layer N-1 according to the network parameter gradient of the output layer and the per-layer weights.
After the per-layer weights are determined in step 201, the output result of layer N-1 can be calculated in sequence from the output result of the output layer and the per-layer weights; the network parameter gradient of layer N-1 is then obtained using equation 1 (rendered only as an image in the original).
303. Update the weight of layer N-1 and the network parameter gradient of layer N-1 according to the network parameter gradient of layer N and the weight of layer N-1, where N is greater than 2 and layer N is the output layer.
The N-layer network comprises an input layer, N-2 hidden layers and an output layer; that is, layer 1 is the input layer, layers 2 through N-1 are hidden layers, and layer N is the output layer.
Equations 2 through 5 are likewise rendered only as images in the original; their roles, read from the prose, are: using equation 2, the updated network parameter gradient of layer N-1 is obtained from the network parameter gradient of the output layer and the network parameter gradient of layer N-1; using equation 3, the updated weight W_{N-1,new} of layer N-1 is obtained from the weight W_{N-1} of layer N-1 and the updated network parameter gradient of layer N-1; using equation 4, the updated network parameter gradient of layer N-2 is obtained from the updated network parameter gradient of layer N-1 and the network parameter gradient of layer N-2; and using equation 5, the updated weight W_{N-2,new} of layer N-2 is obtained from the weight W_{N-2} of layer N-2 and the updated network parameter gradient of layer N-2. The value range of N is greater than 2.
After each update of the per-layer weights, the output result of the output layer is recalculated. When the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than the preset threshold, training of the voice quality evaluation model is complete and updating of the per-layer weights stops.
203. The updated per-layer weights are taken as the adjusted per-layer weights, and a voice quality evaluation model is established according to the adjusted per-layer weights.
When the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than the preset threshold, updating of the per-layer weights stops. The updated weights become the adjusted per-layer weights, from which the voice quality evaluation model is then established.
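Since equations 1-5 survive only as images in the source, the sketch below implements the updates of steps 301-303 as conventional gradient-descent backpropagation, which follows the prose: the output-layer gradient comes from the output/POLQA-MOS difference, gradients are propagated back through layers N-1, N-2, and so on, and each weight is updated from its current value and its layer's gradient. Two details are this sketch's assumptions: the learning rate α multiplies the update (the prose describes a quotient with α), and the sigmoid derivative is used for the hidden layers.

```python
def backward(acts, diff, weights, alpha):
    """One backpropagation pass over all layers (cf. steps 301-303).

    acts: per-layer activations from forward(); diff: output minus the
    POLQA-MOS target. Conventional gradient descent is assumed, since
    the patent's equations 1-5 are only available as images.
    """
    grad = diff                      # output-layer gradient (step 301)
    new_weights = list(weights)
    for i in range(len(weights) - 1, -1, -1):
        a_prev = acts[i]
        # update the weight of layer i from its current value and gradient
        new_weights[i] = weights[i] - alpha * (a_prev.T @ grad) / len(a_prev)
        if i > 0:
            # propagate the gradient one layer back (steps 302-303);
            # acts[i] * (1 - acts[i]) is the sigmoid derivative
            grad = (grad @ weights[i].T) * acts[i] * (1.0 - acts[i])
    return new_weights
```

With init_weights(), forward(), backward() and train_until_threshold() together, establishing the model of fig. 2 reduces to weights = train_until_threshold(x, mos, init_weights()).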
In the embodiment of the invention, an N-layer network is first established; after the hidden-layer weights are determined, they are updated with a deep learning algorithm using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature. When the convergence condition is met, updating stops; the updated hidden-layer weights are taken as the adjusted per-layer weights, and the voice quality evaluation model is established from them. Adjusting the hidden-layer weights against the output result of the output layer and the corresponding POLQA-MOS value in this way improves the accuracy of voice quality evaluation and truly reflects the call quality of network users.
Fig. 4 is a schematic flow chart of determining, in step 201, the weights of each layer other than the input layer and the output layer; the process specifically includes:
Steps 401 to 403 detail how the speech quality evaluation model is trained using, as the convergence condition, that all effective voice segments have been selected.
On the basis of the model obtained by the training of 401-403, the speech quality evaluation model is further refined by steps 201-203.
401. Establish an N-layer network and randomly determine the weights of each layer except the input layer and the output layer, where N is greater than 2.
As in step 201, the N-layer network is established according to practical experience and experimental results and comprises an input layer, hidden layers and an output layer, with at least one hidden layer. The hidden-layer weights are determined randomly; once they are set, a voice quality evaluation model is initially established.
402. Determine the output result of the output layer according to the acoustic features of the effective voice segments and the per-layer weights; update the per-layer weights using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature; and select the acoustic features of a different effective voice segment, together with the updated per-layer weights, to determine the output result again, until all effective voice segments are selected.
The output result of the output layer is determined with a deep learning algorithm from the acoustic features of the effective voice segments and the per-layer weights; here the per-layer weights are the hidden-layer weights.
The first time the output result is determined, the per-layer weights are those randomly chosen in step 401; in each later iteration they are the weights updated in the previous cycle.
The per-layer weights are updated using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature. Since there is more than one effective voice segment, all of them can be fully used for the weight updates: the acoustic features of a different effective voice segment are selected, together with the updated per-layer weights, to determine the output result again, until all effective voice segments have been selected.
That is, the convergence condition for this round of weight updating is that all effective voice segments have been selected, as the sketch below illustrates.
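A sketch of the pass that steps 401-403 describe: weights start random and are updated once per effective voice segment until every segment has been used. It reuses init_weights(), forward() and backward() from the sketches above; the learning rate is again an illustrative assumption.

```python
def pretrain_over_segments(segments, alpha=0.01):
    """Steps 401-403: one weight update per effective voice segment.

    segments: iterable of (features, polqa_mos) pairs, features shaped
    (1, 91). Updating stops once all segments have been selected, which
    is the convergence condition of this phase.
    """
    weights = init_weights()            # step 401: random initialization
    for feats, mos in segments:         # step 402: visit every segment once
        acts = forward(feats, weights)
        weights = backward(acts, acts[-1] - mos, weights, alpha)
    return weights                      # step 403: determined hidden weights
```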
The detailed process of updating the per-layer weights using all effective voice segments follows steps 301 to 303 of fig. 3, exactly as described above: the network parameter gradient of the output layer is determined from the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature; the network parameter gradient of layer N-1 is determined from the output-layer gradient and the per-layer weights; and the weights and network parameter gradients of layers N-1 and N-2 are updated in turn, where N is greater than 2 and layer N is the output layer. After each weight update, the output result of the output layer is recalculated with the acoustic features of the next effective voice segment. The only difference from the procedure of fig. 2 is the stopping condition: here, the per-layer weights are updated using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature until all effective voice segments have been selected, at which point updating stops.
403. When all effective voice segments have been selected, updating stops; the updated per-layer weights are taken as the determined weights of each layer other than the input layer and the output layer.
In the embodiment of the invention, an N-layer network is first established; after the hidden-layer weights are determined, they are updated with a deep learning algorithm using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature. Once all effective voice segments have been selected, updating stops, and the updated hidden-layer weights are taken as the determined weights of each layer other than the input layer and the output layer.
Because the weights of each layer other than the input layer and the output layer are determined from all effective voice segments, the accuracy of voice quality evaluation can be improved, and the call quality of network users can be truly reflected.
Referring to fig. 5, which is a schematic structural diagram of a device for speech quality assessment in an embodiment of the present invention: the device corresponds to the method of speech quality assessment described above and specifically includes:
a calculating module 501, configured to calculate, according to the POLQA algorithm, a POLQA-MOS value corresponding to an effective speech segment in degraded speech information, based on the original speech information and the corresponding degraded speech information obtained by testing with the original speech information;
an extraction module 502, configured to extract acoustic features in the valid speech segments;
the establishing module 503 is configured to establish a voice quality evaluation model by using a deep learning algorithm and using the acoustic features of the valid voice segments and the POLQA-MOS values corresponding to the acoustic features;
an obtaining module 504, configured to input the acoustic features of voice data into the voice quality evaluation model and obtain the ePOLQA-MOS value of the voice data.
Wherein, the deep learning algorithm includes: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
The establishing module 503 is further configured to:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating an output result of an output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weight of each layer network;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
and establishing a voice quality evaluation model according to the adjusted weight of each layer of network.
The detailed process of establishing the speech quality assessment model is shown in 201-203.
Further, the establishing module 503 is further configured to:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining an output result of an output layer according to the acoustic characteristics of the effective voice segments and the weight of each layer network by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
The detailed procedure for determining the weights of each layer other than the input layer and the output layer is shown in steps 401-403.
The establishing module 503 is further configured to:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
The detailed process of updating the weights of the networks of each layer is shown in 301-303.
Wherein N is equal to 5, and a valid speech segment includes 50 or more frames of speech data.
In summary, in the embodiment of the present invention, the POLQA-MOS value corresponding to each effective speech segment is obtained according to the POLQA algorithm; then a deep learning algorithm is used to establish a voice quality evaluation model from the acoustic features of the effective speech segments and the POLQA-MOS values corresponding to those features; finally, the acoustic features of voice data are input into the voice quality evaluation model to acquire the ePOLQA-MOS value of the voice data.
Because the deep learning algorithm makes full use of the effective voice segments for modeling, the accuracy of voice quality evaluation can be improved. Moreover, since measurement and evaluation are based on acoustic features, the result truly reflects the call quality of network users.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech quality assessment, the method comprising:
according to original voice information and the corresponding degraded voice information obtained by testing with the original voice information, calculating a POLQA-MOS value corresponding to an effective voice segment in the degraded voice information according to the POLQA algorithm;
extracting acoustic features in the effective voice segments;
establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
inputting acoustic features of voice data into a voice quality evaluation model, and acquiring an ePOLQA-MOS value of the voice data;
wherein, the adoption of the deep learning algorithm, the establishment of the speech quality evaluation model by the acoustic features of the effective speech segments and the POLQA-MOS values corresponding to the acoustic features comprises:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
establishing a voice quality evaluation model according to the adjusted weight of each layer of network;
the determining the weight of each layer of network except the input layer and the output layer comprises the following steps:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
2. The method of speech quality assessment according to claim 1, wherein the deep learning algorithm comprises: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
3. The method of claim 1, wherein the updating the weights of the networks of the layers by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature comprises:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
4. The method of speech quality assessment according to claim 1, wherein N is equal to 5.
5. The method of speech quality assessment according to claim 1, wherein the valid speech segments comprise 50 or more frames of speech data.
6. An apparatus for speech quality assessment, the apparatus comprising:
the calculation module is used for calculating a POLQA-MOS value corresponding to an effective voice segment in degraded voice information according to the POLQA algorithm, based on the original voice information and the corresponding degraded voice information obtained by testing with the original voice information;
the extraction module is used for extracting acoustic features in the effective voice fragments;
the establishing module is used for establishing a voice quality evaluation model from the acoustic features of the effective voice segments and the POLQA-MOS values corresponding to the acoustic features by adopting a deep learning algorithm;
the acquisition module is used for inputting acoustic features of voice data into a voice quality evaluation model and acquiring an ePOLQA-MOS value of the voice data;
wherein the establishing module is further configured to:
establishing an N-layer network, and determining the weight of each layer of network except an input layer and an output layer, wherein N is more than 2;
calculating the output result of the output layer by adopting a deep learning algorithm according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature, and calculating the output result of the output layer again;
until the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature is smaller than a preset threshold value;
the updated weight of each layer of network is used as the adjusted weight of each layer of network;
establishing a voice quality evaluation model according to the adjusted weight of each layer of network;
the establishing module is further configured to:
establishing an N-layer network, and randomly determining the weight of each layer of network;
determining the output result of an output layer according to the acoustic characteristics of the effective voice segments and the weights of the networks of all layers by adopting a deep learning algorithm;
updating the weight of each layer of network by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
selecting the acoustic features of different effective voice segments and the updated weights of the networks of all layers again to determine the output result of the output layer again until all the effective voice segments are selected;
and the updated weights of the networks of all layers are used as the weights for determining the networks of all layers except the input layer and the output layer.
7. The apparatus for speech quality assessment according to claim 6, wherein the deep learning algorithm comprises: the deep neural network (DNN) algorithm, the convolutional neural network (CNN) algorithm, the recurrent neural network (RNN) algorithm or the long short-term memory (LSTM) network algorithm.
8. The apparatus for speech quality assessment according to claim 6, wherein said establishing module is further configured to:
determining the network parameter gradient of the output layer by using the difference between the output result of the output layer and the POLQA-MOS value corresponding to the acoustic feature;
determining the network parameter gradient of the N-1 layer according to the network parameter gradient of the output layer and the weight of each layer of network;
and updating the weight of the N-1 layer and the network parameter gradient of the N-1 layer according to the network parameter gradient of the N layer and the weight of the N-1 layer, wherein N is more than 2, and the N layer is an output layer.
9. The apparatus for speech quality assessment according to claim 6, wherein N is equal to 5.
10. The apparatus for speech quality assessment according to claim 6, wherein the valid speech segments comprise 50 or more frames of speech data.
CN201710055497.0A (priority date 2017-01-24, filing date 2017-01-24): Voice quality assessment method and device. Granted as CN108346434B (en). Status: Active.

Priority Applications (1)

Application Number: CN201710055497.0A; Priority Date: 2017-01-24; Filing Date: 2017-01-24; Title: Voice quality assessment method and device (granted as CN108346434B)

Applications Claiming Priority (1)

Application Number: CN201710055497.0A; Priority Date: 2017-01-24; Filing Date: 2017-01-24; Title: Voice quality assessment method and device (granted as CN108346434B)

Publications (2)

Publication Number Publication Date
CN108346434A: 2018-07-31
CN108346434B: 2020-12-22

Family

ID=62962957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710055497.0A Active CN108346434B (en) 2017-01-24 2017-01-24 Voice quality assessment method and device

Country Status (1)

Country Link
CN (1) CN108346434B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110797046B (en) * 2018-08-02 2022-05-06 中国移动通信集团广东有限公司 Method and device for establishing prediction model of voice quality MOS value
CN109065072B (en) * 2018-09-30 2019-12-17 中国科学院声学研究所 voice quality objective evaluation method based on deep neural network
US20220060963A1 (en) * 2018-12-11 2022-02-24 Telefonaktiebolaget Lm Ericsson (Publ) Technique for user plane traffic quality analysis
CN111326169B (en) * 2018-12-17 2023-11-10 中国移动通信集团北京有限公司 Voice quality evaluation method and device
CN111383657A (en) * 2018-12-27 2020-07-07 中国移动通信集团辽宁有限公司 Voice quality evaluation method, device, equipment and medium
CN109830247A (en) * 2019-03-22 2019-05-31 北京百度网讯科技有限公司 Method and apparatus for test call quality
WO2022103290A1 (en) 2020-11-12 2022-05-19 "Stc"-Innovations Limited" Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems
CN112530457A (en) * 2020-12-21 2021-03-19 招商局重庆交通科研设计院有限公司 Tunnel emergency broadcast sound effect information evaluation method
CN113411456B (en) * 2021-06-29 2023-05-02 中国人民解放军63892部队 Voice quality assessment method and device based on voice recognition
CN116564351B (en) * 2023-04-03 2024-01-23 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252865A (en) * 2013-06-25 2014-12-31 中国移动通信集团公司 Voice service quality evaluation method and device
CN104517613A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Method and device for evaluating speech quality

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044248B (en) * 2009-10-10 2012-07-04 北京理工大学 Objective evaluating method for audio quality of streaming media
US9679555B2 (en) * 2013-06-26 2017-06-13 Qualcomm Incorporated Systems and methods for measuring speech signal quality
CN103957216B (en) * 2014-05-09 2017-10-03 武汉大学 Based on characteristic audio signal classification without reference audio quality evaluating method and system
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
CN106531190B (en) * 2016-10-12 2020-05-05 科大讯飞股份有限公司 Voice quality evaluation method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252865A (en) * 2013-06-25 2014-12-31 中国移动通信集团公司 Voice service quality evaluation method and device
CN104517613A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Method and device for evaluating speech quality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kapilan Radhakrishnan et al.; "Evaluating perceived voice quality on packet networks using different random neural network architectures"; Performance Evaluation; vol. 68; pp. 347-360; 2011-01-18 *

Also Published As

Publication number Publication date
CN108346434A (en) 2018-07-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant