CN111326169B - Voice quality evaluation method and device - Google Patents
- Publication number: CN111326169B
- Application number: CN201811544623.XA
- Authority: CN (China)
- Prior art keywords: voice, evaluated, signal, speech, quality
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03 — characterised by the type of extracted parameters
- G10L25/24 — the extracted parameters being the cepstrum
- G10L25/27 — characterised by the analysis technique
- G10L25/48 — specially adapted for particular use
- G10L25/51 — specially adapted for particular use for comparison or discrimination
- G10L25/69 — specially adapted for particular use for evaluating synthetic or decoded voice signals
Abstract
The application discloses a voice quality evaluation method and device. A voice signal to be evaluated is acquired and compared with stored voice signals; when the difference between them is large, the built-in voice quality evaluation model is updated to obtain a new voice quality evaluation model, which is then used to evaluate the voice signal. By continuously learning from voice signals, the voice quality evaluation model is continuously updated, thereby improving voice evaluation accuracy.
Description
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and an apparatus for evaluating speech quality.
Background
Internet-based voice services have become one of the important network services and a key focus of service providers. Voice quality is an important factor in evaluating the quality of a communication network, so developing an effective voice quality evaluation method is essential.
Currently, voice evaluation methods generally evaluate voice quality using a fixed voice quality evaluation model. The specific method is as follows: characteristic parameters are extracted from voice signals, a voice quality evaluation model is trained on the extracted parameters, and the model is then used to evaluate voice signals. A model trained in this way is fixed and unchangeable, and is only applicable to scenes where the voice environment changes little.
Disclosure of Invention
The application aims to provide a voice quality evaluation method and device so as to improve the accuracy of voice evaluation.
The above object of the application is achieved by the following technical solutions:
in a first aspect, the present application provides a method for evaluating speech quality, including:
acquiring a voice signal to be evaluated, and determining identification information of the voice signal to be evaluated;
if the identification information of the voice signal to be evaluated is different from the stored identification information of the voice signal, the voice signal to be evaluated is used as a new voice signal, and when the number of the new voice signals is larger than a first preset threshold value, a first voice quality evaluation model is updated to obtain a second voice quality evaluation model;
wherein the saved voice signal is a voice signal acquired before the voice signal to be evaluated;
and evaluating the voice signal to be evaluated by using the second voice quality evaluation model.
Optionally, updating the first speech quality evaluation model to obtain a second speech quality evaluation model, including:
acquiring characteristic parameters of the new voice signal;
training the characteristic parameters by utilizing a decision tree algorithm, and updating the first voice quality evaluation model to obtain a second voice quality evaluation model;
the characteristic parameters include at least one of: signal to noise ratio, background noise, noise level, asymmetric interference value of average speech signal spectrum, high frequency flatness analysis, spectrum level range, spectrum level standard deviation, relative noise floor, skewness coefficient of linear prediction coefficient, cepstrum skewness coefficient, voiced sound, average section of back cavity, channel amplitude variation, speech level.
Optionally, after the voice signal to be evaluated is acquired, the method further includes:
and carrying out at least one of the following preprocessing steps on the voice signal to be evaluated: voice data validity detection, voice data normalization processing, and default-value interpolation fitting.
Optionally, after the voice signal to be evaluated is acquired, the method further includes:
evaluating the voice signal to be evaluated according to the first voice quality evaluation model to obtain the voice quality of the voice signal to be evaluated;
classifying the voice quality of the voice signal to be evaluated to obtain voice quality at different interval grades; the different interval grades of voice quality are used to characterize different classes of voice quality.
Optionally, the identification information of the speech signal to be evaluated is different from the identification information of the saved speech signal, including:
the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal and/or the characteristic parameters of the speech signal to be evaluated are different from the characteristic parameters of the already stored speech signal.
Optionally, the voice quality of the voice signal to be evaluated is different from the voice quality of the saved voice signal, including:
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at the same interval grade, and the difference between the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal is greater than a second preset threshold value; or
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at different interval grades.
In a second aspect, the present application provides a speech quality evaluation apparatus, comprising:
an acquisition unit configured to acquire a speech signal to be evaluated;
a determining unit, configured to determine identification information of the speech signal to be evaluated, and when determining that the speech quality of the speech signal to be evaluated is different from the speech quality of the speech signal that has been stored, take the speech signal to be evaluated as a new speech signal;
the updating unit is used for updating the first voice quality evaluation model when the number of the new voice signals is larger than a first preset threshold value to obtain a second voice quality evaluation model;
wherein the saved voice signal is a voice signal acquired before the voice signal to be evaluated;
and the evaluation unit is used for evaluating the voice signal to be evaluated by utilizing the second voice quality evaluation model.
Optionally, the updating unit is specifically configured to update the first speech quality evaluation model to obtain a second speech quality evaluation model in the following manner:
acquiring characteristic parameters of the new voice signal;
training the characteristic parameters by utilizing a decision tree algorithm, and updating the first voice quality evaluation model to obtain a second voice quality evaluation model;
the characteristic parameters include at least one of: signal to noise ratio, background noise, noise level, asymmetric interference value of average speech signal spectrum, high frequency flatness analysis, spectrum level range, spectrum level standard deviation, relative noise floor, skewness coefficient of linear prediction coefficient, cepstrum skewness coefficient, voiced sound, average section of back cavity, channel amplitude variation, speech level.
Optionally, the apparatus further comprises a processing unit for:
and carrying out at least one of the following preprocessing steps on the voice signal to be evaluated: voice data validity detection, voice data normalization processing, and default-value interpolation fitting.
Optionally, the evaluation unit is further configured to:
evaluating the voice signal to be evaluated according to the first voice quality evaluation model to obtain the voice quality of the voice signal to be evaluated;
the processing unit is further configured to:
classifying the voice quality of the voice signal to be evaluated to obtain voice quality at different interval grades; the different interval grades of voice quality are used to characterize different classes of voice quality.
Optionally, the identification information of the speech signal to be evaluated is different from the identification information of the saved speech signal, including:
the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal and/or the characteristic parameters of the speech signal to be evaluated are different from the characteristic parameters of the already stored speech signal.
Optionally, the voice quality of the voice signal to be evaluated is different from the voice quality of the saved voice signal, including:
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at the same interval grade, and the difference between the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal is greater than a second preset threshold value; or
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at different interval grades.
In a third aspect, the present application also provides a device for evaluating speech quality, including:
a memory for storing program instructions;
and a processor, configured to call the program instructions stored in the memory and, according to the obtained program, perform the method of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
The application provides a voice quality evaluation method and device. A voice signal to be evaluated is acquired and compared with stored voice signals; when the difference between them is large, the built-in voice quality evaluation model is updated to obtain a new voice quality evaluation model, and the new model is used to evaluate the voice signal. By continuously learning from voice signals, the voice quality evaluation model is continuously updated, thereby improving voice evaluation accuracy.
Drawings
FIG. 1 is a flowchart of a method for evaluating speech quality according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training classification of a decision tree according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another decision tree training provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for updating an evaluation model of speech quality according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for evaluating speech quality according to an embodiment of the present application;
FIG. 6 is a block diagram of a voice quality evaluation device according to an embodiment of the present application;
fig. 7 is a schematic diagram of another speech quality evaluation apparatus according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will now be described clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
At present, a common speech quality evaluation method is as follows: extracting the parameter characteristics of the voice signal or acquiring other characteristic parameters related to the voice quality, such as network delay, packet loss, jitter and the like, and then carrying out modeling analysis on the characteristic parameters to obtain objective voice quality evaluation.
The modeling is generally performed with a fixed algorithm for a fixed evaluation scene, for example, the Perceptual Evaluation of Speech Quality (PESQ) algorithm for narrowband voice signals, or the Perceptual Objective Listening Quality Assessment (POLQA) algorithm for super-wideband voice evaluation. The voice quality evaluation model established with such an algorithm is a trained linear regression model with a specific mapping method; the objective voice quality evaluation is mapped to the actual perceived quality of listeners, finally yielding a voice quality score.
The existing methods suit scenes where the voice environment changes little, because the parameters used to train the model are limited. In scenes where the voice environment changes greatly, such as on a train, the parameters affecting voice quality vary widely and may not be covered by the limited parameters used to train a fixed voice quality evaluation model, so the accuracy of evaluating voice quality with a fixed model is relatively low.
In view of this, the embodiment of the application provides a method and a device for evaluating voice quality, which are used for continuously acquiring voice signals, continuously updating a built-in evaluation model based on the input voice signals, evaluating the input voice signals and outputting voice quality scores, thereby improving the accuracy of voice quality evaluation.
It is to be understood that the terms "first," "second," and the like, as used herein below, are used solely for the purpose of distinguishing between descriptions and not necessarily for the purpose of indicating or implying a relative importance or order.
The embodiments of the application are not limited by environmental factors and are suitable for various evaluation environments, including highly variable environments and relatively stable environments.
Second, the application scenarios of the embodiments of the present application include, but are not limited to, conventional second-generation (2G)/third-generation (3G) mobile communication calls, fourth-generation (4G) mobile communication calls, and 2G/3G/4G hybrid scenarios.
Fig. 1 is a flowchart of a voice quality evaluation method according to an embodiment of the present application. Referring to fig. 1, the method includes:
s101: and acquiring the voice signal to be evaluated, and determining the identification information of the voice signal to be evaluated.
S102: if the identification information of the voice signal to be evaluated is different from the identification information of the stored voice signal, the voice signal to be evaluated is taken as a new voice signal.
It may be understood that a "new speech signal" in the embodiment of the present application means that the speech signal to be evaluated differs significantly from the speech signals received before it, in which case the speech signal to be evaluated may be marked as a new speech signal.
Specifically, the speech quality of the speech signal to be evaluated differs significantly from the speech quality of the speech signals already stored.
Wherein the speech signal that has been stored in the built-in speech quality evaluation model is a speech signal that was acquired before the speech signal to be evaluated.
In the embodiment of the application, the surrounding voice signals are continuously acquired, so that the voice signals acquired before the voice signals to be evaluated can be used as the reference signals of the voice signals to be evaluated.
It should be noted that the speech quality may include, but is not limited to, a speech quality evaluation score of the speech signal and a speech quality level.
S103: and when the number of the new voice signals is larger than a first preset threshold value, updating the first voice quality evaluation model to obtain a second voice quality evaluation model.
For convenience of description, in the embodiment of the present application, the "built-in speech quality evaluation model" may be referred to as a "first speech quality evaluation model", and the "speech quality evaluation model after the built-in speech quality evaluation model is updated" may be referred to as a "second speech quality evaluation model".
Specifically, in the embodiment of the application, when the voice signal to be evaluated differs greatly from the old voice signals, it is taken as a new sample. When the number of new samples exceeds a preset threshold (the first preset threshold), the built-in voice quality evaluation model is updated to obtain the second voice quality evaluation model.
It should be noted that, in the present application, "new speech signal" and "new sample", as well as "old speech signal" and "already stored speech signal", are sometimes used interchangeably; those skilled in the art will understand that their meanings are consistent.
S104: and evaluating the voice signal to be evaluated by using the second voice quality evaluation model.
Specifically, when the speech signal to be evaluated is a new sample, the speech signal to be evaluated can be evaluated by using the updated second speech quality evaluation model, so as to obtain an accurate speech quality evaluation result.
In the embodiment of the application, voice data are continuously acquired as the external voice environment changes, and the continuously updated data set keeps the voice evaluation model accurate, thereby improving the accuracy of the model.
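The flow of S101-S104 can be sketched in Python as follows. This is a structural illustration only: the `AdaptiveEvaluator` class, the dict-based signals, and the stand-in `retrain` are assumptions for the sketch, not the patent's implementation.

```python
class AdaptiveEvaluator:
    """Sketch of the S101-S104 loop: compare, accumulate new samples, retrain."""

    def __init__(self, model, threshold):
        self.model = model          # first (built-in) evaluation model
        self.threshold = threshold  # first preset threshold
        self.new_samples = []       # new signals seen so far
        self.stored = []            # previously acquired signals

    def differs(self, signal):
        # placeholder for the identification-information comparison (S102)
        return all(signal["id"] != s["id"] for s in self.stored)

    def retrain(self, samples):
        # stand-in for decision-tree retraining on the new samples;
        # the bias term is purely illustrative
        bias = len(samples) * 0.1
        return lambda s: s["score"] + bias

    def evaluate(self, signal):
        if self.differs(signal):
            self.new_samples.append(signal)
            if len(self.new_samples) > self.threshold:
                # S103: update the first model to obtain the second model
                self.model = self.retrain(self.new_samples)
                self.new_samples = []
        self.stored.append(signal)
        # S104: evaluate with the current (possibly updated) model
        return self.model(signal)
```

After enough distinct signals arrive, later evaluations go through the retrained model rather than the built-in one.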
In a possible implementation manner, updating the first speech quality evaluation model to obtain the second speech quality evaluation model may include:
and acquiring characteristic parameters of the new voice signal, training the characteristic parameters by utilizing a decision tree algorithm, and updating the first voice quality evaluation model to obtain a second voice quality evaluation model.
Specifically, in the embodiment of the application, a certain number of characteristic parameters of new voice signals (the number of the new voice signals is greater than a first preset threshold value) can be extracted, or other characteristic parameters related to voice quality can be obtained, and then a new voice quality evaluation model can be obtained through training according to the characteristic parameters.
It will be appreciated that other characteristic parameters associated with voice quality include, but are not limited to, network delay, packet loss, jitter, etc.
The method for obtaining the model by training the characteristic parameters is similar to the existing scheme, and is not repeated here.
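Although the patent does not detail the training procedure, the general shape of decision-tree boosting can be illustrated with a minimal sketch. The single numeric feature, the stump learner, and the squared-loss residual fitting below are all illustrative assumptions, not the patent's algorithm:

```python
def fit_stump(x, residuals):
    """Fit the best single-split decision stump on one numeric feature."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if xi <= t else rm)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def fit_gbdt(x, y, rounds=3, rho=0.5):
    """Gradient boosting with squared loss: each round fits the residuals."""
    learners = []
    pred = [0.0] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        learners.append(stump)
        pred = [pi + rho * stump(xi) for pi, xi in zip(pred, x)]
    # final model: weighted combination of the learners
    return lambda xi: sum(rho * s(xi) for s in learners)
```

With each round, the combined model moves closer to the target quality scores, which is the "progressively growing combination of trees" idea used later in the description.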
In another possible implementation manner, in the embodiment of the present application, a new speech signal and an old speech signal may be fused, then feature parameters of the new speech signal and the old speech signal are extracted, or other feature parameters related to speech quality are obtained, and finally a new speech quality evaluation model is obtained according to the feature parameters.
It should be noted that the old speech signal is the speech signal received before the speech signal to be evaluated.
Specifically, the characteristic parameters in the embodiment of the application include at least one of the following parameters: signal to noise ratio, background noise, noise level, asymmetric interference value of average speech signal spectrum, high frequency flatness analysis, spectrum level range, spectrum level standard deviation, relative noise floor, skewness coefficient of linear prediction coefficient, cepstrum skewness coefficient, voiced sound, average section of back cavity, channel amplitude variation, speech level.
Because of the relatively large number of speech quality evaluation parameters, some parameters with relatively high weight values are usually selected as characteristic parameters during training.
In the embodiment of the present application, the voice quality evaluation parameters may further include: average speech signal interference value, global background noise, speech break time, level dip, silence length, pitch period, mechanization, correlation of back and middle cavities, correlation of consecutive frames, average power of consecutive frames, energy sum of repeated frames, number of frames of unnatural beeps, average energy of samples of unnatural beeps, proportion of samples of unnatural beeps, absolute value of cepstrum standard deviation, cepstrum kurtosis coefficient, kurtosis coefficient of linear prediction coefficient, absolute value of bias coefficient of linear prediction coefficient, fixed noise weighting, spectral sharpness, average energy level of samples of background noise, average energy of samples of background noise, multiplicative noise signal to noise ratio, total energy of unnatural silence frames, etc.
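As an illustration of how a few of the listed parameters might be computed, here is a sketch using NumPy. The patent does not give exact definitions, so these formulas (and the `basic_features` name) are common textbook approximations, not the patent's:

```python
import numpy as np

def basic_features(signal, noise, eps=1e-12):
    """Illustrative versions of a few listed parameters (assumed definitions)."""
    # signal-to-noise ratio in dB from mean powers
    snr_db = 10 * np.log10((np.mean(signal**2) + eps) / (np.mean(noise**2) + eps))
    # magnitude spectrum and its level in dB
    spectrum = np.abs(np.fft.rfft(signal))
    level_db = 20 * np.log10(spectrum + eps)
    return {
        "snr": snr_db,
        "spectrum_level_range": level_db.max() - level_db.min(),
        "spectrum_level_std": level_db.std(),
        "speech_level": 10 * np.log10(np.mean(signal**2) + eps),
    }
```

A pure sine tone against a constant-amplitude noise floor, for example, yields an SNR of about 17 dB for amplitudes 1.0 and 0.1.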
Further, after the voice signal to be evaluated is acquired, the method further includes:
the speech signal to be evaluated is subjected to at least one of the following preprocessing: voice data validity detection, voice data normalization processing and default value difference fitting.
Specifically, the original voice signal contains a large amount of incomplete, inconsistent, and abnormal data, which seriously affects the efficiency of later modeling and may even bias the model results. In addition, the values of the data themselves affect the model results, so the original speech signal should be cleaned first. It is often necessary to handle missing data, anomalies, redundancy, scaling, and the like.
The data processing methods mainly comprise data validity detection, data normalization, default-value interpolation fitting, and the like, but are not limited to these.
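The three preprocessing steps just listed might look like the following sketch. The validity range, the linear fill for missing values, and peak normalization are illustrative choices, since the patent does not specify the concrete rules:

```python
def preprocess(samples):
    """Sketch of validity detection, default-value interpolation, normalization."""
    # 1. validity detection: treat out-of-range or missing readings as invalid
    cleaned = [s if s is not None and -32768 <= s <= 32767 else None
               for s in samples]
    # 2. default-value interpolation fitting: fill missing points from the
    #    nearest valid neighbours (simple linear fill)
    for i, s in enumerate(cleaned):
        if s is None:
            prev = next((cleaned[j] for j in range(i - 1, -1, -1)
                         if cleaned[j] is not None), None)
            nxt = next((cleaned[j] for j in range(i + 1, len(cleaned))
                        if cleaned[j] is not None), None)
            cleaned[i] = (prev if nxt is None else
                          nxt if prev is None else (prev + nxt) / 2)
    # 3. normalization: scale to [-1, 1] by the peak magnitude
    peak = max(abs(s) for s in cleaned) or 1
    return [s / peak for s in cleaned]
```

Out-of-range samples are first marked invalid, then filled, so the normalization step never sees corrupted values.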
Still further, after acquiring the speech signal to be evaluated, the method further comprises:
evaluating the voice signal to be evaluated according to the first voice quality evaluation model to obtain the voice quality of the voice signal to be evaluated; and classifying the voice quality of the voice signal to be evaluated to obtain voice quality at different interval grades.
The different interval grades of voice quality are used to characterize different classes of voice quality.
There are various options for the classification algorithm in the speech assessment model; for example, the gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) algorithm may be used.
Specifically, in the embodiment of the present application, the quality of the voice signal may be classified by using a decision tree algorithm, as shown in fig. 2.
In fig. 2, the feature identifiers (1), (2), etc. represent the identification information of the characteristic parameters of the speech signal. A decision tree algorithm can be regarded as a prediction model, and can also be understood as a classification tree. The application uses a decision tree to map the classification of voice quality.
The classification of speech quality is mapped by a decision tree, and the decision tree can be iterated multiple times to form a progressively growing combination of trees that optimizes the mapping performance, as shown in fig. 3. In fig. 3, each learner predictively scores the speech signal to obtain the predicted speech quality.
In fig. 3, θ represents the weights and φ represents the mapping functions of the different learners.
It should be noted that fig. 2 and 3 are merely exemplary, and the specific form and content thereof is not limited to those shown in the drawings. For example, the fractional set of speech quality is not limited to classification by 0-5 scores.
It will be appreciated that the decision tree may be obtained by machine learning, etc., and embodiments of the present application are not limited in this regard.
As can be seen from the boosting algorithm in fig. 3, the final predictive scoring result of the speech signal is a combination of the b learner speech quality results:
it will be appreciated that in the above formulaCorresponding to phi in the figure.
The above formula can be obtained by optimization in function space:

F_b(x) = F_{b-1}(x) + \rho\,\theta_b \phi_b(x)

where ρ represents the learning rate.
According to the above formula, the training value of each voice sample can be obtained in turn:

\hat{y}_i = F(x_i) = \sum_{b=1}^{B} \theta_b \phi_b(x_i)

From the above it can be seen that the voice quality scores fall into different score intervals, e.g., [0,1], [1,2], etc., and different score intervals correspond to different voice quality categories.
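As an illustration of this weighted combination and interval mapping, the following minimal Python sketch combines hypothetical learner outputs into a final score and maps it to a score interval; the learners, weights, and 0-5 interval edges are illustrative assumptions, not values from the patent:

```python
import numpy as np

# Hypothetical learners phi_b and weights theta_b; the final score is
# their weighted combination, as in the boosting combination of learners.
def combine_learners(x, learners, weights):
    return sum(w * phi(x) for w, phi in zip(weights, learners))

def score_to_interval(score, edges=(0, 1, 2, 3, 4, 5)):
    # Map a continuous quality score to its score interval, e.g. 1.5 -> (1, 2)
    idx = int(np.clip(np.searchsorted(edges, score, side="right") - 1,
                      0, len(edges) - 2))
    return (edges[idx], edges[idx + 1])
```

Each interval then acts as one voice quality category.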
Preferably, the identification information of the voice signal to be evaluated is different from the stored identification information of the voice signal, and may include:
the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal and/or the characteristic parameters of the speech signal to be evaluated are different from the characteristic parameters of the already stored speech signal.
Specifically, the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal, and may include:
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at the same interval level, and the difference between the two is smaller than a set threshold (for example, a second preset threshold); or the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at different interval levels.
Optionally, in the embodiment of the present application, the speech signal to be evaluated may be evaluated by using a built-in speech quality evaluation model (a first speech quality evaluation model), so as to determine whether the speech signal to be evaluated is a new speech signal.
Specifically, in the embodiment of the application, voice data is obtained from the outside, the built-in evaluation model is used to classify and score the voice quality of the newly obtained voice data, and it is then judged whether the newly obtained voice signal belongs to a new sample. If the difference between voice data of different classifications is not large, or the score of voice data in the same classification differs too much from the score of the old voice data, that part of the voice data can be used as a new sample.
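The judgment rule of claim 5 (same interval level with a score gap below a second preset threshold, or different interval levels) can be sketched as follows; the function name and the default threshold of 0.5 are hypothetical placeholders, not values from the patent:

```python
def is_new_sample(new_score, new_level, old_score, old_level, threshold=0.5):
    """Judge whether the evaluated signal counts as a new speech signal.

    Different interval levels always count as new; within the same
    interval level, a score gap below the preset threshold counts as new.
    """
    if new_level != old_level:
        return True
    return abs(new_score - old_score) < threshold
```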
Specifically, the method for judging that the identification information of the voice signal to be evaluated is different from the stored identification information of the voice signal by using the characteristic parameters can include, but is not limited to, the following methods:
(1) Based on the unified normal distribution detection:
the original data set is x_{i,1}, x_{i,2}, x_{i,3}, …, x_{i,n}, i ∈ (1, …, m), containing m samples with n-dimensional features. The mean and variance of each feature dimension can be calculated as:

\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{i,j},\qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_{i,j}-\mu_j)^2

the probability of the new data can then be calculated as:

p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\Big(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\Big)

the difference between the feature distribution of the new data and that of the old data can be judged according to this probability.
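A minimal numpy sketch of this per-feature detection, treating each feature dimension as an independent one-dimensional Gaussian (function names are illustrative):

```python
import numpy as np

def fit_per_feature_gaussian(X):
    # X: (m, n) original data set; return per-dimension mean and variance
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.var(axis=0)

def per_feature_density(x, mu, var):
    # Product of independent 1-D Gaussian densities, one per feature;
    # a low value suggests the new data differs from the old distribution
    x, mu, var = map(np.asarray, (x, mu, var))
    return float(np.prod(np.exp(-(x - mu) ** 2 / (2 * var))
                         / np.sqrt(2 * np.pi * var)))
```

The decision threshold on the probability is left to the caller.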
(2) Based on multivariate gaussian distribution detection:
the original data set contains m samples with n-dimensional feature vectors; an n-dimensional mean vector μ and an n×n covariance matrix Σ can be calculated:

\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{i,j},\qquad \Sigma = [\mathrm{Cov}(x_i, x_j)],\ i, j \in (1, …, n)

the probability of the new data can then be calculated as:

p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\!\Big(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big)

the difference between the feature distribution of the new data and that of the old data can be judged according to this probability, where T denotes the transpose of a matrix.
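The multivariate density can be sketched directly in numpy as follows (in practice scipy.stats.multivariate_normal serves the same purpose); the function name is illustrative:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    # Density of an n-dimensional Gaussian with mean mu and covariance cov
    x, mu, cov = map(np.asarray, (x, mu, cov))
    n = mu.size
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))
    return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff))
```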
(3) Based on mahalanobis distance detection:
for a multidimensional data set with mean vector \bar{a}, the Mahalanobis distance from new data a to \bar{a} is:

D(a) = \sqrt{(a-\bar{a})^{T} S^{-1} (a-\bar{a})}

where T denotes the transpose of a matrix and S is the covariance matrix; if the distance value is too large, the feature distributions are considered different.
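A minimal numpy sketch of the Mahalanobis distance test; the decision threshold on the distance is left to the caller:

```python
import numpy as np

def mahalanobis_distance(a, mean_vec, cov):
    # sqrt((a - mean)^T S^{-1} (a - mean)), with S the covariance matrix
    a, mean_vec, cov = map(np.asarray, (a, mean_vec, cov))
    diff = a - mean_vec
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

With an identity covariance this reduces to the Euclidean distance.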
(4) Feature importance based detection:
the importance ranking of features can be derived using tree-based models, such as GBDT.
The global importance of feature j is measured by the average of its importance in the individual trees:

\hat{J}_j = \frac{1}{M}\sum_{m=1}^{M}\hat{J}_j(T_m)

where M is the number of trees.
The importance of feature j in a single tree T is:

\hat{J}_j(T) = \sum_{t=1}^{L-1}\hat{i}_t^{2}\,\mathbb{1}(v_t = j)

where L is the number of leaf nodes of the tree and L-1 is the number of non-leaf nodes; v_t is the feature associated with node t, and \hat{i}_t^{2} is the squared-loss reduction value after splitting node t. J represents the feature set and T represents the set of trees.
For the top k features obtained from training on the new samples, if they differ substantially from the top features of the original data set, the new data are considered to differ from the original data distribution.
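One way to compare the top-k features, assuming per-tree importances are already available (e.g., from a trained GBDT); the averaging and overlap measure below are a sketch, not the patent's exact procedure:

```python
import numpy as np

def global_importance(per_tree_importance):
    # Average each feature's importance over the M trees
    return np.asarray(per_tree_importance, dtype=float).mean(axis=0)

def topk_overlap(imp_old, imp_new, k=3):
    # Fraction of shared features among the top-k of both rankings;
    # a low overlap suggests the new data differs from the original set
    top_old = set(np.argsort(imp_old)[::-1][:k].tolist())
    top_new = set(np.argsort(imp_new)[::-1][:k].tolist())
    return len(top_old & top_new) / k
```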
In one possible implementation of the embodiment of the present application, the speech data may be learned incrementally through the method flow shown in fig. 4, so as to update the built-in speech quality evaluation model.
It will be appreciated that the normal scoring in fig. 4 is scoring by a built-in speech quality assessment model.
The overall method flow in the embodiment of the application can refer to the flow chart shown in fig. 5. In the method, an external voice signal is obtained and preprocessed; the voice signal quality is then classified using a decision tree algorithm to obtain a quality score; it is then judged whether the voice sample data matches the characteristics of a new sample; and when the voice signal is a new sample, after a certain number of new samples has been collected, the built-in voice quality evaluation model is updated and normal scoring is performed with the updated model.
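This overall loop can be sketched as follows; the class and method names are illustrative, and the retraining step is a placeholder for the GBDT update described above:

```python
class IncrementalEvaluator:
    """Hypothetical sketch of the fig. 5 flow.

    Score each signal with the built-in (first) model, buffer signals
    judged to be new samples, and retrain once the buffer exceeds the
    first preset threshold, yielding the second model.
    """

    def __init__(self, model, first_threshold=100):
        self.model = model                    # built-in evaluation model
        self.first_threshold = first_threshold
        self.new_samples = []

    def evaluate(self, features, is_new_sample):
        score = self.model(features)          # normal scoring
        if is_new_sample:
            self.new_samples.append(features)
            if len(self.new_samples) > self.first_threshold:
                self.model = self._retrain(self.new_samples)
                self.new_samples.clear()
        return score

    def _retrain(self, samples):
        # Placeholder for GBDT retraining on the buffered new samples
        return self.model
```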
Based on the same concept as the above-mentioned method embodiment, the embodiment of the present application further provides a block diagram of a voice quality evaluation device, referring to fig. 6, where the device includes: an acquisition unit 101, a determination unit 102, an update unit 103, and an evaluation unit 104.
Wherein, the obtaining unit 101 is configured to obtain a voice signal to be evaluated.
A determining unit 102 configured to determine the identification information of the speech signal to be evaluated acquired by the acquiring unit 101, and when determining that the speech quality of the speech signal to be evaluated is different from the speech quality of the speech signal already stored, take the speech signal to be evaluated as a new speech signal.
And an updating unit 103, configured to update the first speech quality evaluation model to obtain a second speech quality evaluation model when the number of new speech signals determined by the determining unit 102 is greater than the first preset threshold.
Wherein the saved voice signal is a voice signal acquired before the voice signal to be evaluated.
And an evaluation unit 104 for evaluating the speech signal to be evaluated using the second speech quality evaluation model obtained by the updating unit 103.
Specifically, the updating unit 103 is specifically configured to update the first speech quality evaluation model to obtain the second speech quality evaluation model in the following manner:
acquiring characteristic parameters of a new voice signal; and training the characteristic parameters by utilizing a decision tree algorithm, and updating the first voice quality evaluation model to obtain a second voice quality evaluation model.
Wherein the characteristic parameters include at least one of: signal to noise ratio, background noise, noise level, asymmetric interference value of average speech signal spectrum, high frequency flatness analysis, spectrum level range, spectrum level standard deviation, relative noise floor, skewness coefficient of linear prediction coefficient, cepstrum skewness coefficient, voiced sound, average section of back cavity, channel amplitude variation, speech level.
Correspondingly, the device further comprises: the processing unit 105 is configured to:
the speech signal to be evaluated is subjected to at least one of the following preprocessing operations: voice data validity detection, voice data normalization processing, and default-value interpolation fitting.
Still further, the evaluation unit 104 is further configured to:
and evaluating the voice signal to be evaluated according to the first voice quality evaluation model to obtain the voice quality of the voice signal to be evaluated.
The processing unit 105 is further configured to:
classifying the voice quality of the voice signal to be evaluated to obtain voice quality of different interval levels; the voice quality of different interval levels is used to characterize different classes of voice quality.
Optionally, the identification information of the speech signal to be evaluated is different from the identification information of the speech signal already stored, including:
the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal and/or the characteristic parameters of the speech signal to be evaluated are different from the characteristic parameters of the already stored speech signal.
Further, the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal, comprising:
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at the same interval level, and the difference between the two is smaller than a second preset threshold; or the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at different interval levels.
It should be noted that, in the foregoing embodiment of the present application, the functional implementation of each unit in the speech quality evaluation apparatus may further refer to the description of the related method embodiment, which is not repeated herein.
The embodiment of the application also provides another device for evaluating voice quality, as shown in fig. 7, the device comprises:
memory 202 for storing program instructions.
A transceiver 201 for receiving and transmitting an evaluation instruction of voice quality.
The processor 200 is configured to call the program instructions stored in the memory and, according to the instructions received by the transceiver 201, execute the methods performed by the acquisition unit (101), the determination unit (102), the updating unit (103), the evaluation unit (104), and the processing unit (105) in the embodiments of the present application.
In fig. 7, the bus architecture may comprise any number of interconnected buses and bridges, linking together one or more processors represented by processor 200 and various circuits of memory represented by memory 202. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface provides an interface.
Transceiver 201 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium.
The processor 200 is responsible for managing the bus architecture and general processing, and the memory 202 may store data used by the processor 200 in performing operations.
The processor 200 may be a Central Processing Unit (CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a Field programmable gate array (Field-Programmable Gate Array, FPGA), or a complex programmable logic device (Complex Programmable Logic Device, CPLD).
The embodiment of the present application also provides a computer storage medium, which is used for storing computer program instructions for any one of the devices described in the embodiment of the present application, and the computer storage medium includes a program for executing any one of the methods provided in the embodiment of the present application.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, nonvolatile storage (NAND FLASH), solid State Disk (SSD)), etc.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (12)
1. A method for evaluating speech quality, comprising:
acquiring a voice signal to be evaluated, and determining identification information of the voice signal to be evaluated;
if the identification information of the voice signal to be evaluated is different from the stored identification information of the voice signal, the voice signal to be evaluated is used as a new voice signal, and when the number of the new voice signals is larger than a first preset threshold value, a first voice quality evaluation model is updated to obtain a second voice quality evaluation model;
wherein the saved voice signal is a voice signal acquired before the voice signal to be evaluated;
evaluating the voice signal to be evaluated by using the second voice quality evaluation model;
the identification information of the voice signal to be evaluated is different from the stored identification information of the voice signal, and the method comprises the following steps:
the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal and/or the characteristic parameters of the speech signal to be evaluated are different from the characteristic parameters of the already stored speech signal.
2. The method of claim 1, wherein updating the first speech quality assessment model to obtain a second speech quality assessment model comprises:
acquiring characteristic parameters of the new voice signal;
training the characteristic parameters by utilizing a decision tree algorithm, and updating the first voice quality evaluation model to obtain a second voice quality evaluation model;
the characteristic parameters include at least one of: signal to noise ratio, background noise, noise level, asymmetric interference value of average speech signal spectrum, high frequency flatness analysis, spectrum level range, spectrum level standard deviation, relative noise floor, skewness coefficient of linear prediction coefficient, cepstrum skewness coefficient, voiced sound, average section of back cavity, channel amplitude variation, speech level.
3. The method of claim 1, wherein after obtaining the speech signal to be evaluated, the method further comprises:
and carrying out at least one of the following preprocessing operations on the voice signal to be evaluated: voice data validity detection, voice data normalization processing, and default-value interpolation fitting.
4. The method of claim 1, wherein after obtaining the speech signal to be evaluated, the method further comprises:
evaluating the voice signal to be evaluated according to the first voice quality evaluation model to obtain the voice quality of the voice signal to be evaluated;
classifying the voice quality of the voice signal to be evaluated to obtain voice quality of different interval levels; the voice quality of different interval levels is used to characterize different classes of voice quality.
5. The method of claim 1, wherein the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal, comprising:
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at the same interval level, and the difference between the two is smaller than a second preset threshold; or
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at different interval levels.
6. An apparatus for evaluating speech quality, comprising:
an acquisition unit configured to acquire a speech signal to be evaluated;
a determining unit, configured to determine identification information of the speech signal to be evaluated, and when determining that the speech quality of the speech signal to be evaluated is different from the speech quality of the speech signal that has been stored, take the speech signal to be evaluated as a new speech signal;
the updating unit is used for updating the first voice quality evaluation model when the number of the new voice signals is larger than a first preset threshold value to obtain a second voice quality evaluation model;
wherein the saved speech signal is a speech signal that has been evaluated using the first speech quality evaluation model before the speech signal to be evaluated;
the evaluation unit is used for evaluating the voice signal to be evaluated by utilizing the second voice quality evaluation model;
the identification information of the voice signal to be evaluated is different from the stored identification information of the voice signal, and the method comprises the following steps:
the speech quality of the speech signal to be evaluated is different from the speech quality of the already stored speech signal and/or the characteristic parameters of the speech signal to be evaluated are different from the characteristic parameters of the already stored speech signal.
7. The apparatus of claim 6, wherein the updating unit is specifically configured to update the first speech quality assessment model to obtain a second speech quality assessment model as follows:
acquiring characteristic parameters of the new voice signal;
training the characteristic parameters by utilizing a decision tree algorithm, and updating the first voice quality evaluation model to obtain a second voice quality evaluation model;
the characteristic parameters include at least one of: signal to noise ratio, background noise, noise level, asymmetric interference value of average speech signal spectrum, high frequency flatness analysis, spectrum level range, spectrum level standard deviation, relative noise floor, skewness coefficient of linear prediction coefficient, cepstrum skewness coefficient, voiced sound, average section of back cavity, channel amplitude variation, speech level.
8. The apparatus of claim 6, further comprising a processing unit to:
and carrying out at least one of the following preprocessing operations on the voice signal to be evaluated: voice data validity detection, voice data normalization processing, and default-value interpolation fitting.
9. The apparatus of claim 6, wherein the evaluation unit is further to:
evaluating the voice signal to be evaluated according to the first voice quality evaluation model to obtain the voice quality of the voice signal to be evaluated;
the apparatus further comprises a processing unit for:
classifying the voice quality of the voice signal to be evaluated to obtain voice quality of different interval levels; the voice quality of different interval levels is used to characterize different classes of voice quality.
10. The apparatus of claim 6, wherein the speech signal to be evaluated has a speech quality that is different from the speech quality of the already stored speech signal, comprising:
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at the same interval level, and the difference between the two is smaller than a second preset threshold; or
the voice quality of the voice signal to be evaluated and the voice quality of the stored voice signal are at different interval levels.
11. An apparatus for evaluating speech quality, comprising:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the method according to the obtained program.
12. A computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811544623.XA CN111326169B (en) | 2018-12-17 | 2018-12-17 | Voice quality evaluation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111326169A CN111326169A (en) | 2020-06-23 |
CN111326169B true CN111326169B (en) | 2023-11-10 |
Family
ID=71172436
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811544623.XA Active CN111326169B (en) | 2018-12-17 | 2018-12-17 | Voice quality evaluation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111326169B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111816207B (en) * | 2020-08-31 | 2021-01-26 | 广州汽车集团股份有限公司 | Sound analysis method, sound analysis system, automobile and storage medium |
CN112632841A (en) * | 2020-12-22 | 2021-04-09 | 交通运输部科学研究院 | Road surface long-term performance prediction method and device |
CN112634946B (en) * | 2020-12-25 | 2022-04-12 | 博瑞得科技有限公司 | Voice quality classification prediction method, computer equipment and storage medium |
CN112885377A (en) * | 2021-02-26 | 2021-06-01 | 平安普惠企业管理有限公司 | Voice quality evaluation method and device, computer equipment and storage medium |
CN113393863B (en) * | 2021-06-10 | 2023-11-03 | 北京字跳网络技术有限公司 | Voice evaluation method, device and equipment |
CN113838168B (en) * | 2021-10-13 | 2023-10-03 | 亿览在线网络技术(北京)有限公司 | Particle special effect animation generation method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101740024A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Method for automatic evaluation based on generalized fluent spoken language fluency |
WO2017041553A1 (en) * | 2015-09-07 | 2017-03-16 | 中兴通讯股份有限公司 | Method and apparatus for determining voice quality |
CN106558308A (en) * | 2016-12-02 | 2017-04-05 | 深圳撒哈拉数据科技有限公司 | A kind of internet audio quality of data auto-scoring system and method |
CN107895582A (en) * | 2017-10-16 | 2018-04-10 | 中国电子科技集团公司第二十八研究所 | Towards the speaker adaptation speech-emotion recognition method in multi-source information field |
CN108346434A (en) * | 2017-01-24 | 2018-07-31 | 中国移动通信集团安徽有限公司 | A kind of method and apparatus of speech quality evaluation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2944640A1 (en) * | 2009-04-17 | 2010-10-22 | France Telecom | METHOD AND DEVICE FOR OBJECTIVE EVALUATION OF THE VOICE QUALITY OF A SPEECH SIGNAL TAKING INTO ACCOUNT THE CLASSIFICATION OF THE BACKGROUND NOISE CONTAINED IN THE SIGNAL. |
FR2973923A1 (en) * | 2011-04-11 | 2012-10-12 | France Telecom | EVALUATION OF THE VOICE QUALITY OF A CODE SPEECH SIGNAL |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |