CN115512718A - Voice quality evaluation method, device and system for stock voice file - Google Patents


Publication number
CN115512718A
Authority
CN
China
Prior art keywords
voice
target
quality evaluation
signal
octave
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211115342.9A
Other languages
Chinese (zh)
Inventor
张向东
沈苏
刘宏坤
向拔尖
武泽东
Current Assignee
Zhongke Yousheng Suzhou Technology Co ltd
Original Assignee
Zhongke Yousheng Suzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Yousheng Suzhou Technology Co ltd filed Critical Zhongke Yousheng Suzhou Technology Co ltd
Priority to CN202211115342.9A
Publication of CN115512718A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 — Speech or voice analysis techniques for measuring the quality of voice signals

Abstract

The application discloses a voice quality evaluation method, device and system for stock voice files. The evaluation method comprises: receiving a target voice signal, where the target voice signal comprises a target stock voice file played by a target sound-producing device; calculating a corresponding voice quality evaluation result in real time according to the target voice signal; and displaying the voice quality evaluation result on a front end. The method can evaluate the voice quality of any stock voice file, so that the speech intelligibility of the file during playback is tested and displayed in real time, and the recording environment and recording parameters can be further optimised and adjusted through real-time evaluation of the recording quality of stock voice files.

Description

Voice quality evaluation method, device and system for stock voice file
Technical Field
The application relates to the technical field of acoustic measurement, in particular to a voice quality evaluation method, device and system for stock voice files.
Background
The recording quality of a stock voice file is generally evaluated by its speech intelligibility during playback: when the intelligibility during playback is low, the recording quality of the audio is low, and the audio may even need to be re-recorded.
In the prior art, a professional usually evaluates the intelligibility of played stock voice files manually, so the evaluation result has low accuracy and no uniform measurement standard, and the quality of stock voice files is uneven. Alternatively, a professional must measure with special playback equipment, for example playing a special standard modulated test signal through an artificial mouth or a talkbox, or acquiring a room impulse response with professional equipment and performing complex calculations. This is time-consuming and expensive, cannot be performed by ordinary users, cannot produce results in real time, and is therefore severely limited.
Therefore, a measurement method is needed that can acquire the speech intelligibility of a stock voice file accurately, conveniently and in a timely manner.
Disclosure of Invention
The application aims to provide a voice quality evaluation method, device and system for stock voice files, which can conveniently and accurately evaluate and display the voice quality of the stock voice files in real time.
In order to achieve the purpose of the application, the application provides the following technical scheme:
In a first aspect, a voice quality evaluation method for stock voice files is provided, the evaluation method comprising the following steps:
receiving a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device;
calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and performing front-end display on the voice quality evaluation result.
In a preferred embodiment, the obtaining of the corresponding speech quality evaluation result by real-time computing according to the target speech signal includes:
performing feature extraction on the target voice signal to obtain at least one set of target voice features;
and taking the at least one set of target voice features as input, and obtaining a corresponding voice quality evaluation result through a pre-trained voice quality model.
In a preferred embodiment, the voice quality evaluation result includes, but is not limited to, one of a speech transmission index (STI) or a mean opinion score (MOS).
In a preferred embodiment, when the voice quality evaluation result includes a voice transmission index, the obtaining of the corresponding voice quality evaluation result by real-time calculation according to the target voice signal includes:
respectively performing feature extraction on the target voice signal according to p different octave filtering signal bands to obtain p corresponding sets of target features, where p ≥ 2;
and obtaining a voice quality evaluation result of a first target position corresponding to the target voice signal based on the p sets of target features.
In a preferred embodiment, the performing feature extraction on the target speech signal according to p different octave filtering signal bands to obtain p corresponding sets of target features respectively includes:
respectively filtering the target voice signal to obtain p sets of different octave filtering signal bands, where any set of octave filtering signal bands includes n modulation frequencies f_m, n ≥ 1, m ≥ 1;
Respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics;
taking an octave filtering signal band corresponding to any group of sub-band envelope characteristics as input, and respectively obtaining p groups of reverberation time T through a pre-trained voice quality model corresponding to the octave filtering signal band;
and respectively obtaining p groups of corresponding target characteristics based on the p groups of reverberation time T, wherein the target characteristics are modulation transfer function values.
In a preferred embodiment of the present invention, the respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics comprises:
and respectively carrying out half-wave envelope detection on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics.
In a preferred embodiment, the taking the octave filtering signal band corresponding to any set of sub-band envelope features as input and obtaining p sets of reverberation times T through the pre-trained voice quality model corresponding to that octave filtering signal band includes:
dividing any octave filtering signal band into N continuous voice segments of equal duration, where N ≥ 2;
performing feature extraction on any voice segment in the N voice segments included in any octave filtering signal band through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], and obtaining the corresponding N voice segment features;
performing interaction on any one of the obtained N voice segment features through a combined structure of one or more of a long short-term memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain the corresponding voice segment interaction features;
performing prediction on the obtained N voice segment interaction features respectively through a linear regression layer or a classification layer to obtain N reverberation times T_N corresponding to the N voice segment interaction features;
and averaging the N reverberation times T_N of any octave filtering signal band to obtain the p sets of reverberation times T corresponding to the octave filtering signal bands respectively.
In a preferred embodiment of the present invention, the obtaining p groups of target features respectively based on the p groups of reverberation times T includes:
obtaining, based on the value of any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} at any modulation frequency f_m of any set of octave filtering signal bands.
In a preferred embodiment, the obtaining, based on the p sets of target features, a voice quality evaluation result of the first target position corresponding to the target voice signal includes:
obtaining, based on any modulation transfer function value m_{k,f_m}, the effective signal-to-noise ratio SNR_{eff,k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k;
obtaining, based on any effective signal-to-noise ratio SNR_{eff,k,f_m}, the transmission index TI_{k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k;
averaging the n transmission indexes TI_{k,f_m} of any octave filtering signal band k to obtain the modulation transfer index M_k of the corresponding octave filtering signal band k;
and calculating, based on the modulation transfer indexes M_k of the p octave filtering signal bands, the voice quality evaluation result of the first target position corresponding to the target voice signal.
In a preferred embodiment, the method further includes training p speech quality models corresponding to the p different octave filter signal bands in advance, respectively, including:
obtaining p corresponding sets of different octave filtering signal band sample sets based on any stock voice file in a stock voice file sample set, where any set of octave filtering signal band samples includes q modulation frequency samples and q corresponding impulse response samples, and any impulse response sample includes a reverberation time sample T_0, q ≥ 2;
and taking the q modulation frequency samples as input and the q corresponding reverberation time samples T_0 as output, training the p speech quality models corresponding to the p octave filtering signal bands respectively based on a neural network.
In a preferred embodiment, the voice quality evaluation result is displayed on a front end, and the display manner includes but is not limited to:
displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic mobile-signal icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic wifi icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic dashboard;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic level bar.
In a second aspect, a voice quality evaluation apparatus for stock voice files is provided, the apparatus comprising:
a receiving module, configured to receive a target voice signal, the target voice signal comprising a target stock voice file played by a target sound-producing device;
The processing module is used for calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and the display module is used for carrying out front-end display on the voice quality evaluation result.
In a third aspect, a voice quality evaluation system for stock voice files is provided, the evaluation system comprising:
the voice receiving device is used for receiving a target voice signal, and the target voice signal comprises a target stock voice file played by target sound-producing equipment;
the display device is used for carrying out front-end display on the voice quality evaluation result;
the intelligent device is used for receiving the target voice signal sent by the at least one voice receiving device, calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result, and sending the voice quality evaluation result to the at least one display device for front-end display.
In a fourth aspect, an electronic device is provided, comprising:
one or more processors; and
memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the operations of any of the first aspects.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any of the first aspects.
Compared with the prior art, the method has the following beneficial effects:
the application provides a voice quality evaluation method, a device and a system for stock voice files, wherein the evaluation method comprises the steps of receiving a target voice signal, and the target voice signal comprises a target stock voice file played by a target sound production device; calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result; performing front-end display on the voice quality evaluation result; the voice quality evaluation method for the stock voice files can realize voice quality evaluation on any stock voice files so as to test and obtain real-time display of the language intelligibility of the stock voice files during playing, and can further realize optimization and adjustment of the recording environment and parameters of the voice files through real-time evaluation on the recording quality of the stock voice files.
Drawings
FIG. 1 is an illustration of the STI score;
FIG. 2 is a flowchart of a voice quality evaluation method for stock voice files in the present embodiment;
FIG. 3 is a diagram illustrating envelope boundaries obtained by envelope extraction in the present embodiment;
FIG. 4 is a circuit diagram of a half-wave envelope detection circuit in the present embodiment;
FIG. 5 is a schematic diagram of a neural network architecture;
FIGS. 6a to 6d are exemplary display contents when the voice quality evaluation result is displayed on an interface;
fig. 7 is a system architecture diagram of a voice quality evaluation system for stock voice files.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
Aiming at the current situation that measurement of the speech intelligibility of existing stock voice files is difficult to perform and feed back in real time, this embodiment provides a voice quality evaluation method with real-time feedback and accurate measurement results. The method, apparatus, and system for evaluating the speech quality of stock voice files are described in detail below with reference to specific embodiments.
Examples
As shown in fig. 2, the present embodiment provides a speech quality evaluation method for a stock speech file, which is suitable for evaluation of the language intelligibility of the stock speech file.
Specifically, the voice quality evaluation method for stock voice files in this embodiment includes the following steps:
S1, receiving a target voice signal. The target voice signal includes a target stock voice file played by a target sound-producing device. The present embodiment does not limit the target sound-producing device.
And S2, calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result.
Generally, the step S2 includes the following steps:
S21, performing feature extraction on the target voice signal to obtain at least one set of target voice features;
and S22, taking the at least one set of target voice features as input, and obtaining a corresponding voice quality evaluation result through a pre-trained voice quality model. It should be noted that the voice quality evaluation result includes, but is not limited to, one of a Speech Transmission Index (STI) or a Mean Opinion Score (MOS).
For convenience of description, the voice quality evaluation result in this embodiment is exemplified by the speech transmission index STI, but is not limited thereto. In general, the STI characterises the speech transmission quality of a transmission channel by sending a specific test signal through the channel, analysing the received signal, and expressing the result as a score between 0 and 1 (see fig. 1).
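Fig. 1 itself is only an image in the original. As a hedged illustration of the 0-to-1 scale, STI scores are customarily banded into qualification categories; the thresholds below are the customary ones, not values taken from this patent:

```python
def sti_category(sti: float) -> str:
    """Map an STI score in [0, 1] to a customary intelligibility band
    (thresholds are the conventional ones, assumed here for illustration)."""
    if not 0.0 <= sti <= 1.0:
        raise ValueError("STI must lie in [0, 1]")
    if sti < 0.30:
        return "bad"
    if sti < 0.45:
        return "poor"
    if sti < 0.60:
        return "fair"
    if sti < 0.75:
        return "good"
    return "excellent"
```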
Step S21 specifically comprises: performing feature extraction on the target voice signal according to p different octave filtering signal bands k to obtain p corresponding sets of target features, where p ≥ 2 and 1 ≤ k ≤ p.
Further, step S21 includes:
S21a, respectively filtering the target voice signal to obtain p sets of different octave filtering signal bands k, where any set of octave filtering signal bands k includes n modulation frequencies f_m, n ≥ 1, m ≥ 1.
It should be noted that human speech is usually divided into seven frequency bands; therefore, in this embodiment, p = 7, i.e. 1 ≤ k ≤ 7, is preferred. Thus, octave filtering signal bands k with center frequencies f_c of 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz and 8 kHz respectively are obtained by filtering the target voice signal. The upper limit frequency f_u and the lower limit frequency f_l of each octave filtering signal band k are given by equations (1) and (2) respectively:
f_u = √2 · f_c    (1)

f_l = f_c / √2    (2)
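Equations (1) and (2) can be evaluated directly for the seven center frequencies; a minimal sketch (function and variable names are ours):

```python
import math

# The seven octave-band center frequencies used in this embodiment (Hz).
CENTER_FREQS_HZ = [125, 250, 500, 1000, 2000, 4000, 8000]

def octave_band_edges(f_c: float) -> tuple[float, float]:
    """Lower and upper edge frequencies of the octave band centered at f_c,
    per equations (1) and (2): f_l = f_c / sqrt(2), f_u = sqrt(2) * f_c."""
    return f_c / math.sqrt(2.0), f_c * math.sqrt(2.0)

edges = {f_c: octave_band_edges(f_c) for f_c in CENTER_FREQS_HZ}
```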
and S21b, respectively carrying out envelope extraction on p groups of different octave filtering signal bands k to obtain p groups of sub-band envelope characteristics, wherein the envelope extraction result is an envelope boundary shown in figure 3.
In this embodiment, the envelope extraction algorithm is not limited; preferably, half-wave envelope detection is performed on the p sets of different octave filtering signal bands k respectively to obtain the p sets of sub-band envelope features (as shown in fig. 4), expressed as difference equations as shown in equations (3) and (4):
(Equations (3) and (4) appear only as images in the original document; they give the difference equations of the half-wave envelope detector of fig. 4.)
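A half-wave envelope detector can be sketched in code. This is one generic realisation (half-wave rectification followed by a first-order low-pass smoother); the smoothing constant alpha and the function name are our assumptions, not values from the patent:

```python
import numpy as np

def half_wave_envelope(x: np.ndarray, alpha: float = 0.99) -> np.ndarray:
    """Half-wave rectify the signal, then smooth it with a one-pole
    low-pass filter: env[i] = alpha * env[i-1] + (1 - alpha) * max(x[i], 0)."""
    rectified = np.maximum(x, 0.0)  # half-wave rectification
    env = np.empty_like(rectified)
    prev = 0.0
    for i, v in enumerate(rectified):
        prev = alpha * prev + (1.0 - alpha) * v  # first-order smoothing
        env[i] = prev
    return env
```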
And S21c, taking the octave filtering signal band k corresponding to any set of sub-band envelope features as input, and obtaining the p sets of reverberation times T respectively through the pre-trained voice quality model corresponding to that octave filtering signal band k.
Any voice quality model sequentially comprises a data preprocessing module, a feature extraction module, a time interaction module and a prediction module. Specifically, the method comprises the following steps:
a data preprocessing module: any octave filtering signal band k is divided into continuous N voice segments (such as voice slices in figure 5) by taking a preset time length (such as x seconds) as a unit, wherein N is more than or equal to 2.
A feature extraction module: any voice segment in the N voice segments included in any octave filtering signal band k is subjected to feature extraction through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], giving the corresponding N voice segment features, such as the voice slice features in fig. 5.
A time interaction module: any one of the obtained N voice segment features is processed through a combined structure of one or more of a long short-term memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain the corresponding voice segment interaction features, such as the slice interaction features in fig. 5.
A prediction module: based on the obtained N voice segment interaction features, prediction is performed through a linear regression layer or a classification layer, etc., to obtain N reverberation times T_N corresponding to the N voice segment interaction features, such as the slice reverberation times T_N in fig. 5. The N values T_N are averaged to obtain the reverberation time T of the corresponding octave filtering signal band k. The averaging method includes, but is not limited to, any one of a simple averaging method, a weighted averaging method, or a harmonic averaging method.
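The averaging step above (simple, weighted, or harmonic mean of the N per-segment reverberation times) can be sketched as follows; the function name is ours:

```python
import numpy as np

def average_reverb_times(t_n: np.ndarray, method: str = "simple",
                         weights=None) -> float:
    """Collapse N per-segment reverberation-time estimates into a single
    band reverberation time using one of the three averaging options."""
    if method == "simple":
        return float(np.mean(t_n))
    if method == "weighted":
        if weights is None:
            raise ValueError("weighted averaging needs weights")
        return float(np.average(t_n, weights=weights))
    if method == "harmonic":
        return float(len(t_n) / np.sum(1.0 / t_n))
    raise ValueError(f"unknown method: {method}")
```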
To this end, before step S21c, the evaluation method further includes: Sa, pre-training the p speech quality models corresponding to the p different octave filtering signal bands k respectively, comprising:
sa1, obtaining corresponding p groups of different octave filtering signal band sample sets based on any one stock voice file in the stock voice file sample sets, wherein any one group of octave filtering signal band sample sets comprises q modulation frequency samples and corresponding q impulse response samples, and any impulse response sample comprises a reverberation time sample T 0 ,q≥2;
Sa2, q modulation frequency samples as input, q corresponding reverberation time samples T 0 For output, p speech quality models corresponding to p octave filtering signal bands k are obtained through training respectively based on a neural network.
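For illustration only, a trivial least-squares regressor can stand in for the per-band neural network of steps Sa1-Sa2 to show the input/output mapping (modulation frequencies in, reverberation times out); the data below are synthetic and all names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set for one octave band: q modulation-frequency samples
# paired with reverberation-time samples T0 (both invented for illustration).
q = 50
X = rng.uniform(0.63, 12.5, size=(q, 1))            # modulation frequencies (Hz)
t0 = 0.8 + 0.05 * X[:, 0] + rng.normal(0, 0.01, q)  # reverberation times (s)

# Fit T0 ~ w * f_m + b by ordinary least squares (toy stand-in for the model).
A = np.hstack([X, np.ones((q, 1))])
w, b = np.linalg.lstsq(A, t0, rcond=None)[0]

def predict_reverb(f_m: float) -> float:
    """Predict a reverberation time for a modulation frequency (toy model)."""
    return w * f_m + b
```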
S21d, respectively obtaining the p corresponding sets of target features based on the p sets of reverberation times T, where the target features are modulation transfer function values.
The Modulation Transfer Function (MTF) describes the extent to which the modulation m is transmitted from the target object (the acoustic source) to the receiving transducer, as a function m_{k,f_m} of the modulation frequency f_m; the MTF determines the degree of modulation reduction of the target voice signal. Specifically, the modulation frequency f_m ranges from 0.63 Hz to 12.5 Hz. The MTF value therefore depends on the environmental characteristics of the system and on the background noise. The MTF is calculated as shown in equation (5):
m_{k,f_m} = 1 / sqrt(1 + (2π f_m T / 13.8)^2)    (5)
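A numeric sketch of the reverberation-only MTF relation between reverberation time and modulation reduction (the patent's equation (5) is an image in the original, so this reconstruction is ours and omits any additive-noise term; the function name is an assumption):

```python
import math

def mtf_from_reverb(f_m: float, t_rev: float) -> float:
    """Modulation transfer function value for modulation frequency f_m (Hz)
    and reverberation time t_rev (s):
    m = 1 / sqrt(1 + (2*pi*f_m*t_rev / 13.8)**2)."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_m * t_rev / 13.8) ** 2)
```

Note that with no reverberation (t_rev = 0) the modulation is fully preserved (m = 1), and m falls as either the modulation frequency or the reverberation time grows.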
step S22 specifically includes: and obtaining a voice quality evaluation result of the first target position corresponding to the target voice signal based on the p groups of target characteristics.
Specifically, the step S22 includes:
S221, based on any modulation transfer function value m_{k,f_m}, obtaining the effective signal-to-noise ratio SNR_{eff,k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k. Specifically, the effective signal-to-noise ratio SNR_{eff,k,f_m} is calculated as shown in equation (6):

SNR_{eff,k,f_m} = 10 · lg( m_{k,f_m} / (1 − m_{k,f_m}) )    (6)

S222, based on any effective signal-to-noise ratio SNR_{eff,k,f_m}, obtaining the transmission index TI_{k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k. Specifically, the transmission index TI_{k,f_m} is calculated as shown in equation (7):

TI_{k,f_m} = (SNR_{eff,k,f_m} + 15) / 30    (7)

S223, averaging the n transmission indexes TI_{k,f_m} of any octave filtering signal band k to obtain the modulation transfer index M_k of the corresponding octave filtering signal band k. The effective signal-to-noise ratio is limited to the range of −15 dB to +15 dB. Specifically, the modulation transfer index M_k is calculated as shown in equation (8):

M_k = (1/n) · Σ_{m=1..n} TI_{k,f_m}    (8)

S224, based on the modulation transfer indexes M_k of the p octave filtering signal bands, calculating the voice quality evaluation result (STI) of the first target position corresponding to the target voice signal. Specifically, the STI is calculated as shown in equation (9):

STI = Σ_{k=1..7} α_k · M_k − Σ_{k=1..6} β_k · √(M_k · M_{k+1})    (9)
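The chain from MTF values to STI (equations (6) to (9)) can be sketched end to end. The weight and redundancy factors passed in below are illustrative placeholders, since the real male-voice values appear only in Table 1 (an image in the original); the function names are ours:

```python
import math

def band_mti(mtf_values):
    """Equations (6)-(8): per modulation frequency, compute the effective SNR,
    clip it to [-15, +15] dB, map it to a transmission index
    TI = (SNR_eff + 15) / 30, then average over the band."""
    tis = []
    for m in mtf_values:
        snr_eff = 10.0 * math.log10(m / (1.0 - m))  # equation (6)
        snr_eff = max(-15.0, min(15.0, snr_eff))    # limit to [-15, +15] dB
        tis.append((snr_eff + 15.0) / 30.0)         # equation (7)
    return sum(tis) / len(tis)                      # equation (8)

def sti(mti, alpha, beta):
    """Equation (9): STI = sum(alpha_k * M_k) - sum(beta_k * sqrt(M_k * M_{k+1}))
    over the seven octave bands (alpha, beta are placeholder weights here)."""
    s = sum(a * m for a, m in zip(alpha, mti))
    s -= sum(b * math.sqrt(mti[k] * mti[k + 1]) for k, b in enumerate(beta))
    return s
```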
where α_k denotes the gender-specific weight factor of octave filtering signal band k;
β_k denotes the gender-specific redundancy factor between octave filtering signal band k and octave filtering signal band k+1;
and M_k is the modulation transfer index of octave filtering signal band k.
It should be noted that the STI method can distinguish between male and female voice signals, but in practice only the male voice is used to evaluate the voice transmission path, in order to simplify the measurement process. Table 1 gives the male-voice STI weight factor α and redundancy factor β as a function of the octave band.
TABLE 1
(Table 1 appears only as an image in the original document; it lists the male-voice weight factor α and redundancy factor β for each of the seven octave bands.)
And S3, performing front-end display on the voice quality evaluation result obtained in the step S2.
As shown in figs. 6a to 6d, the interface display manner adopted for front-end display of the voice quality evaluation result includes, but is not limited to:
displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic mobile-signal icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic wifi icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic dashboard;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic level bar.
In summary, the voice quality evaluation method for stock voice files provided in this embodiment can evaluate the voice quality of any stock voice file. Compared with prior-art methods that rely on professional equipment and standardised procedures, it offers greater universality, convenience and real-time feedback.
Further, when calculating the voice quality evaluation result, the method builds a separate voice quality model for each octave band to obtain its reverberation time, which gives the method strong robustness and reproducibility.
Corresponding to the above voice quality evaluation method, this embodiment further provides a voice quality evaluation apparatus implemented as functional modules. The voice quality evaluation apparatus includes:
the receiving module is used for receiving a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound production device;
the processing module is used for calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
the display module is used for carrying out front-end display on the voice quality evaluation result;
and the model training module is used for respectively training p voice quality models corresponding to the p different octave filtering signal bands in advance.
The processing module includes:
a feature extraction unit, configured to respectively perform feature extraction on the target voice signal according to p different octave filtering signal bands to obtain p corresponding sets of target features, where p ≥ 2;
and an evaluation unit, configured to obtain, based on the p sets of target features, a voice quality evaluation result of the first target position corresponding to the target voice signal.
Further, the feature extraction unit specifically includes:
a first processing subunit, configured to respectively filter the target voice signal to obtain p sets of different octave filtering signal bands, where any set of octave filtering signal bands includes n modulation frequencies f_m, n ≥ 1, m ≥ 1;
The second processing subunit is used for respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics;
a third processing subunit, configured to take an octave filtering signal band corresponding to any one group of the subband envelope features as an input, and obtain p groups of reverberation times T through a pre-trained speech quality model corresponding to the octave filtering signal band;
and the fourth processing subunit is configured to obtain p corresponding sets of target features based on the p sets of reverberation times T, where the target features are modulation transfer function values.
The second processing subunit is specifically configured to perform half-wave envelope detection on the p groups of different octave filtering signal bands respectively to obtain p groups of sub-band envelope characteristics.
The third processing subunit is specifically configured to:
dividing any octave filtering signal band into N continuous voice segments of equal duration, where N ≥ 2;
performing feature extraction on any voice segment in the N voice segments included in any octave filtering signal band through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], and obtaining the corresponding N voice segment features;
any one of the obtained N voice segment characteristics is interacted through one or more combined structures of a long-time and short-time memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain corresponding voice segment interaction characteristics;
based on the obtained N voice segment interactive characteristics, respectively predicting through a linear regression layer or a classification layer to obtain N reverberation times T respectively corresponding to the N voice segment interactive characteristics N
For any octaveN reverberation times T of a filtered signal band N Averaging is performed to obtain p sets of reverberation times T corresponding to the octave filtered signal bands, respectively.
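The segment-and-average step of the third processing subunit can be sketched as below. `predict_t60` is a hypothetical stand-in for the pre-trained per-band voice quality model, supplied as a callable so the pipeline shape is visible without the network itself:

```python
import numpy as np

def average_reverb_time(band_signal, n_segments, predict_t60):
    """Split a band-filtered signal into N contiguous, (near-)equal segments,
    predict a reverberation time T for each segment with the supplied model,
    and return the mean T for the band."""
    segments = np.array_split(band_signal, n_segments)
    t_values = [predict_t60(seg) for seg in segments]
    return float(np.mean(t_values))

# Hypothetical stand-in for the trained model: maps a segment to a T value.
fake_model = lambda seg: 0.5 + 0.1 * float(np.std(seg) > 1.0)
```

Running this once per octave filtering signal band yields the p groups of reverberation times T.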
The fourth processing subunit is specifically configured to obtain, based on the value of any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} of any modulation frequency f_m of any group of octave filtering signal bands.
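A common way to obtain an MTF value from a reverberation time, consistent with the role of the fourth processing subunit, is Schroeder's relation for a purely exponential, noise-free decay: m(f_m) = 1 / sqrt(1 + (2π · f_m · T / 13.8)²). Whether the patent uses exactly this mapping is not stated, so treat the sketch as an assumption:

```python
import math

def mtf_from_t60(f_m, t60):
    """Modulation transfer function value at modulation frequency f_m (Hz)
    for reverberation time T60 (s), assuming purely exponential decay and
    no additive noise (Schroeder's classical relation)."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_m * t60 / 13.8) ** 2)
```

As expected, m approaches 1 for very short reverberation times and decreases as T grows, i.e. stronger reverberation attenuates the modulation depth.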
Further, the evaluation unit specifically includes:
a fifth processing subunit, configured to obtain, based on any modulation transfer function value m_{k,f_m}, the effective signal-to-noise ratio SNR_{eff,k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
a sixth processing subunit, configured to obtain, based on any effective signal-to-noise ratio SNR_{eff,k,f_m}, the transmission index TI_{k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
a seventh processing subunit, configured to calculate, from the n transmission indexes TI_{k,f_m} of any octave filtering signal band k, the modulation transfer index M_k of the corresponding octave filtering signal band k;
and an eighth processing subunit, configured to calculate, based on the modulation transfer indexes M_k of the p octave filtering signal bands, the voice quality evaluation result of the first target position corresponding to the target voice signal.
The eighth processing subunit specifically performs the calculation shown in the following formula (9) to obtain the voice quality evaluation result of the first target position corresponding to the target voice signal:
STI = Σ(k=1..p) α_k · M_k - Σ(k=1..p-1) β_k · √(M_k · M_{k+1})    (9)
wherein α_k represents the gender-specific weighting factor of octave filtering signal band k;
β_k represents the gender-specific redundancy factor between octave filtering signal band k and octave filtering signal band k+1;
M_k refers to the modulation transfer index of octave filtering signal band k.
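The chain from MTF values to the final result, as described by the fifth to eighth processing subunits, can be sketched with the conventional STI definitions: SNR_eff = 10·log10(m/(1-m)) clipped to ±15 dB, TI = (SNR_eff + 15)/30, M_k as the per-band mean of the TIs, then formula (9). The α_k/β_k values below are placeholders, not the patent's (or any standard's) actual factors:

```python
import numpy as np

def snr_eff(m):
    """Effective SNR (dB) from MTF values, clipped to the usual +/-15 dB."""
    m = np.clip(m, 1e-6, 1 - 1e-6)             # guard against m = 0 or m = 1
    return np.clip(10.0 * np.log10(m / (1.0 - m)), -15.0, 15.0)

def transmission_index(snr):
    """Map effective SNR in [-15, 15] dB linearly onto TI in [0, 1]."""
    return (snr + 15.0) / 30.0

def sti(mtf_matrix, alpha, beta):
    """mtf_matrix: shape [p, n] of MTF values m_{k,f_m} (p bands, n modulation
    frequencies). alpha (length p) and beta (length p-1) are placeholder
    weighting and redundancy factors."""
    ti = transmission_index(snr_eff(np.asarray(mtf_matrix)))
    m_k = ti.mean(axis=1)                      # modulation transfer index per band
    return float(np.dot(alpha, m_k) - np.dot(beta, np.sqrt(m_k[:-1] * m_k[1:])))
```

With all MTF values at 0.5 the effective SNR is 0 dB, every TI is 0.5, and with uniform weights and zero redundancy factors the result is 0.5, which matches the linear mapping above.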
The model training module is specifically configured to:
obtaining p corresponding groups of different octave filtering signal band sample sets based on any one stock voice file in a stock voice file sample set, wherein any group of octave filtering signal band sample sets comprises q modulation frequency samples and q corresponding impulse response samples, and any impulse response sample comprises a reverberation time sample T_0, with q ≥ 2;
taking the q modulation frequency samples as input and the q corresponding reverberation time samples T_0 as output, and training, based on a neural network, to obtain the p voice quality models respectively corresponding to the p octave filtering signal bands.
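The training pairing above (q modulation-frequency samples in, q reverberation-time samples T_0 out, one model per band) can be illustrated as follows. Ordinary least squares replaces the neural network purely to keep the sketch self-contained and runnable, and all sample values are hypothetical:

```python
import numpy as np

def train_band_model(mod_freqs, t0_samples):
    """Fit a stand-in regressor mapping modulation-frequency samples to
    reverberation-time samples T_0 for one octave band (q >= 2 samples).
    The patent trains a neural network here; a linear fit on [f, 1]
    features only illustrates the input/output pairing."""
    x = np.column_stack([np.asarray(mod_freqs, float),
                         np.ones(len(mod_freqs))])
    coef, _, _, _ = np.linalg.lstsq(x, np.asarray(t0_samples, float), rcond=None)
    return lambda f: float(coef[0] * f + coef[1])

# Hypothetical samples for one band; one such model is trained per band (p models).
models = [train_band_model([0.63, 1.25, 2.5], [0.8, 0.7, 0.5]) for _ in range(7)]
```

Each trained callable plays the role of the per-band voice quality model that later predicts T from a filtered signal band.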
When the display module displays the voice quality evaluation result at the front end, the adopted display modes include, but are not limited to: displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamically moving signal; or displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic wifi-style signal icon; or displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic dashboard; or displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic progress bar.
It should be noted that the voice quality evaluation device for stock voice files in the foregoing embodiment is illustrated only with the above division of functional modules when performing a voice quality evaluation service; in practical applications, the above functions may be distributed among different functional modules as needed, that is, the internal structure of the system may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment and the voice quality evaluation method embodiment provided above belong to the same concept, that is, the device is based on the method; its specific implementation process is described in detail in the method embodiments and is not repeated here.
In addition, as shown in fig. 7, this embodiment also provides a voice quality evaluation system for stock voice files, the evaluation system comprising:
at least one voice receiving device, configured to receive a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device; preferably, the voice receiving device is a voice sensor;
at least one display device, configured to display the voice quality evaluation result at the front end;
and an intelligent device, configured to receive the target voice signal sent by the at least one voice receiving device, perform real-time calculation on the target voice signal according to a local voice quality evaluation method to obtain a corresponding voice quality evaluation result, and send the voice quality evaluation result to the at least one display device for front-end display.
Also, the present embodiment provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, for storing program instructions which, when read and executed by the one or more processors, perform the operations of any one of the foregoing voice quality evaluation methods for stock voice files.
The specific execution details and corresponding beneficial effects of the voice quality evaluation method executed through the program instructions are consistent with the foregoing description of the method and are not repeated here.
In addition, this embodiment also provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements any one of the foregoing voice quality evaluation methods for stock voice files.
All of the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, that is, any number of embodiments can be combined to meet the requirements of different application scenarios; these combinations fall within the protection scope of the present application and are not described here again.
It should be understood that the above-mentioned embodiments are merely preferred embodiments of the present application and are not intended to limit the present application, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A speech quality evaluation method for stock speech files, the evaluation method comprising:
receiving a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device;
calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and performing front-end display on the voice quality evaluation result.
2. The method for evaluating according to claim 1, wherein said obtaining a corresponding speech quality evaluation result by real-time calculation based on the target speech signal comprises:
performing feature extraction on the target voice signal to obtain at least one group of target features;
and taking the at least one group of target features as input, and obtaining a corresponding voice quality evaluation result through a pre-trained voice quality model.
3. The method of claim 2, wherein the voice quality evaluation result includes, but is not limited to, one of a voice transmission index or a mean opinion score.
4. The method for evaluating according to claim 2, wherein when the voice quality evaluation result includes a voice transmission index, the calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result comprises:
respectively extracting features of the target voice signal according to p different octave filtering signal bands to obtain corresponding p groups of target features, wherein p is more than or equal to 2;
and obtaining a voice quality evaluation result of the first target position corresponding to the target voice signal based on the p groups of target features.
5. The evaluation method according to claim 4, wherein said performing feature extraction on the target speech signal according to p different octave filtering signal bands to obtain p corresponding sets of target features respectively comprises:
respectively filtering the target voice signal to obtain p groups of different octave filtering signal bands, wherein any group of octave filtering signal bands comprises n modulation frequencies f_m, with n ≥ 1 and m ≥ 1;
Respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics;
taking an octave filtering signal band corresponding to any group of sub-band envelope characteristics as input, and respectively obtaining p groups of reverberation time T through a pre-trained voice quality model corresponding to the octave filtering signal band;
and respectively obtaining p groups of corresponding target characteristics based on the p groups of reverberation time T, wherein the target characteristics are modulation transfer function values.
6. The evaluation method according to claim 5, wherein said separately performing envelope extraction on p different sets of octave filtered signal bands to obtain p sets of subband envelope features comprises:
and respectively carrying out half-wave envelope detection on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics.
7. The evaluation method of claim 5, wherein the taking an octave filtering signal band corresponding to any group of the sub-band envelope characteristics as input and respectively obtaining the p groups of reverberation times T through the pre-trained voice quality model corresponding to that octave filtering signal band comprises:
dividing any octave filtering signal band into N continuous voice segments with equal duration, wherein N is more than or equal to 2;
performing feature extraction on any voice segment of the N voice segments included in any octave filtering signal band through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], thereby obtaining N corresponding voice segment features;
interacting any one of the obtained N voice segment features through a combined structure of one or more of a long short-term memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain corresponding voice segment interaction features;
predicting, based on the obtained N voice segment interaction features, through a linear regression layer or a classification layer respectively, to obtain N reverberation times T_N respectively corresponding to the N voice segment interaction features;
averaging the N reverberation times T_N corresponding to any octave filtering signal band to obtain the p groups of reverberation times T respectively corresponding to the p octave filtering signal bands.
8. The evaluation method according to claim 5, wherein the obtaining of the corresponding p sets of target features based on the p sets of reverberation times T respectively comprises:
obtaining, based on the value of any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} of any modulation frequency f_m of any group of octave filtering signal bands.
9. The evaluation method according to claim 8,
the obtaining, based on the p groups of target features, the voice quality evaluation result of the first target position corresponding to the target voice signal comprises:
based on any of said modulation transfer function values m_{k,f_m}, obtaining the effective signal-to-noise ratio SNR_{eff,k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
based on any of said effective signal-to-noise ratios SNR_{eff,k,f_m}, obtaining the transmission index TI_{k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
calculating, from the n transmission indexes TI_{k,f_m} of any octave filtering signal band k, the modulation transfer index M_k of the corresponding octave filtering signal band k;
and calculating, based on the modulation transfer indexes M_k of the p octave filtering signal bands, the voice quality evaluation result of the first target position corresponding to the target voice signal.
10. The evaluation method according to any one of claims 2 to 9, wherein the evaluation method further comprises training p speech quality models corresponding to the p different octave filtered signal bands, respectively, in advance, comprising:
obtaining p corresponding groups of different octave filtering signal band sample sets based on any stock voice file in a stock voice file sample set, wherein any group of octave filtering signal band sample sets comprises q modulation frequency samples and q corresponding impulse response samples, and any impulse response sample comprises a reverberation time sample T_0, with q ≥ 2;
taking the q modulation frequency samples as input and the q corresponding reverberation time samples T_0 as output, and training, based on a neural network, to obtain the p voice quality models respectively corresponding to the p octave filtering signal bands.
11. The evaluation method according to claim 1, wherein the manner of displaying the voice quality evaluation result at the front end includes, but is not limited to:
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamically moving signal; or
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic wifi-style signal icon; or
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic dashboard; or
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic progress bar.
12. A speech quality evaluation apparatus for a stock speech file, characterized by comprising:
the receiving module is used for receiving a target voice signal, and the target voice signal comprises a target stock voice file played by a target sound-producing device;
the processing module is used for calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and the display module is used for carrying out front-end display on the voice quality evaluation result.
13. A speech quality evaluation system for stock speech files, the evaluation system comprising:
at least one voice receiving device, configured to receive a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device;
the display device is used for carrying out front-end display on the voice quality evaluation result;
and an intelligent device, configured to receive the target voice signal sent by the at least one voice receiving device, perform real-time calculation on the target voice signal according to the method of any one of claims 1 to 11 to obtain a corresponding voice quality evaluation result, and send the voice quality evaluation result to the at least one display device for front-end display.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations according to any one of claims 1 to 11; and
a display associated with the one or more processors for displaying in real-time speech quality assessment results obtained after execution of the program instructions by the one or more processors.
15. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 11.
CN202211115342.9A 2022-09-14 2022-09-14 Voice quality evaluation method, device and system for stock voice file Pending CN115512718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211115342.9A CN115512718A (en) 2022-09-14 2022-09-14 Voice quality evaluation method, device and system for stock voice file

Publications (1)

Publication Number Publication Date
CN115512718A true CN115512718A (en) 2022-12-23

Family

ID=84504340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211115342.9A Pending CN115512718A (en) 2022-09-14 2022-09-14 Voice quality evaluation method, device and system for stock voice file

Country Status (1)

Country Link
CN (1) CN115512718A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116564351A (en) * 2023-04-03 2023-08-08 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment
CN116564351B (en) * 2023-04-03 2024-01-23 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination