CN115512718A - Voice quality evaluation method, device and system for stock voice file - Google Patents


Publication number
CN115512718A
Authority
CN
China
Prior art keywords
voice
target
quality evaluation
signal
octave
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211115342.9A
Other languages
Chinese (zh)
Inventor
张向东
沈苏
刘宏坤
向拔尖
武泽东
Current Assignee
Zhongke Yousheng Suzhou Technology Co ltd
Original Assignee
Zhongke Yousheng Suzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Yousheng Suzhou Technology Co ltd filed Critical Zhongke Yousheng Suzhou Technology Co ltd
Priority to CN202211115342.9A
Publication of CN115512718A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 — Speech or voice analysis techniques for measuring the quality of voice signals

Abstract

The application discloses a voice quality evaluation method, device and system for stock voice files. The evaluation method comprises: receiving a target voice signal, where the target voice signal comprises a target stock voice file played by a target sound-producing device; calculating a corresponding voice quality evaluation result in real time according to the target voice signal; and displaying the voice quality evaluation result on a front end. The method can evaluate the voice quality of any stock voice file, so that the speech intelligibility of the file during playback is tested and displayed in real time, and the recording environment and recording parameters can be further optimised and adjusted through real-time evaluation of the recording quality of stock voice files.

Description

Voice quality evaluation method, device and system for stock voice file
Technical Field
The application relates to the technical field of acoustic measurement, in particular to a voice quality evaluation method, device and system for stock voice files.
Background
The recording quality of a stock voice file is generally evaluated by its speech intelligibility during playback: when the intelligibility during playback is low, the recording quality of the audio is low, and the audio may even need to be re-recorded.
In the prior art, a professional usually evaluates the intelligibility of played stock voice files manually, so the evaluation result has low accuracy and no uniform measurement standard, and the quality of stock voice files is uneven. Alternatively, a professional must measure with special playback equipment, for example playing a special standard modulated test signal through an artificial mouth or a talkbox, or acquiring a room impulse response with professional equipment and performing complex calculations. This is time-consuming and expensive, cannot be performed by ordinary users, cannot produce results in real time, and is therefore severely limited.
Therefore, a measurement method is needed that can acquire the speech intelligibility of a stock voice file accurately, conveniently and in a timely manner.
Disclosure of Invention
The application aims to provide a voice quality evaluation method, device and system for stock voice files, which can conveniently and accurately evaluate and display the voice quality of the stock voice files in real time.
In order to achieve the purpose of the application, the application provides the following technical scheme:
In a first aspect, a voice quality evaluation method for stock voice files is provided, the evaluation method comprising the following steps:
receiving a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device;
calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and performing front-end display on the voice quality evaluation result.
In a preferred embodiment, the obtaining of the corresponding speech quality evaluation result by real-time computing according to the target speech signal includes:
performing feature extraction on the target voice signal to obtain at least one set of target voice features;
and taking the at least one set of target voice features as input, and obtaining a corresponding voice quality evaluation result through a pre-trained voice quality model.
In a preferred embodiment, the voice quality evaluation result includes, but is not limited to, one of a speech transmission index (STI) or a mean opinion score (MOS).
In a preferred embodiment, when the voice quality evaluation result includes a voice transmission index, the obtaining of the corresponding voice quality evaluation result by real-time calculation according to the target voice signal includes:
respectively performing feature extraction on the target voice signal according to p different octave filtering signal bands to obtain p corresponding sets of target features, where p ≥ 2;
and obtaining a voice quality evaluation result of a first target position corresponding to the target voice signal based on the p sets of target features.
In a preferred embodiment, the performing feature extraction on the target speech signal according to p different octave filtering signal bands to obtain p corresponding sets of target features respectively includes:
respectively filtering the target voice signal to obtain p sets of different octave filtering signal bands, where any set of octave filtering signal bands includes n modulation frequencies f_m, n ≥ 1, m ≥ 1;
Respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics;
taking an octave filtering signal band corresponding to any group of sub-band envelope characteristics as input, and respectively obtaining p groups of reverberation time T through a pre-trained voice quality model corresponding to the octave filtering signal band;
and respectively obtaining p groups of corresponding target characteristics based on the p groups of reverberation time T, wherein the target characteristics are modulation transfer function values.
In a preferred embodiment of the present invention, the respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics comprises:
and respectively carrying out half-wave envelope detection on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics.
In a preferred embodiment, the taking the octave filtering signal band corresponding to any set of sub-band envelope features as input and obtaining p sets of reverberation times T through the pre-trained voice quality model corresponding to that octave filtering signal band includes:
dividing any octave filtering signal band into N continuous voice segments of equal duration, where N ≥ 2;
performing feature extraction on any voice segment in the N voice segments included in any octave filtering signal band through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], and obtaining the corresponding N voice segment features;
performing interaction on any one of the obtained N voice segment features through a combined structure of one or more of a long short-term memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain the corresponding voice segment interaction features;
performing prediction on the obtained N voice segment interaction features respectively through a linear regression layer or a classification layer to obtain N reverberation times T_N corresponding to the N voice segment interaction features;
and averaging the N reverberation times T_N of any octave filtering signal band to obtain the p sets of reverberation times T corresponding to the octave filtering signal bands respectively.
In a preferred embodiment of the present invention, the obtaining p groups of target features respectively based on the p groups of reverberation times T includes:
obtaining, based on the value of any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} at any modulation frequency f_m of any set of octave filtering signal bands.
In a preferred embodiment, the obtaining, based on the p sets of target features, a voice quality evaluation result of the first target position corresponding to the target voice signal includes:
obtaining, based on any modulation transfer function value m_{k,f_m}, the effective signal-to-noise ratio SNR_{eff,k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k;
obtaining, based on any effective signal-to-noise ratio SNR_{eff,k,f_m}, the transmission index TI_{k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k;
averaging the n transmission indexes TI_{k,f_m} of any octave filtering signal band k to obtain the modulation transfer index M_k of the corresponding octave filtering signal band k;
and calculating, based on the modulation transfer indexes M_k of the p octave filtering signal bands, the voice quality evaluation result of the first target position corresponding to the target voice signal.
In a preferred embodiment, the method further includes training p speech quality models corresponding to the p different octave filter signal bands in advance, respectively, including:
obtaining p corresponding sets of different octave filtering signal band sample sets based on any stock voice file in a stock voice file sample set, where any set of octave filtering signal band samples includes q modulation frequency samples and q corresponding impulse response samples, and any impulse response sample includes a reverberation time sample T_0, q ≥ 2;
and taking the q modulation frequency samples as input and the q corresponding reverberation time samples T_0 as output, training the p speech quality models corresponding to the p octave filtering signal bands respectively based on a neural network.
In a preferred embodiment, the voice quality evaluation result is displayed on a front end, and the display manner includes but is not limited to:
displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic mobile-signal icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic wifi icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic dashboard;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic level bar.
In a second aspect, a voice quality evaluation apparatus for stock voice files is provided, the apparatus comprising:
a receiving module, configured to receive a target voice signal, the target voice signal comprising a target stock voice file played by a target sound-producing device;
The processing module is used for calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and the display module is used for carrying out front-end display on the voice quality evaluation result.
In a third aspect, a voice quality evaluation system for stock voice files is provided, the evaluation system comprising:
the voice receiving device is used for receiving a target voice signal, and the target voice signal comprises a target stock voice file played by target sound-producing equipment;
the display device is used for carrying out front-end display on the voice quality evaluation result;
the intelligent device is used for receiving the target voice signal sent by the at least one voice receiving device, calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result, and sending the voice quality evaluation result to the at least one display device for front-end display.
In a fourth aspect, an electronic device is provided, comprising:
one or more processors; and
memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the operations of any of the first aspects.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any of the first aspects.
Compared with the prior art, the method has the following beneficial effects:
the application provides a voice quality evaluation method, a device and a system for stock voice files, wherein the evaluation method comprises the steps of receiving a target voice signal, and the target voice signal comprises a target stock voice file played by a target sound production device; calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result; performing front-end display on the voice quality evaluation result; the voice quality evaluation method for the stock voice files can realize voice quality evaluation on any stock voice files so as to test and obtain real-time display of the language intelligibility of the stock voice files during playing, and can further realize optimization and adjustment of the recording environment and parameters of the voice files through real-time evaluation on the recording quality of the stock voice files.
Drawings
FIG. 1 is an illustration of the STI score;
FIG. 2 is a flowchart of a voice quality evaluation method for stock voice files in the present embodiment;
FIG. 3 is a diagram illustrating envelope boundaries obtained by envelope extraction in the present embodiment;
FIG. 4 is a circuit diagram of a half-wave envelope detection circuit in the present embodiment;
FIG. 5 is a schematic diagram of a neural network architecture;
FIGS. 6a to 6d are exemplary display contents when the voice quality evaluation result is displayed on an interface;
fig. 7 is a system architecture diagram of a voice quality evaluation system for stock voice files.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
Aiming at the current situation that measurement of the speech intelligibility of existing stock voice files is difficult to perform and feed back in real time, this embodiment provides a voice quality evaluation method with real-time feedback and accurate measurement results. The method, apparatus, and system for evaluating the speech quality of stock voice files are described in detail below with reference to specific embodiments.
Examples
As shown in fig. 2, the present embodiment provides a speech quality evaluation method for a stock speech file, which is suitable for evaluation of the language intelligibility of the stock speech file.
Specifically, the voice quality evaluation method for stock voice files in this embodiment includes the following steps:
S1, receiving a target voice signal. The target voice signal includes a target stock voice file played by a target sound-producing device. The present embodiment does not limit the target sound-producing device.
And S2, calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result.
Generally, the step S2 includes the following steps:
S21, performing feature extraction on the target voice signal to obtain at least one set of target voice features;
and S22, taking the at least one set of target voice features as input, and obtaining a corresponding voice quality evaluation result through a pre-trained voice quality model. It should be noted that the voice quality evaluation result includes, but is not limited to, one of a Speech Transmission Index (STI) or a Mean Opinion Score (MOS).
For convenience of description, the voice quality evaluation result in this embodiment is exemplified by the speech transmission index STI, but is not limited thereto. In general, the STI characterises the speech transmission quality of a transmission channel by sending a specific test signal through the channel, analysing the received signal, and expressing the result as a score between 0 and 1 (see fig. 1).
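Fig. 1 itself is only an image in the original. As a hedged illustration of the 0-to-1 scale, STI scores are customarily banded into qualification categories; the thresholds below are the customary ones, not values taken from this patent:

```python
def sti_category(sti: float) -> str:
    """Map an STI score in [0, 1] to a customary intelligibility band
    (thresholds are the conventional ones, assumed here for illustration)."""
    if not 0.0 <= sti <= 1.0:
        raise ValueError("STI must lie in [0, 1]")
    if sti < 0.30:
        return "bad"
    if sti < 0.45:
        return "poor"
    if sti < 0.60:
        return "fair"
    if sti < 0.75:
        return "good"
    return "excellent"
```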
Step S21 specifically comprises: performing feature extraction on the target voice signal according to p different octave filtering signal bands k to obtain p corresponding sets of target features, where p ≥ 2 and 1 ≤ k ≤ p.
Further, step S21 includes:
S21a, respectively filtering the target voice signal to obtain p sets of different octave filtering signal bands k, where any set of octave filtering signal bands k includes n modulation frequencies f_m, n ≥ 1, m ≥ 1.
It should be noted that human speech is usually divided into seven frequency bands; therefore, in this embodiment, p = 7, i.e. 1 ≤ k ≤ 7, is preferred. Thus, octave filtering signal bands k with center frequencies f_c of 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz and 8 kHz respectively are obtained by filtering the target voice signal. The upper limit frequency f_u and the lower limit frequency f_l of each octave filtering signal band k are given by equations (1) and (2) respectively:
f_u = √2 · f_c    (1)

f_l = f_c / √2    (2)
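Equations (1) and (2) can be evaluated directly for the seven center frequencies; a minimal sketch (function and variable names are ours):

```python
import math

# The seven octave-band center frequencies used in this embodiment (Hz).
CENTER_FREQS_HZ = [125, 250, 500, 1000, 2000, 4000, 8000]

def octave_band_edges(f_c: float) -> tuple[float, float]:
    """Lower and upper edge frequencies of the octave band centered at f_c,
    per equations (1) and (2): f_l = f_c / sqrt(2), f_u = sqrt(2) * f_c."""
    return f_c / math.sqrt(2.0), f_c * math.sqrt(2.0)

edges = {f_c: octave_band_edges(f_c) for f_c in CENTER_FREQS_HZ}
```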
and S21b, respectively carrying out envelope extraction on p groups of different octave filtering signal bands k to obtain p groups of sub-band envelope characteristics, wherein the envelope extraction result is an envelope boundary shown in figure 3.
In this embodiment, the envelope extraction algorithm is not limited; preferably, half-wave envelope detection is performed on the p sets of different octave filtering signal bands k respectively to obtain the p sets of sub-band envelope features (as shown in fig. 4), expressed as difference equations as shown in equations (3) and (4):
(Equations (3) and (4) appear only as images in the original document; they give the difference equations of the half-wave envelope detector of fig. 4.)
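A half-wave envelope detector can be sketched in code. This is one generic realisation (half-wave rectification followed by a first-order low-pass smoother); the smoothing constant alpha and the function name are our assumptions, not values from the patent:

```python
import numpy as np

def half_wave_envelope(x: np.ndarray, alpha: float = 0.99) -> np.ndarray:
    """Half-wave rectify the signal, then smooth it with a one-pole
    low-pass filter: env[i] = alpha * env[i-1] + (1 - alpha) * max(x[i], 0)."""
    rectified = np.maximum(x, 0.0)  # half-wave rectification
    env = np.empty_like(rectified)
    prev = 0.0
    for i, v in enumerate(rectified):
        prev = alpha * prev + (1.0 - alpha) * v  # first-order smoothing
        env[i] = prev
    return env
```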
And S21c, taking the octave filtering signal band k corresponding to any set of sub-band envelope features as input, and obtaining the p sets of reverberation times T respectively through the pre-trained voice quality model corresponding to that octave filtering signal band k.
Any voice quality model sequentially comprises a data preprocessing module, a feature extraction module, a time interaction module and a prediction module. Specifically, the method comprises the following steps:
a data preprocessing module: any octave filtering signal band k is divided into continuous N voice segments (such as voice slices in figure 5) by taking a preset time length (such as x seconds) as a unit, wherein N is more than or equal to 2.
A feature extraction module: any voice segment in the N voice segments included in any octave filtering signal band k is subjected to feature extraction through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], giving the corresponding N voice segment features, such as the voice slice features in fig. 5.
A time interaction module: any one of the obtained N voice segment features is processed through a combined structure of one or more of a long short-term memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain the corresponding voice segment interaction features, such as the slice interaction features in fig. 5.
A prediction module: based on the obtained N voice segment interaction features, prediction is performed through a linear regression layer or a classification layer, etc., to obtain N reverberation times T_N corresponding to the N voice segment interaction features, such as the slice reverberation times T_N in fig. 5. The N values T_N are averaged to obtain the reverberation time T of the corresponding octave filtering signal band k. The averaging method includes, but is not limited to, any one of a simple averaging method, a weighted averaging method, or a harmonic averaging method.
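The averaging step above (simple, weighted, or harmonic mean of the N per-segment reverberation times) can be sketched as follows; the function name is ours:

```python
import numpy as np

def average_reverb_times(t_n: np.ndarray, method: str = "simple",
                         weights=None) -> float:
    """Collapse N per-segment reverberation-time estimates into a single
    band reverberation time using one of the three averaging options."""
    if method == "simple":
        return float(np.mean(t_n))
    if method == "weighted":
        if weights is None:
            raise ValueError("weighted averaging needs weights")
        return float(np.average(t_n, weights=weights))
    if method == "harmonic":
        return float(len(t_n) / np.sum(1.0 / t_n))
    raise ValueError(f"unknown method: {method}")
```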
To this end, before step S21c, the evaluation method further includes: Sa, pre-training the p speech quality models corresponding to the p different octave filtering signal bands k respectively, comprising:
sa1, obtaining corresponding p groups of different octave filtering signal band sample sets based on any one stock voice file in the stock voice file sample sets, wherein any one group of octave filtering signal band sample sets comprises q modulation frequency samples and corresponding q impulse response samples, and any impulse response sample comprises a reverberation time sample T 0 ,q≥2;
Sa2, q modulation frequency samples as input, q corresponding reverberation time samples T 0 For output, p speech quality models corresponding to p octave filtering signal bands k are obtained through training respectively based on a neural network.
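For illustration only, a trivial least-squares regressor can stand in for the per-band neural network of steps Sa1-Sa2 to show the input/output mapping (modulation frequencies in, reverberation times out); the data below are synthetic and all names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set for one octave band: q modulation-frequency samples
# paired with reverberation-time samples T0 (both invented for illustration).
q = 50
X = rng.uniform(0.63, 12.5, size=(q, 1))            # modulation frequencies (Hz)
t0 = 0.8 + 0.05 * X[:, 0] + rng.normal(0, 0.01, q)  # reverberation times (s)

# Fit T0 ~ w * f_m + b by ordinary least squares (toy stand-in for the model).
A = np.hstack([X, np.ones((q, 1))])
w, b = np.linalg.lstsq(A, t0, rcond=None)[0]

def predict_reverb(f_m: float) -> float:
    """Predict a reverberation time for a modulation frequency (toy model)."""
    return w * f_m + b
```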
S21d, respectively obtaining the p corresponding sets of target features based on the p sets of reverberation times T, where the target features are modulation transfer function values.
The Modulation Transfer Function (MTF) describes the extent to which the modulation m is transmitted from the target object (the acoustic source) to the receiving transducer, as a function m_{k,f_m} of the modulation frequency f_m; the MTF determines the degree of modulation reduction of the target voice signal. Specifically, the modulation frequency f_m ranges from 0.63 Hz to 12.5 Hz. The MTF value therefore depends on the environmental characteristics of the system and on the background noise. The MTF is calculated as shown in equation (5):
m_{k,f_m} = 1 / sqrt(1 + (2π f_m T / 13.8)^2)    (5)
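A numeric sketch of the reverberation-only MTF relation between reverberation time and modulation reduction (the patent's equation (5) is an image in the original, so this reconstruction is ours and omits any additive-noise term; the function name is an assumption):

```python
import math

def mtf_from_reverb(f_m: float, t_rev: float) -> float:
    """Modulation transfer function value for modulation frequency f_m (Hz)
    and reverberation time t_rev (s):
    m = 1 / sqrt(1 + (2*pi*f_m*t_rev / 13.8)**2)."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_m * t_rev / 13.8) ** 2)
```

Note that with no reverberation (t_rev = 0) the modulation is fully preserved (m = 1), and m falls as either the modulation frequency or the reverberation time grows.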
step S22 specifically includes: and obtaining a voice quality evaluation result of the first target position corresponding to the target voice signal based on the p groups of target characteristics.
Specifically, the step S22 includes:
S221, based on any modulation transfer function value m_{k,f_m}, obtaining the effective signal-to-noise ratio SNR_{eff,k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k. Specifically, the effective signal-to-noise ratio SNR_{eff,k,f_m} is calculated as shown in equation (6):

SNR_{eff,k,f_m} = 10 · lg( m_{k,f_m} / (1 − m_{k,f_m}) )    (6)

S222, based on any effective signal-to-noise ratio SNR_{eff,k,f_m}, obtaining the transmission index TI_{k,f_m} at any modulation frequency f_m of the corresponding octave filtering signal band k. Specifically, the transmission index TI_{k,f_m} is calculated as shown in equation (7):

TI_{k,f_m} = (SNR_{eff,k,f_m} + 15) / 30    (7)

S223, averaging the n transmission indexes TI_{k,f_m} of any octave filtering signal band k to obtain the modulation transfer index M_k of the corresponding octave filtering signal band k. The effective signal-to-noise ratio is limited to the range of −15 dB to +15 dB. Specifically, the modulation transfer index M_k is calculated as shown in equation (8):

M_k = (1/n) · Σ_{m=1..n} TI_{k,f_m}    (8)

S224, based on the modulation transfer indexes M_k of the p octave filtering signal bands, calculating the voice quality evaluation result (STI) of the first target position corresponding to the target voice signal. Specifically, the STI is calculated as shown in equation (9):

STI = Σ_{k=1..7} α_k · M_k − Σ_{k=1..6} β_k · √(M_k · M_{k+1})    (9)
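The chain from MTF values to STI (equations (6) to (9)) can be sketched end to end. The weight and redundancy factors passed in below are illustrative placeholders, since the real male-voice values appear only in Table 1 (an image in the original); the function names are ours:

```python
import math

def band_mti(mtf_values):
    """Equations (6)-(8): per modulation frequency, compute the effective SNR,
    clip it to [-15, +15] dB, map it to a transmission index
    TI = (SNR_eff + 15) / 30, then average over the band."""
    tis = []
    for m in mtf_values:
        snr_eff = 10.0 * math.log10(m / (1.0 - m))  # equation (6)
        snr_eff = max(-15.0, min(15.0, snr_eff))    # limit to [-15, +15] dB
        tis.append((snr_eff + 15.0) / 30.0)         # equation (7)
    return sum(tis) / len(tis)                      # equation (8)

def sti(mti, alpha, beta):
    """Equation (9): STI = sum(alpha_k * M_k) - sum(beta_k * sqrt(M_k * M_{k+1}))
    over the seven octave bands (alpha, beta are placeholder weights here)."""
    s = sum(a * m for a, m in zip(alpha, mti))
    s -= sum(b * math.sqrt(mti[k] * mti[k + 1]) for k, b in enumerate(beta))
    return s
```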
where α_k denotes the gender-specific weight factor of octave filtering signal band k;
β_k denotes the gender-specific redundancy factor between octave filtering signal band k and octave filtering signal band k+1;
and M_k is the modulation transfer index of octave filtering signal band k.
It should be noted that the STI method can distinguish between male and female voice signals, but in practice only the male voice is used to evaluate the voice transmission path, in order to simplify the measurement process. Table 1 gives the male-voice STI weight factor α and redundancy factor β as a function of the octave band.
TABLE 1
(Table 1 appears only as an image in the original document; it lists the male-voice weight factor α and redundancy factor β for each of the seven octave bands.)
And S3, performing front-end display on the voice quality evaluation result obtained in the step S2.
As shown in figs. 6a to 6d, the interface display manner adopted for front-end display of the voice quality evaluation result includes, but is not limited to:
displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic mobile-signal icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic wifi icon;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic dashboard;
or displaying the voice quality evaluation result on an interface in the form of a numerical value plus a dynamic level bar.
In summary, the voice quality evaluation method for stock voice files provided in this embodiment can evaluate the voice quality of any stock voice file. Compared with prior-art methods that rely on professional equipment and standardised procedures, it offers greater universality, convenience and real-time feedback.
Further, when calculating the voice quality evaluation result, the method builds a separate voice quality model for each octave band to obtain its reverberation time, which gives the method strong robustness and reproducibility.
Corresponding to the above voice quality evaluation method, this embodiment further provides a voice quality evaluation apparatus implemented as functional modules. The voice quality evaluation apparatus includes:
the receiving module is used for receiving a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound production device;
the processing module is used for calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
the display module is used for carrying out front-end display on the voice quality evaluation result;
and the model training module is used for respectively training p voice quality models corresponding to the p different octave filtering signal bands in advance.
The processing module includes:
a feature extraction unit, configured to respectively perform feature extraction on the target voice signal according to p different octave filtering signal bands to obtain p corresponding sets of target features, where p ≥ 2;
and an evaluation unit, configured to obtain, based on the p sets of target features, a voice quality evaluation result of the first target position corresponding to the target voice signal.
Further, the feature extraction unit specifically includes:
a first processing subunit, configured to respectively filter the target voice signal to obtain p sets of different octave filtering signal bands, where any set of octave filtering signal bands includes n modulation frequencies f_m, n ≥ 1, m ≥ 1;
The second processing subunit is used for respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics;
a third processing subunit, configured to take an octave filtering signal band corresponding to any one group of the subband envelope features as an input, and obtain p groups of reverberation times T through a pre-trained speech quality model corresponding to the octave filtering signal band;
and the fourth processing subunit is configured to obtain p corresponding sets of target features based on the p sets of reverberation times T, where the target features are modulation transfer function values.
The second processing subunit is specifically configured to perform half-wave envelope detection on the p groups of different octave filtering signal bands respectively to obtain p groups of sub-band envelope characteristics.
The third processing subunit is specifically configured to:
dividing any octave filtering signal band into N continuous voice segments of equal duration, where N ≥ 2;
performing feature extraction on any voice segment in the N voice segments included in any octave filtering signal band through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], and obtaining the corresponding N voice segment features;
any one of the obtained N voice segment characteristics is interacted through one or more combined structures of a long-time and short-time memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain corresponding voice segment interaction characteristics;
based on the obtained N voice segment interactive characteristics, respectively predicting through a linear regression layer or a classification layer to obtain N reverberation times T respectively corresponding to the N voice segment interactive characteristics N
For any octaveN reverberation times T of a filtered signal band N Averaging is performed to obtain p sets of reverberation times T corresponding to the octave filtered signal bands, respectively.
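The segment-and-average step of the third processing subunit can be sketched as below. `predict_t60` is a hypothetical stand-in for the pre-trained per-band voice quality model, supplied as a callable so the pipeline shape is visible without the network itself:

```python
import numpy as np

def average_reverb_time(band_signal, n_segments, predict_t60):
    """Split a band-filtered signal into N contiguous, (near-)equal segments,
    predict a reverberation time T for each segment with the supplied model,
    and return the mean T for the band."""
    segments = np.array_split(band_signal, n_segments)
    t_values = [predict_t60(seg) for seg in segments]
    return float(np.mean(t_values))

# Hypothetical stand-in for the trained model: maps a segment to a T value.
fake_model = lambda seg: 0.5 + 0.1 * float(np.std(seg) > 1.0)
```

Running this once per octave filtering signal band yields the p groups of reverberation times T.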
The fourth processing subunit is specifically configured to obtain, based on the value of any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} of any modulation frequency f_m of any group of octave filtering signal bands.
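A common way to obtain an MTF value from a reverberation time, consistent with the role of the fourth processing subunit, is Schroeder's relation for a purely exponential, noise-free decay: m(f_m) = 1 / sqrt(1 + (2π · f_m · T / 13.8)²). Whether the patent uses exactly this mapping is not stated, so treat the sketch as an assumption:

```python
import math

def mtf_from_t60(f_m, t60):
    """Modulation transfer function value at modulation frequency f_m (Hz)
    for reverberation time T60 (s), assuming purely exponential decay and
    no additive noise (Schroeder's classical relation)."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_m * t60 / 13.8) ** 2)
```

As expected, m approaches 1 for very short reverberation times and decreases as T grows, i.e. stronger reverberation attenuates the modulation depth.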
Further, the evaluation unit specifically includes:
a fifth processing subunit, configured to obtain, based on any modulation transfer function value m_{k,f_m}, the effective signal-to-noise ratio SNR_{eff,k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
a sixth processing subunit, configured to obtain, based on any effective signal-to-noise ratio SNR_{eff,k,f_m}, the transmission index TI_{k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
a seventh processing subunit, configured to calculate, from the n transmission indexes TI_{k,f_m} of any octave filtering signal band k, the modulation transfer index M_k of the corresponding octave filtering signal band k;
and an eighth processing subunit, configured to calculate, based on the modulation transfer indexes M_k of the p octave filtering signal bands, the voice quality evaluation result of the first target position corresponding to the target voice signal.
The eighth processing subunit specifically performs the calculation shown in the following formula (9) to obtain the voice quality evaluation result of the first target position corresponding to the target voice signal:
STI = Σ(k=1..p) α_k · M_k - Σ(k=1..p-1) β_k · √(M_k · M_{k+1})    (9)
wherein α_k represents the gender-specific weighting factor of octave filtering signal band k;
β_k represents the gender-specific redundancy factor between octave filtering signal band k and octave filtering signal band k+1;
M_k refers to the modulation transfer index of octave filtering signal band k.
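The chain from MTF values to the final result, as described by the fifth to eighth processing subunits, can be sketched with the conventional STI definitions: SNR_eff = 10·log10(m/(1-m)) clipped to ±15 dB, TI = (SNR_eff + 15)/30, M_k as the per-band mean of the TIs, then formula (9). The α_k/β_k values below are placeholders, not the patent's (or any standard's) actual factors:

```python
import numpy as np

def snr_eff(m):
    """Effective SNR (dB) from MTF values, clipped to the usual +/-15 dB."""
    m = np.clip(m, 1e-6, 1 - 1e-6)             # guard against m = 0 or m = 1
    return np.clip(10.0 * np.log10(m / (1.0 - m)), -15.0, 15.0)

def transmission_index(snr):
    """Map effective SNR in [-15, 15] dB linearly onto TI in [0, 1]."""
    return (snr + 15.0) / 30.0

def sti(mtf_matrix, alpha, beta):
    """mtf_matrix: shape [p, n] of MTF values m_{k,f_m} (p bands, n modulation
    frequencies). alpha (length p) and beta (length p-1) are placeholder
    weighting and redundancy factors."""
    ti = transmission_index(snr_eff(np.asarray(mtf_matrix)))
    m_k = ti.mean(axis=1)                      # modulation transfer index per band
    return float(np.dot(alpha, m_k) - np.dot(beta, np.sqrt(m_k[:-1] * m_k[1:])))
```

With all MTF values at 0.5 the effective SNR is 0 dB, every TI is 0.5, and with uniform weights and zero redundancy factors the result is 0.5, which matches the linear mapping above.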
The model training module is specifically configured to:
obtaining p corresponding groups of different octave filtering signal band sample sets based on any one stock voice file in a stock voice file sample set, wherein any group of octave filtering signal band sample sets comprises q modulation frequency samples and q corresponding impulse response samples, and any impulse response sample comprises a reverberation time sample T_0, with q ≥ 2;
taking the q modulation frequency samples as input and the q corresponding reverberation time samples T_0 as output, and training, based on a neural network, to obtain the p voice quality models respectively corresponding to the p octave filtering signal bands.
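The training pairing above (q modulation-frequency samples in, q reverberation-time samples T_0 out, one model per band) can be illustrated as follows. Ordinary least squares replaces the neural network purely to keep the sketch self-contained and runnable, and all sample values are hypothetical:

```python
import numpy as np

def train_band_model(mod_freqs, t0_samples):
    """Fit a stand-in regressor mapping modulation-frequency samples to
    reverberation-time samples T_0 for one octave band (q >= 2 samples).
    The patent trains a neural network here; a linear fit on [f, 1]
    features only illustrates the input/output pairing."""
    x = np.column_stack([np.asarray(mod_freqs, float),
                         np.ones(len(mod_freqs))])
    coef, _, _, _ = np.linalg.lstsq(x, np.asarray(t0_samples, float), rcond=None)
    return lambda f: float(coef[0] * f + coef[1])

# Hypothetical samples for one band; one such model is trained per band (p models).
models = [train_band_model([0.63, 1.25, 2.5], [0.8, 0.7, 0.5]) for _ in range(7)]
```

Each trained callable plays the role of the per-band voice quality model that later predicts T from a filtered signal band.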
When the display module displays the voice quality evaluation result at the front end, the adopted display modes include, but are not limited to: displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamically moving signal; or displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic wifi-style signal icon; or displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic dashboard; or displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic progress bar.
It should be noted that the voice quality evaluation device for stock voice files in the foregoing embodiment is illustrated only with the above division of functional modules when performing a voice quality evaluation service; in practical applications, the above functions may be distributed among different functional modules as needed, that is, the internal structure of the system may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment and the voice quality evaluation method embodiment provided above belong to the same concept, that is, the device is based on the method; its specific implementation process is described in detail in the method embodiments and is not repeated here.
In addition, as shown in fig. 7, this embodiment also provides a voice quality evaluation system for stock voice files, the evaluation system comprising:
at least one voice receiving device, configured to receive a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device; preferably, the voice receiving device is a voice sensor;
at least one display device, configured to display the voice quality evaluation result at the front end;
and an intelligent device, configured to receive the target voice signal sent by the at least one voice receiving device, perform real-time calculation on the target voice signal according to a local voice quality evaluation method to obtain a corresponding voice quality evaluation result, and send the voice quality evaluation result to the at least one display device for front-end display.
Also, the present embodiment provides an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, for storing program instructions which, when read and executed by the one or more processors, perform the operations of any one of the foregoing voice quality evaluation methods for stock voice files.
The specific execution details and corresponding beneficial effects of the voice quality evaluation method executed through the program instructions are consistent with the foregoing description of the method and are not repeated here.
In addition, this embodiment also provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements any one of the foregoing voice quality evaluation methods for stock voice files.
All of the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, that is, any number of embodiments can be combined to meet the requirements of different application scenarios; these combinations fall within the protection scope of the present application and are not described here again.
It should be understood that the above-mentioned embodiments are merely preferred embodiments of the present application and are not intended to limit the present application, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A speech quality evaluation method for stock speech files, the evaluation method comprising:
receiving a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device;
calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and performing front-end display on the voice quality evaluation result.
2. The method for evaluating according to claim 1, wherein said obtaining a corresponding speech quality evaluation result by real-time calculation based on the target speech signal comprises:
performing feature extraction on the target voice signal to obtain at least one group of target features;
and taking the at least one group of target features as input, and obtaining a corresponding voice quality evaluation result through a pre-trained voice quality model.
3. The method of claim 2, wherein the voice quality evaluation result includes, but is not limited to, one of a voice transmission index or a mean opinion score.
4. The method for evaluating according to claim 2, wherein when the voice quality evaluation result includes a voice transmission index, the calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result comprises:
respectively extracting features of the target voice signal according to p different octave filtering signal bands to obtain corresponding p groups of target features, wherein p is more than or equal to 2;
and obtaining a voice quality evaluation result of the first target position corresponding to the target voice signal based on the p groups of target features.
5. The evaluation method according to claim 4, wherein said performing feature extraction on the target speech signal according to p different octave filtering signal bands to obtain p corresponding sets of target features respectively comprises:
respectively filtering the target voice signal to obtain p groups of different octave filtering signal bands, wherein any group of octave filtering signal bands comprises n modulation frequencies f_m, with n ≥ 1 and m ≥ 1;
Respectively carrying out envelope extraction on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics;
taking an octave filtering signal band corresponding to any group of sub-band envelope characteristics as input, and respectively obtaining p groups of reverberation time T through a pre-trained voice quality model corresponding to the octave filtering signal band;
and respectively obtaining p groups of corresponding target characteristics based on the p groups of reverberation time T, wherein the target characteristics are modulation transfer function values.
6. The evaluation method according to claim 5, wherein said separately performing envelope extraction on p different sets of octave filtered signal bands to obtain p sets of subband envelope features comprises:
and respectively carrying out half-wave envelope detection on the p groups of different octave filtering signal bands to obtain p groups of sub-band envelope characteristics.
7. The evaluation method of claim 5, wherein the taking an octave filtering signal band corresponding to any group of the sub-band envelope characteristics as input and respectively obtaining the p groups of reverberation times T through the pre-trained voice quality model corresponding to that octave filtering signal band comprises:
dividing any octave filtering signal band into N continuous voice segments with equal duration, wherein N is more than or equal to 2;
performing feature extraction on any voice segment of the N voice segments included in any octave filtering signal band through a combined structure of one or more of a convolutional neural network, a linear connection layer, an activation layer and a normalization layer to obtain a matrix of shape [P, Q], thereby obtaining N corresponding voice segment features;
interacting any one of the obtained N voice segment features through a combined structure of one or more of a long short-term memory module, a multi-head/single-head attention module, a linear connection layer, an activation layer and a normalization layer to obtain corresponding voice segment interaction features;
predicting, based on the obtained N voice segment interaction features, through a linear regression layer or a classification layer respectively, to obtain N reverberation times T_N respectively corresponding to the N voice segment interaction features;
averaging the N reverberation times T_N corresponding to any octave filtering signal band to obtain the p groups of reverberation times T respectively corresponding to the p octave filtering signal bands.
8. The evaluation method according to claim 5, wherein the obtaining of the corresponding p sets of target features based on the p sets of reverberation times T respectively comprises:
obtaining, based on the value of any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} of any modulation frequency f_m of any group of octave filtering signal bands.
9. The evaluation method according to claim 8,
the obtaining, based on the p groups of target features, the voice quality evaluation result of the first target position corresponding to the target voice signal comprises:
based on any of said modulation transfer function values m_{k,f_m}, obtaining the effective signal-to-noise ratio SNR_{eff,k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
based on any of said effective signal-to-noise ratios SNR_{eff,k,f_m}, obtaining the transmission index TI_{k,f_m} of any modulation frequency f_m of the corresponding octave filtering signal band k;
calculating, from the n transmission indexes TI_{k,f_m} of any octave filtering signal band k, the modulation transfer index M_k of the corresponding octave filtering signal band k;
and calculating, based on the modulation transfer indexes M_k of the p octave filtering signal bands, the voice quality evaluation result of the first target position corresponding to the target voice signal.
10. The evaluation method according to any one of claims 2 to 9, wherein the evaluation method further comprises training p speech quality models corresponding to the p different octave filtered signal bands, respectively, in advance, comprising:
obtaining p corresponding groups of different octave filtering signal band sample sets based on any stock voice file in a stock voice file sample set, wherein any group of octave filtering signal band sample sets comprises q modulation frequency samples and q corresponding impulse response samples, and any impulse response sample comprises a reverberation time sample T_0, with q ≥ 2;
taking the q modulation frequency samples as input and the q corresponding reverberation time samples T_0 as output, and training, based on a neural network, to obtain the p voice quality models respectively corresponding to the p octave filtering signal bands.
11. The evaluation method according to claim 1, wherein the manner of displaying the voice quality evaluation result at the front end includes, but is not limited to:
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamically moving signal; or
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic wifi-style signal icon; or
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic dashboard; or
displaying the voice quality evaluation result on an interface in the form of a numerical value and a dynamic progress bar.
12. A speech quality evaluation apparatus for a stock speech file, characterized by comprising:
the receiving module is used for receiving a target voice signal, and the target voice signal comprises a target stock voice file played by a target sound-producing device;
the processing module is used for calculating in real time according to the target voice signal to obtain a corresponding voice quality evaluation result;
and the display module is used for carrying out front-end display on the voice quality evaluation result.
13. A speech quality evaluation system for stock speech files, the evaluation system comprising:
at least one voice receiving device, configured to receive a target voice signal, wherein the target voice signal comprises a target stock voice file played by a target sound-producing device;
the display device is used for carrying out front-end display on the voice quality evaluation result;
and an intelligent device, configured to receive the target voice signal sent by the at least one voice receiving device, perform real-time calculation on the target voice signal according to the method of any one of claims 1 to 11 to obtain a corresponding voice quality evaluation result, and send the voice quality evaluation result to the at least one display device for front-end display.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations according to any one of claims 1 to 11; and
a display associated with the one or more processors for displaying in real-time speech quality assessment results obtained after execution of the program instructions by the one or more processors.
15. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 11.
CN202211115342.9A 2022-09-14 2022-09-14 Voice quality evaluation method, device and system for stock voice file Pending CN115512718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211115342.9A CN115512718A (en) 2022-09-14 2022-09-14 Voice quality evaluation method, device and system for stock voice file

Publications (1)

Publication Number Publication Date
CN115512718A true CN115512718A (en) 2022-12-23

Family

ID=84504340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211115342.9A Pending CN115512718A (en) 2022-09-14 2022-09-14 Voice quality evaluation method, device and system for stock voice file

Country Status (1)

Country Link
CN (1) CN115512718A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116564351A (en) * 2023-04-03 2023-08-08 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment
CN116564351B (en) * 2023-04-03 2024-01-23 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination