CN112614510A - Audio quality evaluation method and device - Google Patents

Audio quality evaluation method and device

Info

Publication number
CN112614510A
CN112614510A (application CN202011540097.7A)
Authority
CN
China
Prior art keywords
phoneme
audio
evaluated
time
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011540097.7A
Other languages
Chinese (zh)
Other versions
CN112614510B (en)
Inventor
元海明
刘鲁鹏
王晓红
陈佳路
高强
夏龙
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202011540097.7A
Publication of CN112614510A
Application granted
Publication of CN112614510B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an audio quality evaluation method and an audio quality evaluation device. The audio quality evaluation method comprises the following steps: acquiring audio to be evaluated and reference audio; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; and determining the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. By calibrating the audio at the phoneme level and using the phonemes as examination weights to correct the quality evaluation score, the method characterizes the quality of the audio to be evaluated more accurately.

Description

Audio quality evaluation method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to an audio quality assessment method and apparatus, a computing device, and a computer-readable storage medium.
Background
In recent years, with the rapid development of networks, various audio processing technologies and audio transmission technologies have emerged, and since the subjective feelings of communication users and consumers about audio ultimately depend on the audio quality, the evaluation of audio quality becomes an important research topic.
At present, one of the most widely used objective speech evaluation methods is Perceptual Evaluation of Speech Quality (PESQ), which gives a score from -0.5 to 4.5 representing the objective MOS distance between a test audio and a comparison audio. However, what constitutes good quality differs across application scenarios, and effective audio feature segments may be erroneously removed when the test audio is preprocessed; in such scenarios the PESQ method evaluates audio quality poorly and cannot give an accurate score.
Therefore, how to solve the above technical problems has become an urgent issue for those skilled in the art.
Disclosure of Invention
In view of the above, embodiments of the present application provide an audio quality assessment method and apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defects in the prior art.
According to a first aspect of embodiments of the present application, there is provided an audio quality assessment method, including:
acquiring audio to be evaluated and reference audio;
extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio;
setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence;
calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
and determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
Optionally, the extracting of the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated includes:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
Optionally, extracting a reference phoneme-time sequence corresponding to the reference audio includes:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
Optionally, the setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation policy to generate a reference phoneme-time-weight sequence includes:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
Optionally, the calculating a phoneme distance between the phoneme-time series to be evaluated and a corresponding phoneme in the reference phoneme-time series includes:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
Optionally, determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence includes:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
Optionally, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the method further includes:
and preprocessing the audio to be evaluated to obtain the preprocessed audio to be evaluated.
Optionally, the pre-processing the audio to be evaluated includes:
and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
According to a second aspect of embodiments of the present application, there is provided an audio quality evaluation apparatus including:
the acquisition module is configured to acquire audio to be evaluated and reference audio;
the extraction module is configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and extract a reference phoneme-time sequence corresponding to the reference audio;
a setting module configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generate a reference phoneme-time-weight sequence;
a calculation module configured to calculate a phoneme distance between the phoneme-time series to be evaluated and a corresponding phoneme in the reference phoneme-time series to obtain a phoneme distance-time series;
a determination module configured to determine a quality assessment score for the audio to be assessed based on the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the audio quality assessment method when executing the instructions.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the audio quality assessment method.
According to the audio quality evaluation method provided by the embodiments of the present application, the audio to be evaluated and the reference audio are acquired; a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio are extracted; a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence is calculated to obtain a phoneme distance-time sequence; and the quality evaluation score of the audio to be evaluated is determined according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The method is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flow chart of an audio quality assessment method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an audio quality assessment method provided by an embodiment of the present application;
FIG. 4 is a flowchart of an audio quality assessment method applied to a spoken language evaluation scenario according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an audio quality evaluation apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of this application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.
Automatic Speech Recognition (ASR): a technology for converting human speech into text.
Perceptual Evaluation of Speech Quality (PESQ): an objective speech quality assessment method that provides an objective MOS value.
Mean Opinion Score (MOS): in international standards, MOS values are used uniformly to evaluate voice quality.
Phoneme: the smallest phonetic unit, divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, one action forming one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable ā (a) contains only one phoneme, while a syllable such as pí contains two phonemes.
In the present application, an audio quality assessment method and apparatus, a computing device, and a computer-readable storage medium are provided, which are described in detail in the following embodiments one by one.
FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes an access device 140 that enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the audio quality assessment method shown in fig. 2. Fig. 2 shows a flow chart of an audio quality assessment method according to an embodiment of the present application, comprising steps 202 to 210.
Step 202: and acquiring the audio to be evaluated and the reference audio.
The audio quality evaluation method provided by the application has a plurality of application scenes, such as spoken language testing, audio matching and the like, and the specific application scene of the audio quality evaluation method is not limited in the application.
The audio to be evaluated is the audio whose quality needs to be assessed, such as the speech of a test taker in a spoken-language test, or the audio to be matched in an audio-matching task.
Correspondingly, the reference audio is a standard audio. In practical applications, the higher the score of the audio to be evaluated against the reference audio, the higher the quality of the audio to be evaluated.
In a specific embodiment provided by the present application, taking an audio matching scenario as an example, the audio to be evaluated is a mobile phone recording of the sentence "xiǎo bái chī zǎo fàn" ("Xiao Bai eats breakfast"), and the reference audio is the standard audio of the same sentence.
Step 204: and extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting the reference phoneme-time sequence corresponding to the reference audio.
Optionally, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the method further includes:
and preprocessing the audio to be evaluated to obtain the preprocessed audio to be evaluated.
Specifically, the preprocessing the audio to be evaluated includes: and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
In practical applications, the obtained audio to be evaluated is generally not pure: it contains noise and other interference that can degrade audio quality. The audio to be evaluated therefore needs to be preprocessed. There are many preprocessing methods, such as noise reduction and speech enhancement, and at least one of them may be selected to perform the corresponding preprocessing on the audio to be processed.
When the audio to be evaluated is preprocessed, a traditional audio processing method may be used, such as the noise reduction module of Web Real-Time Communication (WebRTC), or an audio processing model based on a deep neural network may be used instead.
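As an illustration only (the text above names WebRTC's noise-reduction module or a deep neural network model but gives no implementation), the sketch below uses a simple high-pass filter as a stand-in preprocessing step; the 80 Hz cutoff and 16 kHz sample rate are assumptions.

```python
# Minimal preprocessing sketch. This is NOT the WebRTC noise-reduction module the
# text refers to; it only illustrates where a denoising/enhancement step would sit.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Remove low-frequency rumble below an assumed 80 Hz cutoff before evaluation."""
    b, a = butter(4, 80, btype="highpass", fs=sample_rate)
    return filtfilt(b, a, audio)
```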
In a specific embodiment provided by the present application, following the above example, the audio to be evaluated is a mobile phone recording of "xiǎo bái chī zǎo fàn". While the user was recording with the mobile phone, a car horn was captured at the same time, so the audio to be evaluated needs noise reduction to remove the horn noise, and speech enhancement is applied to the "xiǎo bái chī zǎo fàn" audio at the same time.
Specifically, the extracting of the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated includes:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
In practical applications, after obtaining the audio to be evaluated, the audio to be evaluated may be subjected to speech recognition through a preset speech recognition technology, and the automatic speech recognition technology may convert speech into text, where the text may be of many types, such as binary code, character sequence, phoneme sequence, and the like.
Phonemes are the smallest units of speech, divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable, one action forming one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable ā (a) contains only one phoneme, while a syllable such as pí contains two phonemes.
After the audio to be evaluated is processed by a speech recognition technology, the phoneme sequence corresponding to the audio to be evaluated can be extracted. For example, if the audio to be evaluated is "the weather is nice today" (jīn tiān tiān qì bù cuò), the phoneme sequence "j, in1, t, i, an1, t, i, an1, q, i4, b, u2, c, uo4" can be extracted, where the digit 1 in the phoneme "in1" represents the tone.
In practical applications, after the audio to be evaluated has undergone noise reduction or speech enhancement, the audio may be damaged during processing, so the phoneme recognition result may contain errors relative to the real situation, such as phoneme loss or phoneme substitution; for example, the real phonemes "j, in1, t, i, an1" may be recognized as "j, in1, t, i, ao1".
The speech recognition technology can further align phonemes with the audio in time; for example, the phoneme "j" corresponds to the 10th to 12th milliseconds of the audio to be evaluated, and the phoneme "in1" corresponds to the 13th to 16th milliseconds. In practical applications, the phoneme-time alignment can also be expressed by dividing the audio into fixed time slots, for example one slot per millisecond, so that the phonemes corresponding to the 10th to 16th milliseconds are "j, j, j, in1, in1, in1, in1". The specific form of the phoneme-time sequence to be evaluated is not limited in this application and depends on the practical application.
In a specific embodiment provided by the present application, following the above example, the audio to be evaluated is "xiǎo bái chī zǎo fàn" ("Xiao Bai eats breakfast"). Its phonemes are determined by the speech recognition technology to be "x, i, ao3, b, ao2, ch, i1, z, ao3, f, an4", and the generated phoneme-time sequence to be evaluated is shown in Table 1 below.
TABLE 1
Phoneme to be evaluated Time (millisecond)
x 5-200
i 201-320
ao3 321-500
b 501-600
ao2 601-740
ch 741-889
i1 890-1200
z 1201-1396
ao3 1397-1525
f 1526-1650
an4 1651-1950
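To make the data structure concrete, the following sketch encodes Table 1 as a list of (phoneme, start, end) triples and shows the per-millisecond expansion described earlier; this representation is an illustrative assumption rather than a format prescribed by the application.

```python
# Phoneme-time sequence for the audio to be evaluated, transcribed from Table 1.
# Times are in milliseconds.
eval_phoneme_time = [
    ("x", 5, 200), ("i", 201, 320), ("ao3", 321, 500), ("b", 501, 600),
    ("ao2", 601, 740), ("ch", 741, 889), ("i1", 890, 1200), ("z", 1201, 1396),
    ("ao3", 1397, 1525), ("f", 1526, 1650), ("an4", 1651, 1950),
]

def expand_per_millisecond(seq):
    """Alternative per-millisecond view: one phoneme label per 1 ms slot."""
    frames = {}
    for phoneme, start, end in seq:
        for t in range(start, end + 1):
            frames[t] = phoneme
    return frames

frames = expand_per_millisecond(eval_phoneme_time)
print(frames[5], frames[1950])  # 'x' 'an4'
```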
Specifically, the extracting of the reference phoneme-time sequence corresponding to the reference audio includes:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
The method for extracting the reference phoneme-time sequence corresponding to the reference audio is the same as the method for extracting the phoneme-time sequence to be evaluated, and therefore, the specific method for extracting the reference phoneme-time sequence is described in the above description of extracting the phoneme-time sequence to be evaluated, and is not described herein again.
In one embodiment provided by the present application, following the above example, the reference audio is "xiǎo bái chī zǎo fàn" ("Xiao Bai eats breakfast"). Its phonemes are determined by the speech recognition technology to be "x, i, ao3, b, ai2, ch, i1, z, ao3, f, an4", and the generated reference phoneme-time sequence is shown in Table 2 below.
TABLE 2
Reference phoneme Time (millisecond)
x 7-202
i 203-334
ao3 335-514
b 515-620
ai2 621-760
ch 761-909
i1 910-1211
z 1212-1407
ao3 1407-1540
f 1541-1669
an4 1670-1987
Step 206: and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence.
The preset evaluation strategy is an evaluation strategy determined according to a specific evaluation task, and in order to adapt to different evaluation tasks, different weight values need to be set for each reference phoneme, and a corresponding reference phoneme-time-weight sequence is generated, wherein the reference phoneme-time-weight sequence includes pronunciation time and weight value corresponding to each reference phoneme.
Specifically, the setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation policy to generate a reference phoneme-time-weight sequence includes:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
In some evaluation tasks in practical application, if the pronunciation of the initials is of more concern, the weight values of the initials in the phoneme sequence need to be set higher than those of the finals; if the pronunciation of the finals is of more concern, the weight values of the finals need to be set higher than those of the initials. After the weight value of each phoneme type has been determined, a corresponding weight value can be set for each reference phoneme in the reference phoneme-time sequence. For example, in an evaluation task that focuses on the pronunciation of the initials, the weight value of the initials can be set to 1.5 and the weight value of the finals to 0.7; in an evaluation task that focuses on the pronunciation of the finals, the weight value of the initials can instead be set lower than that of the finals.
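The sketch below shows one way to assign weights by phoneme type. The initial/final classification relies on the notation of Tables 1 and 2 (finals carry a tone digit, initials do not); this heuristic and the helper names are illustrative assumptions, while the 1.5/0.7 weights match the initial-focused example above.

```python
# Assign a weight to each reference phoneme according to a preset evaluation strategy.
def phoneme_type(phoneme: str) -> str:
    # In the tables above, finals carry a tone digit ("ao3", "an4"); initials do not ("x", "ch").
    return "final" if any(ch.isdigit() for ch in phoneme) else "initial"

def build_weight_sequence(ref_phoneme_time, strategy):
    """Turn a reference phoneme-time sequence into a reference phoneme-time-weight sequence."""
    return [(p, start, end, strategy[phoneme_type(p)]) for p, start, end in ref_phoneme_time]

# Example strategy for a task that focuses on the initials: initials 1.5, finals 0.7.
strategy = {"initial": 1.5, "final": 0.7}
ref_phoneme_time = [("x", 7, 202), ("ao3", 335, 514), ("b", 515, 620)]  # excerpt of Table 2
print(build_weight_sequence(ref_phoneme_time, strategy))
# [('x', 7, 202, 1.5), ('ao3', 335, 514, 0.7), ('b', 515, 620, 1.5)]
```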
In a specific embodiment provided by the present application, following the above example, the weight value of the initials is set to 1.3 and the weight value of the finals to 0.6 according to the evaluation strategy corresponding to the evaluation task. Corresponding weight values are set for the phonemes in the reference phoneme-time sequence, and the generated reference phoneme-time-weight sequence is shown in Table 3 below.
TABLE 3
[Table 3 is reproduced as images in the original publication; it lists each reference phoneme of Table 2 together with its time range and its assigned weight value (1.3 for initials, 0.6 for finals).]
Step 208: and calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence.
Perceptual Evaluation of Speech Quality (PESQ) is an objective Mean Opinion Score (MOS) evaluation method specified in ITU-T Recommendation P.862. It is currently one of the most widely used objective speech evaluation methods; it gives a score from -0.5 to 4.5 representing the objective MOS distance between the audio to be evaluated and the reference audio.
In the present application, the MOS distance corresponding to each phoneme is calculated from the phoneme-time series to be evaluated and the reference phoneme-time series.
Specifically, the calculating of the phoneme distance between the phoneme-time series to be evaluated and the corresponding phoneme in the reference phoneme-time series includes:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
In practical applications, when the phoneme is extracted from the audio to be evaluated, the audio to be evaluated may be damaged in the processing process after the noise reduction or speech enhancement processing, which may cause an error between the result of the phoneme recognition and the real situation, and therefore, it is necessary to perform phoneme alignment between the phoneme-time series to be evaluated and the reference phoneme-time series.
There are many possible implementations of the phoneme alignment. When the phoneme sequence of the audio to be evaluated is consistent with the phoneme sequence of the reference audio, the distance can be calculated directly. If the phoneme sequence of the audio to be evaluated has partial deletions or errors, but the error or deletion ratio is below a threshold, the alignment result can be corrected by edit distance. If the difference between the phoneme sequence of the audio to be evaluated and that of the reference audio exceeds a preset threshold, the correspondence in time is determined according to the similarity of the audio, and the phoneme-time sequence to be evaluated is aligned with the reference phoneme-time sequence accordingly.
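The application does not fix an alignment algorithm; as one concrete stand-in for the edit-distance correction mentioned above, the sketch below pairs corresponding phonemes using Python's difflib.SequenceMatcher. The pairing rule for substituted spans is an assumption.

```python
# Align the reference phoneme labels with the (possibly erroneous) evaluated labels.
from difflib import SequenceMatcher

def align_phonemes(ref_labels, eval_labels):
    """Return (ref_index, eval_index) pairs for phonemes that correspond after alignment."""
    matcher = SequenceMatcher(None, ref_labels, eval_labels, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            # Matched or substituted spans of equal length pair up one-to-one;
            # inserted or deleted phonemes are left unpaired.
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs

ref_labels  = ["x", "i", "ao3", "b", "ai2", "ch"]
eval_labels = ["x", "i", "ao3", "b", "ao2", "ch"]   # "ai2" recognized as "ao2"
print(align_phonemes(ref_labels, eval_labels))
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```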
After the phoneme alignment, the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme audio segment in the reference phoneme-time sequence can be obtained through PESQ.
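A possible way to compute the per-phoneme distance is sketched below using the open-source `pesq` package; the package choice is an assumption, since the application only names the PESQ method. PESQ also expects segments of a certain minimum length, so very short phoneme segments may need surrounding context or padding in practice, a detail not addressed here.

```python
# Per-phoneme PESQ "distance": score the aligned audio segments of each phoneme pair.
from pesq import pesq  # open-source PESQ implementation (pip install pesq)

def phoneme_distances(ref_audio, eval_audio, ref_seq, eval_seq, pairs, sr=16000):
    """ref_seq/eval_seq are (phoneme, start_ms, end_ms) lists; pairs index into them."""
    distances = []
    for ri, ei in pairs:
        ph, r_start, r_end = ref_seq[ri]
        _, e_start, e_end = eval_seq[ei]
        ref_seg = ref_audio[int(r_start * sr / 1000):int(r_end * sr / 1000)]
        eval_seg = eval_audio[int(e_start * sr / 1000):int(e_end * sr / 1000)]
        # PESQ returns a score in roughly [-0.5, 4.5]; it is used here as the
        # per-phoneme MOS distance d for this phoneme.
        distances.append((ph, pesq(sr, ref_seg, eval_seg, "wb")))
    return distances
```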
In a specific embodiment provided by the present application, the phoneme-time series to be evaluated shown in table 1 and the reference phoneme-time series shown in table 2 are subjected to phoneme alignment, and the distance between the phoneme-time series to be evaluated and each phoneme in the reference phoneme-time series is obtained through PESQ calculation, and the obtained phoneme distance-time series is ("x-d, i-d, ao3-d, b-d, ai2-d, ch-d, i1-d, z-d, ao3-d, f-d, an 4-d"), where "x-d" represents the phoneme distance between the phoneme x to be evaluated and the reference phoneme x.
Step 210: and determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
After the weight information and phoneme distance-time sequence corresponding to each reference phoneme are determined, the quality evaluation score of the audio to be evaluated can be calculated according to the weight information and phoneme distance-time sequence corresponding to each reference phoneme.
Specifically, determining the quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence includes:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
In practical application, for each phoneme, the phoneme distance from the phoneme distance-time sequence and the phoneme weight of the corresponding phoneme from the reference phoneme-time-weight sequence can be combined into the phoneme score of the current phoneme, and the quality evaluation score of the audio to be evaluated can then be determined from the phoneme score of each phoneme.
In a specific embodiment provided by the present application, following the above example, if the weight value corresponding to the phoneme x is 1.3, and the corresponding phoneme distance is x-d, the phoneme score corresponding to the phoneme x is 1.3 × (x-d), and so on, and finally all the phoneme scores are added, so as to determine the quality evaluation score of the audio to be evaluated.
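Bringing the pieces together, the following sketch computes the final score exactly as described above (weight times distance per phoneme, then summed over all phonemes); the per-phoneme distance values are placeholders, not results from the embodiment.

```python
# Final quality score: sum of weight * distance over all aligned phonemes.
def quality_score(weights, distances):
    """weights/distances: dicts keyed by phoneme label (or phoneme position)."""
    return sum(w * distances[k] for k, w in weights.items())

# Hypothetical per-phoneme PESQ distances, for illustration only.
weights   = {"x": 1.3, "i": 0.6, "ao3": 0.6}
distances = {"x": 3.8, "i": 3.5, "ao3": 2.9}
print(quality_score(weights, distances))  # 1.3*3.8 + 0.6*3.5 + 0.6*2.9, approximately 8.78
```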
According to the audio quality evaluation method provided by the embodiments of the present application, the audio to be evaluated and the reference audio are acquired; a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio are extracted; a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence is calculated to obtain a phoneme distance-time sequence; and the quality evaluation score of the audio to be evaluated is determined according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The method is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
The audio quality assessment method provided by an embodiment of the present application is further explained below with reference to FIG. 3 and FIG. 4. FIG. 3 shows a schematic diagram of the audio quality assessment method. As shown in FIG. 3, a reference audio x and an audio y to be assessed are obtained; speech recognition is performed on the reference audio x to obtain a reference phoneme-time sequence, and on the audio y to be assessed to obtain a phoneme-time sequence to be assessed; a phoneme distance-time sequence d(t) is calculated from the reference phoneme-time sequence and the phoneme-time sequence to be assessed; the weights w(t) corresponding to the phonemes in the reference phoneme-time sequence are determined according to the assessment task; and the quality assessment score MOS of the audio to be assessed is calculated from the phoneme distance-time sequence d(t) and the phoneme weights w(t), thereby determining the quality of the audio to be assessed.
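Written as a formula (the application describes the procedure but gives no explicit closed form), the combination shown in FIG. 3 amounts to Score = Σ_t w(t) · d(t), summed over the target time points t, where w(t) is the phoneme weight at time t and d(t) is the corresponding phoneme distance.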
Fig. 4 shows a flowchart of an audio quality assessment method applied to a spoken language assessment scenario, where the audio quality assessment method is described by taking spoken language assessment as an example, and includes steps 402 to 416.
Step 402: and acquiring the audio to be evaluated and the reference audio.
In the specific embodiment provided by the present application, the obtained audio to be evaluated is a mobile phone recording of the tongue twister "chī pútáo bù tǔ pútáo pí" ("eat grapes without spitting out the grape skins"), and the reference audio is the standard evaluation audio of the same sentence.
Step 404: and carrying out noise reduction processing on the audio to be evaluated to obtain the audio to be evaluated after noise reduction.
In the specific embodiment provided by the present application, noise reduction is performed on the mobile phone recording of "chī pútáo bù tǔ pútáo pí" to remove noise from the recording and improve the quality of the audio to be evaluated.
Step 406: and respectively extracting the phoneme sequences of the audio to be evaluated and the reference audio and the time corresponding to each phoneme according to a preset speech recognition method, and generating a phoneme-time sequence to be evaluated and a reference phoneme-time sequence.
In the specific embodiment provided by the present application, the phoneme-time series to be evaluated of the audio to be evaluated and the reference phoneme-time series of the reference audio are respectively extracted according to a speech recognition technology.
Step 408: and determining the weight value of each phoneme type according to a preset evaluation strategy and the phoneme type.
In the specific embodiment provided by the present application, the spoken-language evaluation focuses on the initials, so the weight value of the initials is set to 1.6 and the weight value of the finals to 0.8.
Step 410: and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
In the specific embodiment provided by the present application, according to the weight value w corresponding to each phoneme (the initial weight 1.6 and the final weight 0.8 set in the previous step), the correspondence w(t) between the pronunciation time t of each phoneme and its weight value w can be obtained.
Step 412: and performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence, and calculating phoneme distances between the phoneme-time sequence to be evaluated and the corresponding audio segments in the reference phoneme-time sequence after the phoneme alignment to obtain a phoneme distance-time sequence.
In the specific embodiment provided by the present application, the phoneme-time sequence to be evaluated and the reference phoneme-time sequence are time-aligned, and the phoneme distance d between corresponding phonemes in the two sequences is then calculated according to the PESQ algorithm. Since each phoneme corresponds to a time t, each phoneme distance d likewise has a correspondence d(t) with time t.
Step 414: and determining the phoneme score corresponding to the target time point according to the phoneme weight and the phoneme distance corresponding to each target time point in the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
In the specific embodiment provided by the present application, according to the correspondence w(t) between the time t and the weight value w of each phoneme and the correspondence d(t) between the phoneme distance d and the time t, the phoneme score corresponding to each phoneme can be determined as w(t)·d(t).
Step 416: and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
In the specific embodiment provided by the present application, the quality assessment score of the audio to be assessed is determined from the w(t)·d(t) values of all phonemes; the higher the final quality assessment score, the better the quality of the audio to be assessed.
According to the audio quality evaluation method provided by the embodiments of the present application, the audio to be evaluated and the reference audio are acquired; a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio are extracted; a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence is calculated to obtain a phoneme distance-time sequence; and the quality evaluation score of the audio to be evaluated is determined according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The method is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
Corresponding to the above method embodiment, the present application further provides an audio quality assessment apparatus embodiment, and fig. 5 shows a schematic structural diagram of an audio quality assessment apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
an obtaining module 502 configured to obtain an audio to be evaluated and a reference audio;
an extracting module 504, configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extract a reference phoneme-time sequence corresponding to the reference audio;
a setting module 506 configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation policy, and generate a reference phoneme-time-weight sequence;
a calculating module 508 configured to calculate a phoneme distance between the phoneme-time series to be evaluated and a corresponding phoneme in the reference phoneme-time series to obtain a phoneme distance-time series;
a determining module 510 configured to determine a quality assessment score for the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
Optionally, the extracting module 504 is further configured to:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
Optionally, the extracting module 504 is further configured to:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
Optionally, the setting module 506 is further configured to:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
Optionally, the calculating module 508 is further configured to:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
Optionally, the determining module 510 is further configured to:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
Optionally, the apparatus further comprises:
and the preprocessing module is configured to preprocess the audio to be evaluated to obtain the preprocessed audio to be evaluated.
Optionally, the preprocessing module is further configured to perform noise reduction processing and/or speech enhancement processing on the audio to be evaluated.
The audio quality evaluation apparatus provided by the embodiments of the present application acquires the audio to be evaluated and the reference audio; extracts a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio; sets a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; calculates the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; and determines the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The apparatus is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
The above is a schematic scheme of an audio quality evaluation apparatus of the present embodiment. It should be noted that the technical solution of the audio quality estimation apparatus and the technical solution of the audio quality estimation method belong to the same concept, and details that are not described in detail in the technical solution of the audio quality estimation apparatus can be referred to the description of the technical solution of the audio quality estimation method.
It should be noted that the components in the device claims should be understood as the functional modules that are necessary for implementing the steps of the program flow or the steps of the method; they are not necessarily actual functional divisions or separations. A device claim defined by such a set of functional modules should be understood as a functional-module framework that implements the solution mainly through the computer program described in the specification, rather than as a physical device that implements the solution mainly through hardware.
There is also provided in an embodiment of the present application a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the audio quality assessment method when executing the instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the audio quality assessment method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the audio quality assessment method.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the audio quality assessment method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned audio quality evaluation method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned audio quality evaluation method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (11)

1. An audio quality assessment method, comprising:
acquiring audio to be evaluated and reference audio;
extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio;
setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence;
calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
and determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
2. The audio quality assessment method of claim 1, wherein extracting the phone-time series to be assessed corresponding to the audio to be assessed comprises:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
3. The audio quality assessment method of claim 1, wherein extracting the reference phoneme-time series corresponding to the reference audio comprises:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
4. The audio quality assessment method of claim 3, wherein setting a corresponding weight value for each reference phoneme in the reference phoneme-time series according to a preset assessment strategy to generate a reference phoneme-time-weight series comprises:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
5. The audio quality assessment method of claim 1, wherein calculating the phone distance of the phone-time series to be assessed from the corresponding phone in the reference phone-time series comprises:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
6. The audio quality assessment method of claim 1, wherein determining a quality assessment score for the audio to be assessed based on the reference phoneme-time-weight sequence and the phoneme distance-time sequence comprises:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
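
To make the per-time-point combination of claim 6 concrete, the sketch below looks up a weight and a distance at each target time point and maps them to a score; the span representation and the distance-to-score mapping are assumptions chosen for illustration only.

    # Illustrative sketch; the span lookup and distance-to-score mapping are assumptions.
    from typing import List, Tuple

    def value_at(t: float, spans: List[Tuple[float, float, float]], default: float = 0.0) -> float:
        """Return the value of the (start, end, value) span covering time t."""
        for start, end, value in spans:
            if start <= t < end:
                return value
        return default

    def quality_score(weight_spans, distance_spans, target_times) -> float:
        phoneme_scores, weights = [], []
        for t in target_times:
            w = value_at(t, weight_spans)
            d = value_at(t, distance_spans)
            phoneme_scores.append(w * max(0.0, 1.0 - d))   # per-time-point phoneme score
            weights.append(w)
        return 100.0 * sum(phoneme_scores) / sum(weights) if sum(weights) else 0.0

    # Toy usage: weight spans derived from the reference phoneme-time-weight sequence,
    # distance spans derived from the phoneme distance-time sequence.
    weight_spans = [(0.0, 0.5, 2.0), (0.5, 1.0, 1.0)]
    distance_spans = [(0.0, 0.5, 0.1), (0.5, 1.0, 0.4)]
    print(round(quality_score(weight_spans, distance_spans, [0.1, 0.3, 0.6, 0.9]), 1))
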
7. The audio quality assessment method according to claim 1, further comprising, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the following step:
and preprocessing the audio to be evaluated to obtain the preprocessed audio to be evaluated.
8. The audio quality assessment method of claim 7, wherein preprocessing the audio to be evaluated comprises:
and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
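
As an illustrative sketch of the preprocessing of claims 7 and 8, the fragment below applies pre-emphasis and a crude spectral-subtraction pass. A real system would likely use a dedicated noise-reduction or speech-enhancement front end, so this is an assumption-laden stand-in rather than the claimed processing.

    # Illustrative sketch; pre-emphasis plus a crude spectral-subtraction pass as a
    # stand-in for the noise reduction / speech enhancement of claims 7 and 8.
    import numpy as np

    def preprocess(wave: np.ndarray, sr: int, noise_seconds: float = 0.1) -> np.ndarray:
        # Pre-emphasis: mildly boosts high frequencies, a common speech front-end step.
        emphasized = np.append(wave[0], wave[1:] - 0.97 * wave[:-1])
        # Estimate a flat noise floor from the (assumed silent) leading samples.
        leading = emphasized[:max(1, int(noise_seconds * sr))]
        noise_floor = np.abs(np.fft.rfft(leading)).mean()
        # Subtract the floor from the magnitude spectrum, keep the original phase.
        spectrum = np.fft.rfft(emphasized)
        cleaned_mag = np.maximum(np.abs(spectrum) - noise_floor, 0.0)
        cleaned = np.fft.irfft(cleaned_mag * np.exp(1j * np.angle(spectrum)), len(emphasized))
        # Peak-normalize so downstream distance measures see a consistent level.
        peak = np.max(np.abs(cleaned))
        return cleaned / peak if peak > 0 else cleaned

    sr = 16000
    t = np.arange(sr) / sr
    noisy = np.sin(2 * np.pi * 300 * t) + 0.05 * np.random.randn(sr)
    print(preprocess(noisy, sr).shape)
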
9. An audio quality evaluation apparatus, comprising:
an acquisition module configured to acquire audio to be evaluated and reference audio;
an extraction module configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and to extract a reference phoneme-time sequence corresponding to the reference audio;
a setting module configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generate a reference phoneme-time-weight sequence;
a calculation module configured to calculate the phoneme distance between each phoneme in the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
a determination module configured to determine the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
10. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-8 when executing the instructions.
11. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 8.
CN202011540097.7A 2020-12-23 2020-12-23 Audio quality assessment method and device Active CN112614510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011540097.7A CN112614510B (en) 2020-12-23 2020-12-23 Audio quality assessment method and device

Publications (2)

Publication Number Publication Date
CN112614510A true CN112614510A (en) 2021-04-06
CN112614510B CN112614510B (en) 2024-04-30

Family

ID=75244507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011540097.7A Active CN112614510B (en) 2020-12-23 2020-12-23 Audio quality assessment method and device

Country Status (1)

Country Link
CN (1) CN112614510B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334164A (en) * 2002-10-24 2004-11-25 Toshimasa Ishihara System for learning pronunciation and identification of english phonemes "l" and "r"
US20150302848A1 (en) * 2014-04-21 2015-10-22 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 A kind of speech analysis method, device and storage medium
CN109754784A (en) * 2017-11-02 2019-05-14 华为技术有限公司 The method of the method and speech recognition of training Filtering Model
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 A kind of relevant Oral English Practice pronunciation error detection of text and quality score method
CN108257615A (en) * 2018-01-15 2018-07-06 北京物灵智能科技有限公司 A kind of user language appraisal procedure and system
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN109545243A (en) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluating method, device, electronic equipment and storage medium
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluating method, device, electronic equipment and storage medium
JP2020144213A (en) * 2019-03-06 2020-09-10 Kddi株式会社 Program, device and method for pronunciation evaluation using inter-model distance
CN110176249A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of appraisal procedure and device of spoken language pronunciation
CN110085261A (en) * 2019-05-16 2019-08-02 上海流利说信息技术有限公司 A kind of pronunciation correction method, apparatus, equipment and computer readable storage medium
CN110288977A (en) * 2019-06-29 2019-09-27 联想(北京)有限公司 A kind of data processing method, device and electronic equipment
CN111816210A (en) * 2020-06-23 2020-10-23 华为技术有限公司 Voice scoring method and device
CN111785299A (en) * 2020-08-13 2020-10-16 腾讯科技(深圳)有限公司 Voice evaluation method, device and equipment and computer storage medium
CN112017690A (en) * 2020-10-09 2020-12-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450429A (en) * 2021-07-26 2021-09-28 北京猿力未来科技有限公司 Track drawing method and device
CN113450429B (en) * 2021-07-26 2024-06-04 北京猿力未来科技有限公司 Track drawing method and device
CN117612566A (en) * 2023-11-16 2024-02-27 书行科技(北京)有限公司 Audio quality assessment method and related product
CN117612566B (en) * 2023-11-16 2024-05-28 书行科技(北京)有限公司 Audio quality assessment method and related product
CN117409778A (en) * 2023-12-14 2024-01-16 深圳市友杰智新科技有限公司 Decoding processing method, device, equipment and storage medium
CN117409778B (en) * 2023-12-14 2024-03-19 深圳市友杰智新科技有限公司 Decoding processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN111260761B (en) Method and device for generating mouth shape of animation character
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN112331180A (en) Spoken language evaluation method and device
CN112735371B (en) Method and device for generating speaker video based on text information
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
Borsos et al. Speechpainter: Text-conditioned speech inpainting
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN112614510B (en) Audio quality assessment method and device
CN109300339A (en) A kind of exercising method and system of Oral English Practice
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN114125506B (en) Voice auditing method and device
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Yousfi et al. Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
Vanderreydt et al. A Novel Channel estimate for noise robust speech recognition
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
Sabu et al. Improving the Noise Robustness of Prominence Detection for Children's Oral Reading Assessment
Mansouri et al. Human Laughter Generation using Hybrid Generative Models.
Pan et al. Mandarin vowel pronunciation quality evaluation by a novel formant classification method and its combination with traditional algorithms
CN112686041B (en) Pinyin labeling method and device
CN112786052B (en) Speech recognition method, electronic equipment and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant