CN112614510B - Audio quality assessment method and device - Google Patents

Audio quality assessment method and device Download PDF

Info

Publication number
CN112614510B
CN112614510B CN202011540097.7A CN202011540097A CN112614510B CN 112614510 B CN112614510 B CN 112614510B CN 202011540097 A CN202011540097 A CN 202011540097A CN 112614510 B CN112614510 B CN 112614510B
Authority
CN
China
Prior art keywords
phoneme
audio
evaluated
time
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011540097.7A
Other languages
Chinese (zh)
Other versions
CN112614510A (en
Inventor
元海明
刘鲁鹏
王晓红
陈佳路
高强
夏龙
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202011540097.7A priority Critical patent/CN112614510B/en
Publication of CN112614510A publication Critical patent/CN112614510A/en
Application granted granted Critical
Publication of CN112614510B publication Critical patent/CN112614510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an audio quality assessment method and device, wherein the audio quality assessment method comprises the following steps: acquiring audio to be evaluated and reference audio; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence, determining a quality evaluation score of the audio to be evaluated, calibrating the phoneme level of the audio by the method provided by the application, taking the phonemes as investigation weight correction quality evaluation scores, and more accurately representing the quality of the audio to be evaluated.

Description

Audio quality assessment method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to an audio quality assessment method and apparatus, a computing device, and a computer readable storage medium.
Background
In recent years, with the rapid development of networks, various audio processing technologies and audio transmission technologies are layered endlessly, and as subjective feelings of communication users and consumers on audio are ultimately dependent on audio quality, evaluation of audio quality is an important research topic.
One of the most widely adopted objective speech evaluation methods is an objective speech quality evaluation method (Perceptual evaluation of speech quality, PESQ) which gives a score of-0.5 to 4.5 and characterizes objective MOS distances of test audio and control audio, but in different application scenarios, actual evaluation values of audio quality are different, and when the test audio is preprocessed, the effective audio feature part may be erroneously eliminated, so that the PESQ method has a poor effect on the audio quality evaluation in such scenarios and cannot give a more accurate score.
Therefore, how to solve the above technical problems is a problem to be solved by the technicians.
Disclosure of Invention
In view of the above, embodiments of the present application provide an audio quality evaluation method and apparatus, a computing device and a computer readable storage medium, so as to solve the technical drawbacks in the prior art.
According to a first aspect of an embodiment of the present application, there is provided an audio quality assessment method, including:
acquiring audio to be evaluated and reference audio;
extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio;
setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence;
Calculating the phoneme distance between the phoneme to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
And determining a quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
Optionally, extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated includes:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and the time corresponding to each phoneme to be evaluated according to a preset voice recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
Optionally, extracting the reference phoneme-time sequence corresponding to the reference audio includes:
Extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset voice recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
Optionally, setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence includes:
Determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
And setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
Optionally, calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence includes:
Performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after the phoneme alignment and the corresponding audio fragment of the corresponding phoneme in the reference phoneme-time sequence by using an objective voice quality evaluation method.
Optionally, determining the quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence includes:
Determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and the phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
Optionally, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the method further includes:
and preprocessing the audio to be evaluated to obtain preprocessed audio to be evaluated.
Optionally, preprocessing the audio to be evaluated includes:
and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
According to a second aspect of an embodiment of the present application, there is provided an audio quality evaluation apparatus including:
An acquisition module configured to acquire audio to be evaluated and reference audio;
An extraction module configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and to extract a reference phoneme-time sequence corresponding to the reference audio;
the setting module is configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generate a reference phoneme-time-weight sequence;
A calculating module configured to calculate a phoneme distance between the phoneme-time sequence to be evaluated and a corresponding phoneme in the reference phoneme-time sequence, to obtain a phoneme distance-time sequence;
A determination module configured to determine a quality assessment score for the audio under assessment from the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
According to a third aspect of embodiments of the present application, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the audio quality assessment method when executing the instructions.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the audio quality assessment method.
According to the audio quality assessment method provided by the embodiment of the application, the audio to be assessed and the reference audio are obtained; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; the method for evaluating the audio quality provided by the application has the advantages that the steps are simplified, the use is convenient, only the audio level of the audio to be evaluated is calibrated, the phonemes are used as investigation weights according to a preset evaluation strategy, the quality evaluation score is corrected, and the quality of the audio to be evaluated can be more accurately represented according to application scenes.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flow chart of an audio quality assessment method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an audio quality assessment method according to an embodiment of the present application;
FIG. 4 is a flowchart of an audio quality assessment method applied to a spoken evaluation scenario, provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of an audio quality assessment apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
The terminology used in the one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the application. As used in one or more embodiments of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the application. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "in response to a determination" depending on the context.
First, terms related to one or more embodiments of the present invention will be explained.
Automatic speech recognition technology: (Automatic Speech Recognition, ASR), a technique that converts human speech into text.
Objective speech quality assessment method: (Perceptual evaluation of speech quality, PESQ), a method of providing objective MOS value evaluation.
Average subjective opinion score: (Mean opinion scores, MOS) in international standards, the MOS value is used uniformly to evaluate voice quality.
Phonemes: the method is characterized in that minimum voice units are divided according to the natural attribute of voice, the voice units are analyzed according to pronunciation actions in syllables, and one action forms a phoneme. Phonemes are divided into two major classes, vowels and consonants. For example, chinese syllables ā (o) have only one phoneme, and p a pi (p pi) has two phonemes.
In the present application, an audio quality evaluation method and apparatus, a computing device, and a computer-readable storage medium are provided, and detailed description is given in the following embodiments.
FIG. 1 illustrates a block diagram of a computing device 100, according to an embodiment of the application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. Processor 120 is coupled to memory 110 via bus 130 and database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 100, as well as other components not shown in FIG. 1, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 1 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the audio quality assessment method shown in fig. 2. Fig. 2 shows a flow chart of an audio quality assessment method according to an embodiment of the application, comprising steps 202 to 210.
Step 202: and acquiring the audio to be evaluated and the reference audio.
The audio quality evaluation method provided by the application has a plurality of application scenes, such as spoken language test, audio matching and the like, and the specific application scene of the audio quality evaluation method is not limited in the application.
The audio to be evaluated is the audio to be evaluated, such as the voice of a spoken language tester in a spoken language test, the audio to be matched in an audio matching process, and the like, and the audio to be evaluated is obtained in a plurality of ways, such as mobile phone recording, recording pen recording, microphone recording, and the like.
Correspondingly, in practical application, if the score of the audio to be evaluated and the score of the reference audio are higher, the quality of the audio to be evaluated is higher, in the spoken language test, the reference audio can be the audio used for scoring in the test, and in the audio matching, the reference audio is the matched audio.
In a specific embodiment provided by the application, taking an audio matching scene as an example, the obtained audio to be evaluated is a recording "snack breakfast" of a mobile phone, and the reference audio is a standard audio "snack breakfast".
Step 204: and extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio.
Optionally, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the method further includes:
and preprocessing the audio to be evaluated to obtain preprocessed audio to be evaluated.
Specifically, preprocessing the audio to be evaluated includes: and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
In practical applications, the audio to be evaluated is usually not pure audio, and noise, such as noise, etc., which may interfere with the quality of the audio, is usually obtained, so that there are many methods for preprocessing the audio to be evaluated, such as noise reduction processing, speech enhancement processing, etc., and at least one processing method may be optionally used to perform a corresponding preprocessing operation on the audio to be processed.
When the audio to be evaluated is preprocessed, a traditional audio processing method, such as a noise reduction module of Web instant messaging (Web Real-Time Communication, webrtc), or an audio processing model based on a deep neural network model, can be used.
In a specific embodiment provided by the application, along the use of the above example, the audio to be evaluated is a recorded "snack breakfast" of a mobile phone, wherein when the user records the sound of the whistle of the automobile at the same time, the noise reduction treatment is needed to be carried out on the audio to be evaluated, the noise of the whistle of the automobile in the audio to be evaluated is removed, and meanwhile, the audio of the "snack breakfast" is subjected to voice enhancement.
Specifically, extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated includes:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and the time corresponding to each phoneme to be evaluated according to a preset voice recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
In practical application, after the audio to be evaluated is obtained, the audio to be evaluated can be subjected to voice recognition through a preset voice recognition technology, the automatic voice recognition technology can convert voice into text, and the text has various types, such as binary codes, character sequences, phoneme sequences and the like.
The phonemes are the minimum phonetic units divided according to the natural attributes of the speech, and are analyzed according to the pronunciation actions in syllables, and one action constitutes one phoneme. Phonemes are divided into two major classes, vowels and consonants. For example, chinese syllables ā (o) have only one phoneme, and p a pi (p pi) has two phonemes.
After the audio to be evaluated is processed by the voice recognition technology, an audio sequence to be evaluated corresponding to the audio to be evaluated can be extracted, for example, the audio to be evaluated is "weather today is good", and phoneme sequences are "j, in1, t, i, an1, t, i, an1, q, i4, b, u2, c and uo4", wherein the number 1 in the phoneme "in1" represents a tone.
In practical applications, after the audio to be evaluated is subjected to noise reduction or speech enhancement, the audio to be evaluated may be damaged during the processing, so that errors may exist between the phoneme recognition result and the real situation, such as a phoneme loss situation and a phoneme replacement situation, for example, the real phonemes are "j, in1, t, i, an1" and may be recognized as "j, in1, t, i, ao1".
The speech recognition technology may also implement time alignment of phonemes with audio, for example, for the time in the audio to be evaluated corresponding to the phoneme "j" is 10 th to 12 th milliseconds, for the time in the audio to be evaluated corresponding to the phoneme "in1" is 13 th to 16 th milliseconds, in practical application, the time alignment manner of phonemes with time may also be represented by dividing according to time, for example, dividing the time by 1 millisecond every time, and then the phonemes corresponding to the 10 th to 16 th milliseconds are "j, j, j, in1, in1, in1", and the specific form of the phoneme-time sequence to be evaluated is not specifically limited in the present application, which is based on practical application.
In a specific embodiment of the present application, along with the above example, the audio to be evaluated is "snack breakfast", and the phonemes of the audio to be evaluated may be determined to be "x, i, ao3, b, ao2, ch, i1, z, ao3, f, an4" by using a speech recognition technology, and the generated phoneme-time sequence to be evaluated is referred to in table 1 below for convenience of explanation.
TABLE 1
Phonemes to be evaluated Time (millisecond)
x 5-200
i 201-320
ao3 321-500
b 501-600
ao 2 601-740
ch 741-889
i1 890-1200
z 1201-1396
ao3 1397-1525
f 1526-1650
an4 1651-1950
Specifically, extracting a reference phoneme-time sequence corresponding to the reference audio includes:
Extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset voice recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
The method for extracting the reference phoneme-time sequence corresponding to the reference audio is the same as the method for extracting the phoneme-time sequence to be evaluated, so that a specific method for extracting the reference phoneme-time sequence is referred to the description of extracting the phoneme-time sequence to be evaluated, and is not repeated here.
In a specific embodiment provided by the present application, along the above example, the reference audio is "snack breakfast", and the phonemes of the reference audio are "x, i, ao3, b, ai2, ch, i1, z, ao3, f, an4", which can be determined by the speech recognition technology, and the generated reference phoneme-time sequence is shown in the following table 2.
TABLE 2
Reference phonemes Time (millisecond)
x 7-202
i 203-334
ao3 335-514
b 515-620
ai2 621-760
ch 761-909
i1 910-1211
z 1212-1407
ao3 1407-1540
f 1541-1669
an4 1670-1987
Step 206: and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence.
The preset evaluation strategy is an evaluation strategy determined according to a specific evaluation task, in order to adapt to different evaluation tasks, different weight values are required to be set for each reference phoneme, and a corresponding reference phoneme-time-weight sequence is generated, wherein the reference phoneme-time-weight sequence comprises pronunciation time and weight values corresponding to each reference phoneme.
Specifically, setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence includes:
Determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
And setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
In some evaluation tasks of practical application, the weight value of the initial consonant in the phoneme sequence needs to be set higher than the weight value of the final sound, if the sound of the final sound is more concerned, the weight value of the final sound in the phoneme sequence needs to be set higher than the weight value of the initial consonant, and a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence, after the weight value of each phoneme type is determined, a corresponding weight value can be set for each reference phoneme in the reference phoneme-time sequence, for example, in the evaluation task of the sound of the final sound of more concerned, the weight value of the initial sound can be set to be 1.5, the weight value of the final sound is set to be 0.7, in the evaluation task of the sound of the final sound of more concerned, the weight value of the initial sound is set to be 0.8, the weight value of the final sound is set to be 1.3, and the like.
In a specific embodiment of the present application, following the above example, the weight value of the initial consonant is set to 1.3, the weight value of the final is set to 0.6 according to the evaluation strategy corresponding to the evaluation task, and the phonemes in the reference phoneme-time sequence are set, and the generated reference phoneme-time-weight sequence is shown in table 3 below.
TABLE 3 Table 3
Step 208: and calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence.
Objective speech quality assessment (Perceptual evaluation of speech quality, PESQ), which is an objective mean subjective opinion score (Mean opinion scores, MOS) value assessment method provided by the ITU-T p.862 recommendation. The method is one of the most widely-used objective voice evaluation methods, and the evaluation method can give a score of-0.5 to 4.5 and represents objective MOS distance between the audio to be evaluated and the reference audio.
In the application, the MOS distance corresponding to each phoneme is calculated according to the phoneme-time sequence to be evaluated and the reference phoneme-time sequence.
Specifically, calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence includes:
Performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after the phoneme alignment and the corresponding audio fragment of the corresponding phoneme in the reference phoneme-time sequence by using an objective voice quality evaluation method.
In practical applications, since the audio to be evaluated may be damaged during the processing after the audio to be evaluated is subjected to noise reduction or speech enhancement when extracting phonemes, and there may be errors between the phoneme recognition result and the real situation, it is necessary to perform phoneme alignment between the phoneme-time sequence to be evaluated and the reference phoneme-time sequence.
The specific implementation algorithm of the phoneme alignment has various possibilities, for example, when the phoneme sequence in the audio to be evaluated is consistent with the phoneme sequence in the reference audio, the distance is directly calculated; if the phoneme sequence in the audio to be evaluated has partial deletion and error, but the error or the deletion proportion is smaller than the threshold value, correcting the alignment result in a distance editing mode; if the difference between the phoneme sequence in the audio to be evaluated and the phoneme sequence of the reference audio is greater than a preset threshold, determining a corresponding relation in time according to the similarity of the audio, and further aligning the phoneme-time sequence to be evaluated with the reference phoneme-time sequence.
After the alignment of the phonemes, a phoneme distance between the phoneme-time sequence to be evaluated and the corresponding audio piece of the corresponding phoneme in the reference phoneme-time sequence can be obtained by PESQ.
In one embodiment of the present application, along with the above example, the phoneme-time sequence to be evaluated shown in table 1 is aligned with the reference phoneme-time sequence shown in table 2, and the distance between the phoneme-time sequence to be evaluated and each of the phonemes in the reference phoneme-time sequence is obtained by PESQ calculation, where "x-d" represents the phoneme distance between the phoneme x to be evaluated and the reference phoneme x.
Step 210: and determining a quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
After determining the weight information and the phoneme distance-time sequence corresponding to each reference phoneme, the quality evaluation score of the audio to be evaluated can be calculated according to the weight information and the phoneme distance-time sequence corresponding to each reference phoneme.
Specifically, determining the quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence includes:
Determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and the phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
In practical application, the factor distance of the corresponding phonemes in the reference phoneme-time-weight sequence and the phoneme distance-time sequence can be multiplied by the phoneme weight to be used as the phoneme score corresponding to the current phoneme, and then the quality evaluation score of the audio to be evaluated can be determined according to the phoneme score corresponding to each phoneme.
In a specific embodiment of the present application, along the above example, the weight value corresponding to the phoneme x is 1.3, the corresponding phoneme distance is x-d, the phoneme score corresponding to the phoneme x is 1.3 x (x-d), and so on, and finally, all the phoneme scores are added to determine the quality evaluation score of the audio to be evaluated.
According to the audio quality assessment method provided by the embodiment of the application, the audio to be assessed and the reference audio are obtained; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; the method for evaluating the audio quality provided by the application has the advantages that the steps are simplified, the use is convenient, only the audio level of the audio to be evaluated is calibrated, the phonemes are used as investigation weights according to a preset evaluation strategy, the quality evaluation score is corrected, and the quality of the audio to be evaluated can be more accurately represented according to application scenes.
In the following, an audio quality evaluation method according to an embodiment of the present application will be further explained with reference to fig. 3 and 4, fig. 3 shows a schematic diagram of the audio quality evaluation method according to an embodiment of the present application, where, as shown in fig. 3, a reference audio x and an audio y to be evaluated are obtained, a speech recognition is performed on the reference audio x to obtain a reference phoneme-time sequence, a speech recognition is performed on the audio y to be evaluated to obtain a phoneme-time sequence to be evaluated, a phoneme distance-time sequence d (t) is calculated according to the reference phoneme-time sequence and the phoneme-time sequence to be evaluated, a weight w (t) corresponding to a phoneme in the reference phoneme-time sequence is determined according to an evaluation task, and a quality evaluation score MOS of the audio to be evaluated is determined according to the phoneme distance-time sequence d (t) and the weight w (t) corresponding to a phoneme, thereby determining the quality of the audio to be evaluated.
Fig. 4 shows a flowchart of an audio quality assessment method applied to a spoken language evaluation scene according to an embodiment of the present application, where the audio quality assessment method is described by taking spoken language evaluation as an example, and includes steps 402 to 416.
Step 402: and acquiring the audio to be evaluated and the reference audio.
In the specific embodiment provided by the application, the audio to be evaluated is recorded by a mobile phone, namely, the audio to be evaluated is recorded by the mobile phone.
Step 404: and carrying out noise reduction treatment on the audio to be evaluated to obtain the audio to be evaluated after noise reduction.
In the specific embodiment provided by the application, the noise reduction treatment is carried out on the recorded 'eating the grape without spitting the grape skin' of the mobile phone, so that the noise and the noise in the recorded mobile phone are removed, and the quality of the audio to be evaluated is improved.
Step 406: and respectively extracting the phoneme sequences of the audio to be evaluated and the reference audio and the corresponding time of each phoneme according to a preset voice recognition method, and generating a phoneme-time sequence to be evaluated and a reference phoneme-time sequence.
In a specific embodiment provided by the application, the phoneme-time sequence to be evaluated of the audio to be evaluated and the reference phoneme-time sequence of the reference audio are respectively extracted according to a voice recognition technology.
Step 408: and determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type.
In the specific embodiment provided by the application, the emphasis on the oral evaluation is on the initial consonant, so that the weight value of the initial consonant is set to be 1.6, and the weight value of the final is set to be 0.8.
Step 410: and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
In the embodiment provided by the application, according to the weight value w (corresponding to the initial weight value of 1.6 and the final weight value of 0.8) corresponding to each phoneme, the corresponding relation w (t) between the pronunciation time t and the weight value w of each phoneme can be obtained.
Step 412: and carrying out phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence, and calculating the phoneme distance between the corresponding phoneme corresponding audio fragments in the phoneme-time sequence to be evaluated and the reference phoneme-time sequence after the phoneme alignment to obtain a phoneme distance-time sequence.
In the specific embodiment provided by the application, firstly, the phoneme-time sequence to be evaluated is time aligned with the reference phoneme-time sequence, then, the corresponding phoneme distance d in the phoneme-time sequence to be evaluated and the reference phoneme-time sequence is calculated according to the PESQ algorithm, and based on the corresponding relation between each phoneme and the time t, correspondingly, each phoneme distance d has the corresponding relation d (t) with the time t.
Step 414: and determining a phoneme score corresponding to each target time point according to the phoneme weight and the phoneme distance corresponding to each target time point of the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
In the specific embodiment provided by the application, according to the corresponding relation w (t) between the time t and the weight value w of each phoneme and the corresponding relation d (t) between the distance d of the phoneme and the time t, the phoneme score corresponding to each phoneme can be determined as w (t) x d (t).
Step 416: and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
In the specific embodiment provided by the application, the quality evaluation score of the audio to be evaluated is determined according to w (t) d (t) corresponding to each phoneme, and the higher the score of the quality evaluation score finally obtained, the better the quality of the audio to be evaluated.
According to the audio quality assessment method provided by the embodiment of the application, the audio to be assessed and the reference audio are obtained; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; the method for evaluating the audio quality provided by the application has the advantages that the steps are simplified, the use is convenient, only the audio level of the audio to be evaluated is calibrated, the phonemes are used as investigation weights according to a preset evaluation strategy, the quality evaluation score is corrected, and the quality of the audio to be evaluated can be more accurately represented according to application scenes.
Corresponding to the above method embodiment, the present application further provides an embodiment of an audio quality assessment apparatus, and fig. 5 shows a schematic structural diagram of an audio quality assessment apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
An acquisition module 502 configured to acquire audio to be evaluated and reference audio;
An extraction module 504 configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extract a reference phoneme-time sequence corresponding to the reference audio;
A setting module 506 configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generate a reference phoneme-time-weight sequence;
A calculating module 508 configured to calculate a phoneme distance between the phoneme-time sequence to be evaluated and a corresponding phoneme in the reference phoneme-time sequence, to obtain a phoneme distance-time sequence;
A determination module 510 is configured to determine a quality assessment score for the audio under assessment from the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
Optionally, the extracting module 504 is further configured to:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and the time corresponding to each phoneme to be evaluated according to a preset voice recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
Optionally, the extracting module 504 is further configured to:
Extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset voice recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
Optionally, the setting module 506 is further configured to:
Determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
And setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
Optionally, the computing module 508 is further configured to:
Performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after the phoneme alignment and the corresponding audio fragment of the corresponding phoneme in the reference phoneme-time sequence by using an objective voice quality evaluation method.
Optionally, the determining module 510 is further configured to:
Determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and the phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
Optionally, the apparatus further includes:
And the preprocessing module is configured to preprocess the audio to be evaluated to obtain preprocessed audio to be evaluated.
Optionally, the preprocessing module is further configured to perform noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
The audio quality assessment device provided by the embodiment of the application is characterized by acquiring the audio to be assessed and the reference audio; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; the audio quality assessment device provided by the application has the advantages that the steps are simplified, the use is convenient, only the audio level of the audio to be assessed is calibrated, the phonemes are used as investigation weights according to a preset assessment strategy, the quality assessment score is corrected, and the quality of the audio to be assessed can be more accurately represented according to an application scene.
The above is an exemplary scheme of an audio quality evaluation apparatus of the present embodiment. It should be noted that, the technical solution of the audio quality assessment device and the technical solution of the audio quality assessment method belong to the same concept, and details of the technical solution of the audio quality assessment device, which are not described in detail, can be referred to the description of the technical solution of the audio quality assessment method.
It should be noted that, the components in the apparatus claims should be understood as functional modules that are necessary to be established for implementing the steps of the program flow or the steps of the method, and the functional modules are not actually functional divisions or separate limitations. The device claims defined by such a set of functional modules should be understood as a functional module architecture for implementing the solution primarily by means of the computer program described in the specification, and not as a physical device for implementing the solution primarily by means of hardware.
In one embodiment, the application also provides a computing device, including a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the audio quality assessment method when executing the instructions.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the above-mentioned audio quality assessment method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned audio quality assessment method.
An embodiment of the application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the audio quality assessment method as described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the above-mentioned audio quality assessment method belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned audio quality assessment method.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium contains content that can be appropriately scaled according to the requirements of jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is subject to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the application disclosed above are intended only to assist in the explanation of the application. Alternative embodiments are not intended to be exhaustive or to limit the application to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and the full scope and equivalents thereof.

Claims (12)

1. An audio quality assessment method, comprising:
acquiring audio to be evaluated and reference audio;
extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio;
Setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generating a reference phoneme-time-weight sequence, wherein the preset evaluation strategy is an evaluation strategy determined according to a specific evaluation task;
Calculating the aligned phoneme distance between the to-be-evaluated phoneme-time sequence and a corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence, wherein before the calculating the aligned phoneme distance between the to-be-evaluated phoneme-time sequence and the corresponding phoneme in the reference phoneme-time sequence to obtain the phoneme distance-time sequence, the method further comprises the steps of: if the phoneme sequence in the audio to be evaluated has partial deletion and error, but the error or the deletion proportion is smaller than a threshold value, correcting the alignment result in a distance editing mode; if the difference between the phoneme sequence in the audio to be evaluated and the phoneme sequence of the reference audio is larger than a preset threshold, determining a corresponding relation in time according to the similarity of the audio, and aligning the phoneme-time sequence to be evaluated with the reference phoneme-time sequence;
And determining a quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
2. The audio quality assessment method according to claim 1, wherein extracting a phoneme-time sequence to be assessed corresponding to the audio to be assessed, comprises:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and the time corresponding to each phoneme to be evaluated according to a preset voice recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
3. The audio quality assessment method according to claim 1, wherein extracting a reference phoneme-time sequence corresponding to the reference audio comprises:
Extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset voice recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
4. The audio quality assessment method according to claim 3, wherein setting a corresponding weight value for each reference phoneme in said reference phoneme-time series according to a preset assessment policy, generating a reference phoneme-time-weight series, comprises:
Determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
And setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
5. The audio quality assessment method according to claim 1, wherein calculating the aligned phoneme distance of the phoneme-time series to be assessed from the corresponding phoneme in the reference phoneme-time series comprises:
Performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after the phoneme alignment and the corresponding audio fragment of the corresponding phoneme in the reference phoneme-time sequence by using an objective voice quality evaluation method.
6. The audio quality assessment method according to claim 1, wherein determining the quality assessment score of the audio to be assessed from the reference phoneme-time-weight sequence and the phoneme distance-time sequence comprises:
determining a phoneme score for each target time point according to the phoneme weight given by the reference phoneme-time-weight sequence and the phoneme distance given by the phoneme distance-time sequence at that time point;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
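
A minimal sketch of the scoring step in claim 6: each target time point contributes a phoneme score derived from its distance, scaled by its weight, and the overall score is the weight-normalized sum. The mapping from distance to a 0-100 score is an illustrative assumption; the patent does not fix a particular mapping.

```python
# Sketch: combine the phoneme distance-time sequence with the reference
# phoneme-time-weight sequence into a single quality assessment score.
def distance_to_score(d: float, scale: float = 2.0) -> float:
    """Illustrative mapping: distance 0 -> 100, larger distances decay toward 0."""
    return 100.0 / (1.0 + d / scale)

def quality_score(weighted_ref, distances) -> float:
    """weighted_ref: [(phoneme, start, end, weight), ...]
    distances:    [(phoneme, start, end, dist), ...]
    Both are assumed index-aligned on the same target time points."""
    num = den = 0.0
    for (_, _, _, w), (_, _, _, d) in zip(weighted_ref, distances):
        num += w * distance_to_score(d)   # per-time-point phoneme score, weighted
        den += w
    return num / den if den > 0 else 0.0
```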
7. The audio quality assessment method according to claim 1, further comprising, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated:
and preprocessing the audio to be evaluated to obtain preprocessed audio to be evaluated.
8. The audio quality assessment method according to claim 7, wherein preprocessing the audio to be assessed comprises:
and carrying out noise reduction processing and/or speech enhancement processing on the audio to be evaluated.
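
Claims 7 and 8 only require some noise reduction and/or speech enhancement before extraction. As one hedged example, the sketch below applies a high-pass filter (to suppress low-frequency rumble) and peak normalization with SciPy and NumPy; the 80 Hz cutoff is an arbitrary illustrative value, and a production system would typically use a dedicated denoiser or enhancement model instead.

```python
# Sketch of a very simple preprocessing pass: high-pass filtering plus peak normalization.
# This is only one possible instance of "noise reduction and/or speech enhancement".
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(audio: np.ndarray, sr: int, cutoff_hz: float = 80.0) -> np.ndarray:
    # 4th-order Butterworth high-pass to suppress low-frequency background rumble.
    b, a = butter(4, cutoff_hz, btype="highpass", fs=sr)
    filtered = filtfilt(b, a, audio)
    # Peak-normalize so later spectral comparisons are less level-dependent.
    peak = np.max(np.abs(filtered))
    return filtered / peak if peak > 0 else filtered
```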
9. An audio quality assessment apparatus, comprising:
An acquisition module configured to acquire audio to be evaluated and reference audio;
An extraction module configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and to extract a reference phoneme-time sequence corresponding to the reference audio;
A setting module configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy and generate a reference phoneme-time-weight sequence, wherein the preset evaluation strategy is an evaluation strategy determined according to a specific evaluation task;
A calculation module configured to calculate the phoneme distance between the aligned phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence, wherein before calculating the phoneme distance, the calculation module is further configured to: correct the alignment result by means of edit distance if the phoneme sequence of the audio to be evaluated contains partial deletions or errors but the proportion of errors or deletions is smaller than a threshold value; and determine the temporal correspondence according to audio similarity and align the phoneme-time sequence to be evaluated with the reference phoneme-time sequence if the difference between the phoneme sequence of the audio to be evaluated and the phoneme sequence of the reference audio is larger than a preset threshold value;
A determination module configured to determine the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
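
The apparatus of claim 9 mirrors the method steps as modules. The loose Python sketch below shows that composition, wiring together the kinds of functions sketched above; every class and parameter name here is an illustrative stand-in, not part of the patent.

```python
# Sketch: the apparatus of claim 9 as a thin composition of the earlier sketches.
class AudioQualityAssessor:
    def __init__(self, recognizer, weighter, aligner, distance_fn, scorer):
        self.recognizer = recognizer    # extraction module
        self.weighter = weighter        # setting module
        self.aligner = aligner          # alignment inside the calculation module
        self.distance_fn = distance_fn  # calculation module
        self.scorer = scorer            # determination module

    def assess(self, eval_audio, ref_audio, sr):
        eval_seq = self.recognizer(eval_audio, sr)             # phoneme-time sequence to be evaluated
        ref_seq = self.recognizer(ref_audio, sr)               # reference phoneme-time sequence
        weighted_ref = self.weighter(ref_seq)                  # reference phoneme-time-weight sequence
        pairs = self.aligner(eval_seq, ref_seq)                # aligned phoneme pairs
        distances = self.distance_fn(eval_audio, ref_audio, sr, pairs)  # phoneme distance-time sequence
        return self.scorer(weighted_ref, distances)            # quality evaluation score
```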
10. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method of any of claims 1-8.
11. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
12. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1-8.
CN202011540097.7A 2020-12-23 2020-12-23 Audio quality assessment method and device Active CN112614510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011540097.7A CN112614510B (en) 2020-12-23 2020-12-23 Audio quality assessment method and device

Publications (2)

Publication Number Publication Date
CN112614510A CN112614510A (en) 2021-04-06
CN112614510B true CN112614510B (en) 2024-04-30

Family

ID=75244507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011540097.7A Active CN112614510B (en) 2020-12-23 2020-12-23 Audio quality assessment method and device

Country Status (1)

Country Link
CN (1) CN112614510B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450429B (en) * 2021-07-26 2024-06-04 北京猿力未来科技有限公司 Track drawing method and device
CN117612566B (en) * 2023-11-16 2024-05-28 书行科技(北京)有限公司 Audio quality assessment method and related product
CN117409778B (en) * 2023-12-14 2024-03-19 深圳市友杰智新科技有限公司 Decoding processing method, device, equipment and storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334164A (en) * 2002-10-24 2004-11-25 Toshimasa Ishihara System for learning pronunciation and identification of English phonemes "l" and "r"
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 Text-related spoken English pronunciation error detection and quality scoring method
CN108257615A (en) * 2018-01-15 2018-07-06 北京物灵智能科技有限公司 User language assessment method and system
CN109545243A (en) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluation method, device, electronic equipment and storage medium
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluation method, device, electronic equipment and storage medium
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 Speech analysis method, device and storage medium
CN109754784A (en) * 2017-11-02 2019-05-14 华为技术有限公司 Method for training a filtering model and speech recognition method
CN110085261A (en) * 2019-05-16 2019-08-02 上海流利说信息技术有限公司 Pronunciation correction method, apparatus, equipment and computer-readable storage medium
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio processing method, device, system, storage medium, terminal and server
CN110176249A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 Spoken pronunciation assessment method and device
CN110288977A (en) * 2019-06-29 2019-09-27 联想(北京)有限公司 Data processing method, device and electronic equipment
JP2020144213A (en) * 2019-03-06 2020-09-10 Kddi株式会社 Program, device and method for pronunciation evaluation using inter-model distance
CN111785299A (en) * 2020-08-13 2020-10-16 腾讯科技(深圳)有限公司 Voice evaluation method, device, equipment and computer storage medium
CN111816210A (en) * 2020-06-23 2020-10-23 华为技术有限公司 Voice scoring method and device
CN112017690A (en) * 2020-10-09 2020-12-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6400936B2 (en) * 2014-04-21 2018-10-03 シノイースト・コンセプト・リミテッド Voice search method, voice search device, and program for voice search device

Also Published As

Publication number Publication date
CN112614510A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112614510B (en) Audio quality assessment method and device
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN111260761B (en) Method and device for generating mouth shape of animation character
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN112233680B (en) Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium
CN112331180A (en) Spoken language evaluation method and device
CN112735371B (en) Method and device for generating speaker video based on text information
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN116631412A (en) Method for judging voice robot through voiceprint matching
CN112686041B (en) Pinyin labeling method and device
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN112967711B (en) Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN114125506B (en) Voice auditing method and device
CN111785299B (en) Voice evaluation method, device, equipment and computer storage medium
CN112597889A (en) Emotion processing method and device based on artificial intelligence
CN113160796B (en) Language identification method, device and equipment for broadcast audio and storage medium
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN114724547A (en) Method and system for identifying accent English
CN112908361A (en) Spoken language pronunciation evaluation system based on small granularity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant