CN112614510A - Audio quality evaluation method and device - Google Patents

Audio quality evaluation method and device

Info

Publication number
CN112614510A
CN112614510A (application CN202011540097.7A)
Authority
CN
China
Prior art keywords
phoneme
audio
evaluated
time
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011540097.7A
Other languages
Chinese (zh)
Other versions
CN112614510B (en)
Inventor
元海明
刘鲁鹏
王晓红
陈佳路
高强
夏龙
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202011540097.7A
Publication of CN112614510A
Application granted
Publication of CN112614510B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an audio quality evaluation method and an audio quality evaluation device. The audio quality evaluation method comprises the following steps: acquiring audio to be evaluated and reference audio; extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio; setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; and determining the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. By calibrating the audio at the phoneme level and using the phonemes as examination weights to correct the quality evaluation score, the method characterizes the quality of the audio to be evaluated more accurately.

Description

Audio quality evaluation method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to an audio quality assessment method and apparatus, a computing device, and a computer-readable storage medium.
Background
In recent years, with the rapid development of networks, various audio processing technologies and audio transmission technologies have emerged, and since the subjective feelings of communication users and consumers about audio ultimately depend on the audio quality, the evaluation of audio quality becomes an important research topic.
At present, one of the most widely used objective speech evaluation methods is Perceptual Evaluation of Speech Quality (PESQ), which gives a score from -0.5 to 4.5 representing the objective MOS distance between a test audio and a comparison audio. However, what constitutes good quality differs across application scenarios, and effective audio feature segments may be erroneously removed when the test audio is preprocessed; in such scenarios the PESQ method evaluates audio quality poorly and cannot give an accurate score.
Therefore, how to solve the above technical problems has become an urgent issue for those skilled in the art.
Disclosure of Invention
In view of the above, embodiments of the present application provide an audio quality assessment method and apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defects in the prior art.
According to a first aspect of embodiments of the present application, there is provided an audio quality assessment method, including:
acquiring audio to be evaluated and reference audio;
extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio;
setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence;
calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
and determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
Optionally, the extracting of the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated includes:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
Optionally, extracting a reference phoneme-time sequence corresponding to the reference audio includes:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
Optionally, the setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation policy to generate a reference phoneme-time-weight sequence includes:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
Optionally, the calculating a phoneme distance between the phoneme-time series to be evaluated and a corresponding phoneme in the reference phoneme-time series includes:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
Optionally, determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence includes:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
Optionally, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the method further includes:
and preprocessing the audio to be evaluated to obtain the preprocessed audio to be evaluated.
Optionally, the pre-processing the audio to be evaluated includes:
and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
According to a second aspect of embodiments of the present application, there is provided an audio quality evaluation apparatus including:
the acquisition module is configured to acquire audio to be evaluated and reference audio;
the extraction module is configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and extract a reference phoneme-time sequence corresponding to the reference audio;
a setting module configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generate a reference phoneme-time-weight sequence;
a calculation module configured to calculate a phoneme distance between the phoneme-time series to be evaluated and a corresponding phoneme in the reference phoneme-time series to obtain a phoneme distance-time series;
a determination module configured to determine a quality assessment score for the audio to be assessed based on the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the audio quality assessment method when executing the instructions.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the audio quality assessment method.
According to the audio quality evaluation method provided by the embodiments of the present application, the audio to be evaluated and the reference audio are acquired; a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio are extracted; a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence is calculated to obtain a phoneme distance-time sequence; and the quality evaluation score of the audio to be evaluated is determined according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The method is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flow chart of an audio quality assessment method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an audio quality assessment method provided by an embodiment of the present application;
FIG. 4 is a flowchart of an audio quality assessment method applied to a spoken language evaluation scenario according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an audio quality evaluation apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of this application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.
Automatic Speech Recognition (ASR): a technology for converting human speech into text.
Perceptual Evaluation of Speech Quality (PESQ): an objective speech quality assessment method that provides an objective MOS value.
Mean Opinion Score (MOS): in international standards, MOS values are used uniformly to evaluate voice quality.
Phoneme: the smallest phonetic unit, divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, one action forming one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable ā (a) contains only one phoneme, while a syllable such as pí contains two phonemes.
In the present application, an audio quality assessment method and apparatus, a computing device, and a computer-readable storage medium are provided, which are described in detail in the following embodiments one by one.
FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes an access device 140 that enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the audio quality assessment method shown in fig. 2. Fig. 2 shows a flow chart of an audio quality assessment method according to an embodiment of the present application, comprising steps 202 to 210.
Step 202: and acquiring the audio to be evaluated and the reference audio.
The audio quality evaluation method provided by the application has a plurality of application scenes, such as spoken language testing, audio matching and the like, and the specific application scene of the audio quality evaluation method is not limited in the application.
The audio to be evaluated is the audio whose quality needs to be assessed, such as the speech of a test taker in a spoken-language test, or the audio to be matched in an audio-matching task.
Correspondingly, the reference audio is a standard audio. In practical applications, the higher the score of the audio to be evaluated against the reference audio, the higher the quality of the audio to be evaluated.
In a specific embodiment provided by the present application, taking an audio matching scenario as an example, the audio to be evaluated is a mobile phone recording of the sentence "xiǎo bái chī zǎo fàn" ("Xiao Bai eats breakfast"), and the reference audio is the standard audio of the same sentence.
Step 204: and extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting the reference phoneme-time sequence corresponding to the reference audio.
Optionally, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the method further includes:
and preprocessing the audio to be evaluated to obtain the preprocessed audio to be evaluated.
Specifically, the preprocessing the audio to be evaluated includes: and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
In practical applications, the obtained audio to be evaluated is generally not pure: it contains noise and other interference that can degrade audio quality. The audio to be evaluated therefore needs to be preprocessed. There are many preprocessing methods, such as noise reduction and speech enhancement, and at least one of them may be selected to perform the corresponding preprocessing on the audio to be processed.
When the audio to be evaluated is preprocessed, a traditional audio processing method may be used, such as the noise reduction module of Web Real-Time Communication (WebRTC), or an audio processing model based on a deep neural network may be used instead.
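As an illustration only (the text above names WebRTC's noise-reduction module or a deep neural network model but gives no implementation), the sketch below uses a simple high-pass filter as a stand-in preprocessing step; the 80 Hz cutoff and 16 kHz sample rate are assumptions.

```python
# Minimal preprocessing sketch. This is NOT the WebRTC noise-reduction module the
# text refers to; it only illustrates where a denoising/enhancement step would sit.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Remove low-frequency rumble below an assumed 80 Hz cutoff before evaluation."""
    b, a = butter(4, 80, btype="highpass", fs=sample_rate)
    return filtfilt(b, a, audio)
```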
In a specific embodiment provided by the present application, following the above example, the audio to be evaluated is a mobile phone recording of "xiǎo bái chī zǎo fàn". While the user was recording with the mobile phone, a car horn was captured at the same time, so the audio to be evaluated needs noise reduction to remove the horn noise, and speech enhancement is applied to the "xiǎo bái chī zǎo fàn" audio at the same time.
Specifically, the extracting of the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated includes:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
In practical applications, after obtaining the audio to be evaluated, the audio to be evaluated may be subjected to speech recognition through a preset speech recognition technology, and the automatic speech recognition technology may convert speech into text, where the text may be of many types, such as binary code, character sequence, phoneme sequence, and the like.
Phonemes are the smallest units of speech, divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable, one action forming one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable ā (a) contains only one phoneme, while a syllable such as pí contains two phonemes.
After the audio to be evaluated is processed by a speech recognition technology, the phoneme sequence corresponding to the audio to be evaluated can be extracted. For example, if the audio to be evaluated is "the weather is nice today" (jīn tiān tiān qì bù cuò), the phoneme sequence "j, in1, t, i, an1, t, i, an1, q, i4, b, u2, c, uo4" can be extracted, where the digit 1 in the phoneme "in1" represents the tone.
In practical applications, after the audio to be evaluated has undergone noise reduction or speech enhancement, the audio may be damaged during processing, so the phoneme recognition result may contain errors relative to the real situation, such as phoneme loss or phoneme substitution; for example, the real phonemes "j, in1, t, i, an1" may be recognized as "j, in1, t, i, ao1".
The speech recognition technology can further align phonemes with the audio in time; for example, the phoneme "j" corresponds to the 10th to 12th milliseconds of the audio to be evaluated, and the phoneme "in1" corresponds to the 13th to 16th milliseconds. In practical applications, the phoneme-time alignment can also be expressed by dividing the audio into fixed time slots, for example one slot per millisecond, so that the phonemes corresponding to the 10th to 16th milliseconds are "j, j, j, in1, in1, in1, in1". The specific form of the phoneme-time sequence to be evaluated is not limited in this application and depends on the practical application.
In a specific embodiment provided by the present application, following the above example, the audio to be evaluated is "xiǎo bái chī zǎo fàn" ("Xiao Bai eats breakfast"). Its phonemes are determined by the speech recognition technology to be "x, i, ao3, b, ao2, ch, i1, z, ao3, f, an4", and the generated phoneme-time sequence to be evaluated is shown in Table 1 below.
TABLE 1
Phoneme to be evaluated Time (millisecond)
x 5-200
i 201-320
ao3 321-500
b 501-600
ao2 601-740
ch 741-889
i1 890-1200
z 1201-1396
ao3 1397-1525
f 1526-1650
an4 1651-1950
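To make the data structure concrete, the following sketch encodes Table 1 as a list of (phoneme, start, end) triples and shows the per-millisecond expansion described earlier; this representation is an illustrative assumption rather than a format prescribed by the application.

```python
# Phoneme-time sequence for the audio to be evaluated, transcribed from Table 1.
# Times are in milliseconds.
eval_phoneme_time = [
    ("x", 5, 200), ("i", 201, 320), ("ao3", 321, 500), ("b", 501, 600),
    ("ao2", 601, 740), ("ch", 741, 889), ("i1", 890, 1200), ("z", 1201, 1396),
    ("ao3", 1397, 1525), ("f", 1526, 1650), ("an4", 1651, 1950),
]

def expand_per_millisecond(seq):
    """Alternative per-millisecond view: one phoneme label per 1 ms slot."""
    frames = {}
    for phoneme, start, end in seq:
        for t in range(start, end + 1):
            frames[t] = phoneme
    return frames

frames = expand_per_millisecond(eval_phoneme_time)
print(frames[5], frames[1950])  # 'x' 'an4'
```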
Specifically, the extracting of the reference phoneme-time sequence corresponding to the reference audio includes:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
The method for extracting the reference phoneme-time sequence corresponding to the reference audio is the same as the method for extracting the phoneme-time sequence to be evaluated, and therefore, the specific method for extracting the reference phoneme-time sequence is described in the above description of extracting the phoneme-time sequence to be evaluated, and is not described herein again.
In one embodiment provided by the present application, following the above example, the reference audio is "xiǎo bái chī zǎo fàn" ("Xiao Bai eats breakfast"). Its phonemes are determined by the speech recognition technology to be "x, i, ao3, b, ai2, ch, i1, z, ao3, f, an4", and the generated reference phoneme-time sequence is shown in Table 2 below.
TABLE 2
Reference phoneme Time (millisecond)
x 7-202
i 203-334
ao3 335-514
b 515-620
ai2 621-760
ch 761-909
i1 910-1211
z 1212-1407
ao3 1407-1540
f 1541-1669
an4 1670-1987
Step 206: and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence.
The preset evaluation strategy is an evaluation strategy determined according to a specific evaluation task, and in order to adapt to different evaluation tasks, different weight values need to be set for each reference phoneme, and a corresponding reference phoneme-time-weight sequence is generated, wherein the reference phoneme-time-weight sequence includes pronunciation time and weight value corresponding to each reference phoneme.
Specifically, the setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation policy to generate a reference phoneme-time-weight sequence includes:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
In some evaluation tasks in practical application, if the pronunciation of the initials is of more concern, the weight values of the initials in the phoneme sequence need to be set higher than those of the finals; if the pronunciation of the finals is of more concern, the weight values of the finals need to be set higher than those of the initials. After the weight value of each phoneme type has been determined, a corresponding weight value can be set for each reference phoneme in the reference phoneme-time sequence. For example, in an evaluation task that focuses on the pronunciation of the initials, the weight value of the initials can be set to 1.5 and the weight value of the finals to 0.7; in an evaluation task that focuses on the pronunciation of the finals, the weight value of the initials can instead be set lower than that of the finals.
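The sketch below shows one way to assign weights by phoneme type. The initial/final classification relies on the notation of Tables 1 and 2 (finals carry a tone digit, initials do not); this heuristic and the helper names are illustrative assumptions, while the 1.5/0.7 weights match the initial-focused example above.

```python
# Assign a weight to each reference phoneme according to a preset evaluation strategy.
def phoneme_type(phoneme: str) -> str:
    # In the tables above, finals carry a tone digit ("ao3", "an4"); initials do not ("x", "ch").
    return "final" if any(ch.isdigit() for ch in phoneme) else "initial"

def build_weight_sequence(ref_phoneme_time, strategy):
    """Turn a reference phoneme-time sequence into a reference phoneme-time-weight sequence."""
    return [(p, start, end, strategy[phoneme_type(p)]) for p, start, end in ref_phoneme_time]

# Example strategy for a task that focuses on the initials: initials 1.5, finals 0.7.
strategy = {"initial": 1.5, "final": 0.7}
ref_phoneme_time = [("x", 7, 202), ("ao3", 335, 514), ("b", 515, 620)]  # excerpt of Table 2
print(build_weight_sequence(ref_phoneme_time, strategy))
# [('x', 7, 202, 1.5), ('ao3', 335, 514, 0.7), ('b', 515, 620, 1.5)]
```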
In a specific embodiment provided by the present application, following the above example, the weight value of the initials is set to 1.3 and the weight value of the finals to 0.6 according to the evaluation strategy corresponding to the evaluation task. Corresponding weight values are set for the phonemes in the reference phoneme-time sequence, and the generated reference phoneme-time-weight sequence is shown in Table 3 below.
TABLE 3
[Table 3 is reproduced as images in the original publication; it lists each reference phoneme of Table 2 together with its time range and its assigned weight value (1.3 for initials, 0.6 for finals).]
Step 208: and calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence.
Perceptual Evaluation of Speech Quality (PESQ) is an objective Mean Opinion Score (MOS) evaluation method specified in ITU-T Recommendation P.862. It is currently one of the most widely used objective speech evaluation methods; it gives a score from -0.5 to 4.5 representing the objective MOS distance between the audio to be evaluated and the reference audio.
In the present application, the MOS distance corresponding to each phoneme is calculated from the phoneme-time series to be evaluated and the reference phoneme-time series.
Specifically, the calculating of the phoneme distance between the phoneme-time series to be evaluated and the corresponding phoneme in the reference phoneme-time series includes:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
In practical applications, when the phoneme is extracted from the audio to be evaluated, the audio to be evaluated may be damaged in the processing process after the noise reduction or speech enhancement processing, which may cause an error between the result of the phoneme recognition and the real situation, and therefore, it is necessary to perform phoneme alignment between the phoneme-time series to be evaluated and the reference phoneme-time series.
There are many possible implementations of the phoneme alignment. When the phoneme sequence of the audio to be evaluated is consistent with the phoneme sequence of the reference audio, the distance can be calculated directly. If the phoneme sequence of the audio to be evaluated has partial deletions or errors, but the error or deletion ratio is below a threshold, the alignment result can be corrected by edit distance. If the difference between the phoneme sequence of the audio to be evaluated and that of the reference audio exceeds a preset threshold, the correspondence in time is determined according to the similarity of the audio, and the phoneme-time sequence to be evaluated is aligned with the reference phoneme-time sequence accordingly.
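The application does not fix an alignment algorithm; as one concrete stand-in for the edit-distance correction mentioned above, the sketch below pairs corresponding phonemes using Python's difflib.SequenceMatcher. The pairing rule for substituted spans is an assumption.

```python
# Align the reference phoneme labels with the (possibly erroneous) evaluated labels.
from difflib import SequenceMatcher

def align_phonemes(ref_labels, eval_labels):
    """Return (ref_index, eval_index) pairs for phonemes that correspond after alignment."""
    matcher = SequenceMatcher(None, ref_labels, eval_labels, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            # Matched or substituted spans of equal length pair up one-to-one;
            # inserted or deleted phonemes are left unpaired.
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs

ref_labels  = ["x", "i", "ao3", "b", "ai2", "ch"]
eval_labels = ["x", "i", "ao3", "b", "ao2", "ch"]   # "ai2" recognized as "ao2"
print(align_phonemes(ref_labels, eval_labels))
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```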
After the phoneme alignment, the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme audio segment in the reference phoneme-time sequence can be obtained through PESQ.
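A possible way to compute the per-phoneme distance is sketched below using the open-source `pesq` package; the package choice is an assumption, since the application only names the PESQ method. PESQ also expects segments of a certain minimum length, so very short phoneme segments may need surrounding context or padding in practice, a detail not addressed here.

```python
# Per-phoneme PESQ "distance": score the aligned audio segments of each phoneme pair.
from pesq import pesq  # open-source PESQ implementation (pip install pesq)

def phoneme_distances(ref_audio, eval_audio, ref_seq, eval_seq, pairs, sr=16000):
    """ref_seq/eval_seq are (phoneme, start_ms, end_ms) lists; pairs index into them."""
    distances = []
    for ri, ei in pairs:
        ph, r_start, r_end = ref_seq[ri]
        _, e_start, e_end = eval_seq[ei]
        ref_seg = ref_audio[int(r_start * sr / 1000):int(r_end * sr / 1000)]
        eval_seg = eval_audio[int(e_start * sr / 1000):int(e_end * sr / 1000)]
        # PESQ returns a score in roughly [-0.5, 4.5]; it is used here as the
        # per-phoneme MOS distance d for this phoneme.
        distances.append((ph, pesq(sr, ref_seg, eval_seg, "wb")))
    return distances
```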
In a specific embodiment provided by the present application, the phoneme-time series to be evaluated shown in table 1 and the reference phoneme-time series shown in table 2 are subjected to phoneme alignment, and the distance between the phoneme-time series to be evaluated and each phoneme in the reference phoneme-time series is obtained through PESQ calculation, and the obtained phoneme distance-time series is ("x-d, i-d, ao3-d, b-d, ai2-d, ch-d, i1-d, z-d, ao3-d, f-d, an 4-d"), where "x-d" represents the phoneme distance between the phoneme x to be evaluated and the reference phoneme x.
Step 210: and determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
After the weight information and phoneme distance-time sequence corresponding to each reference phoneme are determined, the quality evaluation score of the audio to be evaluated can be calculated according to the weight information and phoneme distance-time sequence corresponding to each reference phoneme.
Specifically, determining the quality assessment score of the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence includes:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
In practical application, for each phoneme, the phoneme distance from the phoneme distance-time sequence and the phoneme weight of the corresponding phoneme from the reference phoneme-time-weight sequence can be combined into the phoneme score of the current phoneme, and the quality evaluation score of the audio to be evaluated can then be determined from the phoneme score of each phoneme.
In a specific embodiment provided by the present application, following the above example, if the weight value corresponding to the phoneme x is 1.3, and the corresponding phoneme distance is x-d, the phoneme score corresponding to the phoneme x is 1.3 × (x-d), and so on, and finally all the phoneme scores are added, so as to determine the quality evaluation score of the audio to be evaluated.
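Bringing the pieces together, the following sketch computes the final score exactly as described above (weight times distance per phoneme, then summed over all phonemes); the per-phoneme distance values are placeholders, not results from the embodiment.

```python
# Final quality score: sum of weight * distance over all aligned phonemes.
def quality_score(weights, distances):
    """weights/distances: dicts keyed by phoneme label (or phoneme position)."""
    return sum(w * distances[k] for k, w in weights.items())

# Hypothetical per-phoneme PESQ distances, for illustration only.
weights   = {"x": 1.3, "i": 0.6, "ao3": 0.6}
distances = {"x": 3.8, "i": 3.5, "ao3": 2.9}
print(quality_score(weights, distances))  # 1.3*3.8 + 0.6*3.5 + 0.6*2.9, approximately 8.78
```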
According to the audio quality evaluation method provided by the embodiments of the present application, the audio to be evaluated and the reference audio are acquired; a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio are extracted; a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence is calculated to obtain a phoneme distance-time sequence; and the quality evaluation score of the audio to be evaluated is determined according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The method is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
The audio quality assessment method provided by an embodiment of the present application is further explained below with reference to FIG. 3 and FIG. 4. FIG. 3 shows a schematic diagram of the audio quality assessment method. As shown in FIG. 3, a reference audio x and an audio y to be assessed are obtained; speech recognition is performed on the reference audio x to obtain a reference phoneme-time sequence, and on the audio y to be assessed to obtain a phoneme-time sequence to be assessed; a phoneme distance-time sequence d(t) is calculated from the reference phoneme-time sequence and the phoneme-time sequence to be assessed; the weights w(t) corresponding to the phonemes in the reference phoneme-time sequence are determined according to the assessment task; and the quality assessment score MOS of the audio to be assessed is calculated from the phoneme distance-time sequence d(t) and the phoneme weights w(t), thereby determining the quality of the audio to be assessed.
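Written as a formula (the application describes the procedure but gives no explicit closed form), the combination shown in FIG. 3 amounts to Score = Σ_t w(t) · d(t), summed over the target time points t, where w(t) is the phoneme weight at time t and d(t) is the corresponding phoneme distance.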
Fig. 4 shows a flowchart of an audio quality assessment method applied to a spoken language assessment scenario, where the audio quality assessment method is described by taking spoken language assessment as an example, and includes steps 402 to 416.
Step 402: and acquiring the audio to be evaluated and the reference audio.
In the specific embodiment provided by the present application, the obtained audio to be evaluated is a mobile phone recording of the tongue twister "chī pútáo bù tǔ pútáo pí" ("eat grapes without spitting out the grape skins"), and the reference audio is the standard evaluation audio of the same sentence.
Step 404: and carrying out noise reduction processing on the audio to be evaluated to obtain the audio to be evaluated after noise reduction.
In the specific embodiment provided by the present application, noise reduction is performed on the mobile phone recording of "chī pútáo bù tǔ pútáo pí" to remove noise from the recording and improve the quality of the audio to be evaluated.
Step 406: and respectively extracting the phoneme sequences of the audio to be evaluated and the reference audio and the time corresponding to each phoneme according to a preset speech recognition method, and generating a phoneme-time sequence to be evaluated and a reference phoneme-time sequence.
In the specific embodiment provided by the present application, the phoneme-time series to be evaluated of the audio to be evaluated and the reference phoneme-time series of the reference audio are respectively extracted according to a speech recognition technology.
Step 408: and determining the weight value of each phoneme type according to a preset evaluation strategy and the phoneme type.
In the specific embodiment provided by the present application, the spoken-language evaluation focuses on the initials, so the weight value of the initials is set to 1.6 and the weight value of the finals to 0.8.
Step 410: and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
In the specific embodiment provided by the present application, according to the weight value w corresponding to each phoneme (the initial weight 1.6 and the final weight 0.8 set in the previous step), the correspondence w(t) between the pronunciation time t of each phoneme and its weight value w can be obtained.
Step 412: and performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence, and calculating phoneme distances between the phoneme-time sequence to be evaluated and the corresponding audio segments in the reference phoneme-time sequence after the phoneme alignment to obtain a phoneme distance-time sequence.
In the specific embodiment provided by the present application, the phoneme-time sequence to be evaluated and the reference phoneme-time sequence are time-aligned, and the phoneme distance d between corresponding phonemes in the two sequences is then calculated according to the PESQ algorithm. Since each phoneme corresponds to a time t, each phoneme distance d likewise has a correspondence d(t) with time t.
Step 414: and determining the phoneme score corresponding to the target time point according to the phoneme weight and the phoneme distance corresponding to each target time point in the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
In the specific embodiment provided by the present application, according to the correspondence w(t) between the time t and the weight value w of each phoneme and the correspondence d(t) between the phoneme distance d and the time t, the phoneme score corresponding to each phoneme can be determined as w(t)·d(t).
Step 416: and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
In the specific embodiment provided by the present application, the quality assessment score of the audio to be assessed is determined from the w(t)·d(t) values of all phonemes; the higher the final quality assessment score, the better the quality of the audio to be assessed.
According to the audio quality evaluation method provided by the embodiments of the present application, the audio to be evaluated and the reference audio are acquired; a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio are extracted; a corresponding weight value is set for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence is calculated to obtain a phoneme distance-time sequence; and the quality evaluation score of the audio to be evaluated is determined according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The method is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
Corresponding to the above method embodiment, the present application further provides an audio quality assessment apparatus embodiment, and fig. 5 shows a schematic structural diagram of an audio quality assessment apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
an obtaining module 502 configured to obtain an audio to be evaluated and a reference audio;
an extracting module 504, configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extract a reference phoneme-time sequence corresponding to the reference audio;
a setting module 506 configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation policy, and generate a reference phoneme-time-weight sequence;
a calculating module 508 configured to calculate a phoneme distance between the phoneme-time series to be evaluated and a corresponding phoneme in the reference phoneme-time series to obtain a phoneme distance-time series;
a determining module 510 configured to determine a quality assessment score for the audio to be assessed according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
Optionally, the extracting module 504 is further configured to:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
Optionally, the extracting module 504 is further configured to:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
Optionally, the setting module 506 is further configured to:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
Optionally, the calculating module 508 is further configured to:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
Optionally, the determining module 510 is further configured to:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
Optionally, the apparatus further comprises:
and the preprocessing module is configured to preprocess the audio to be evaluated to obtain the preprocessed audio to be evaluated.
Optionally, the preprocessing module is further configured to perform noise reduction processing and/or speech enhancement processing on the audio to be evaluated.
The audio quality evaluation apparatus provided by the embodiments of the present application acquires the audio to be evaluated and the reference audio; extracts a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and a reference phoneme-time sequence corresponding to the reference audio; sets a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence; calculates the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence; and determines the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence. The apparatus is simple in its steps and convenient to use: the audio to be evaluated only needs to be calibrated at the phoneme level, the phonemes are weighted according to a preset evaluation strategy, and the quality evaluation score is corrected accordingly, so that the quality of the audio to be evaluated can be characterized more accurately for the given application scenario.
The above is a schematic scheme of an audio quality evaluation apparatus of the present embodiment. It should be noted that the technical solution of the audio quality estimation apparatus and the technical solution of the audio quality estimation method belong to the same concept, and details that are not described in detail in the technical solution of the audio quality estimation apparatus can be referred to the description of the technical solution of the audio quality estimation method.
It should be noted that the components in the device claims should be understood as the functional modules that are necessary for implementing the steps of the program flow or the steps of the method; they are not necessarily actual functional divisions or separations. A device claim defined by such a set of functional modules should be understood as a functional-module framework that implements the solution mainly through the computer program described in the specification, rather than as a physical device that implements the solution mainly through hardware.
There is also provided in an embodiment of the present application a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the audio quality assessment method when executing the instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the audio quality assessment method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the audio quality assessment method.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the audio quality assessment method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned audio quality evaluation method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned audio quality evaluation method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (11)

1. An audio quality assessment method, comprising:
acquiring audio to be evaluated and reference audio;
extracting a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, and extracting a reference phoneme-time sequence corresponding to the reference audio;
setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy to generate a reference phoneme-time-weight sequence;
calculating the phoneme distance between the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
and determining a quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
2. The audio quality assessment method of claim 1, wherein extracting the phone-time series to be assessed corresponding to the audio to be assessed comprises:
extracting a phoneme sequence to be evaluated corresponding to the audio to be evaluated and time corresponding to each phoneme to be evaluated according to a preset speech recognition method;
and generating a phoneme-time sequence to be evaluated according to the phoneme sequence to be evaluated and the time corresponding to each phoneme to be evaluated.
3. The audio quality assessment method of claim 1, wherein extracting the reference phoneme-time series corresponding to the reference audio comprises:
extracting a reference phoneme sequence corresponding to the reference audio and time corresponding to each reference phoneme according to a preset speech recognition method;
and generating a reference phoneme-time sequence according to the reference phoneme sequence and the time corresponding to each reference phoneme.
4. The audio quality assessment method of claim 3, wherein setting a corresponding weight value for each reference phoneme in the reference phoneme-time series according to a preset assessment strategy to generate a reference phoneme-time-weight series comprises:
determining a weight value of each phoneme type according to a preset evaluation strategy and the phoneme type;
and setting a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to the weight value of each phoneme type, and generating a reference phoneme-time-weight sequence.
5. The audio quality assessment method of claim 1, wherein calculating the phone distance of the phone-time series to be assessed from the corresponding phone in the reference phone-time series comprises:
performing phoneme alignment on the phoneme-time sequence to be evaluated and the reference phoneme-time sequence;
and calculating the phoneme distance between the phoneme-time sequence to be evaluated after phoneme alignment and the audio fragment corresponding to the corresponding phoneme in the reference phoneme-time sequence by an objective speech quality evaluation method.
6. The audio quality assessment method of claim 1, wherein determining a quality assessment score for the audio to be assessed based on the reference phoneme-time-weight sequence and the phoneme distance-time sequence comprises:
determining a phoneme score corresponding to each target time point according to the reference phoneme-time-weight sequence and the phoneme weight and phoneme distance corresponding to each target time point of the phoneme distance-time sequence;
and determining the quality evaluation score of the audio to be evaluated according to the phoneme score corresponding to each target time point.
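
To make the per-time-point combination of claim 6 concrete, the sketch below looks up a weight and a distance at each target time point and maps them to a score; the span representation and the distance-to-score mapping are assumptions chosen for illustration only.

    # Illustrative sketch; the span lookup and distance-to-score mapping are assumptions.
    from typing import List, Tuple

    def value_at(t: float, spans: List[Tuple[float, float, float]], default: float = 0.0) -> float:
        """Return the value of the (start, end, value) span covering time t."""
        for start, end, value in spans:
            if start <= t < end:
                return value
        return default

    def quality_score(weight_spans, distance_spans, target_times) -> float:
        phoneme_scores, weights = [], []
        for t in target_times:
            w = value_at(t, weight_spans)
            d = value_at(t, distance_spans)
            phoneme_scores.append(w * max(0.0, 1.0 - d))   # per-time-point phoneme score
            weights.append(w)
        return 100.0 * sum(phoneme_scores) / sum(weights) if sum(weights) else 0.0

    # Toy usage: weight spans derived from the reference phoneme-time-weight sequence,
    # distance spans derived from the phoneme distance-time sequence.
    weight_spans = [(0.0, 0.5, 2.0), (0.5, 1.0, 1.0)]
    distance_spans = [(0.0, 0.5, 0.1), (0.5, 1.0, 0.4)]
    print(round(quality_score(weight_spans, distance_spans, [0.1, 0.3, 0.6, 0.9]), 1))
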
7. The audio quality assessment method according to claim 1, further comprising, before extracting the phoneme-time sequence to be evaluated corresponding to the audio to be evaluated, the following step:
and preprocessing the audio to be evaluated to obtain the preprocessed audio to be evaluated.
8. The audio quality assessment method of claim 7, wherein preprocessing the audio to be evaluated comprises:
and carrying out noise reduction processing and/or voice enhancement processing on the audio to be evaluated.
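
As an illustrative sketch of the preprocessing of claims 7 and 8, the fragment below applies pre-emphasis and a crude spectral-subtraction pass. A real system would likely use a dedicated noise-reduction or speech-enhancement front end, so this is an assumption-laden stand-in rather than the claimed processing.

    # Illustrative sketch; pre-emphasis plus a crude spectral-subtraction pass as a
    # stand-in for the noise reduction / speech enhancement of claims 7 and 8.
    import numpy as np

    def preprocess(wave: np.ndarray, sr: int, noise_seconds: float = 0.1) -> np.ndarray:
        # Pre-emphasis: mildly boosts high frequencies, a common speech front-end step.
        emphasized = np.append(wave[0], wave[1:] - 0.97 * wave[:-1])
        # Estimate a flat noise floor from the (assumed silent) leading samples.
        leading = emphasized[:max(1, int(noise_seconds * sr))]
        noise_floor = np.abs(np.fft.rfft(leading)).mean()
        # Subtract the floor from the magnitude spectrum, keep the original phase.
        spectrum = np.fft.rfft(emphasized)
        cleaned_mag = np.maximum(np.abs(spectrum) - noise_floor, 0.0)
        cleaned = np.fft.irfft(cleaned_mag * np.exp(1j * np.angle(spectrum)), len(emphasized))
        # Peak-normalize so downstream distance measures see a consistent level.
        peak = np.max(np.abs(cleaned))
        return cleaned / peak if peak > 0 else cleaned

    sr = 16000
    t = np.arange(sr) / sr
    noisy = np.sin(2 * np.pi * 300 * t) + 0.05 * np.random.randn(sr)
    print(preprocess(noisy, sr).shape)
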
9. An audio quality evaluation apparatus, comprising:
an acquisition module configured to acquire audio to be evaluated and reference audio;
an extraction module configured to extract a phoneme-time sequence to be evaluated corresponding to the audio to be evaluated and to extract a reference phoneme-time sequence corresponding to the reference audio;
a setting module configured to set a corresponding weight value for each reference phoneme in the reference phoneme-time sequence according to a preset evaluation strategy, and generate a reference phoneme-time-weight sequence;
a calculation module configured to calculate the phoneme distance between each phoneme in the phoneme-time sequence to be evaluated and the corresponding phoneme in the reference phoneme-time sequence to obtain a phoneme distance-time sequence;
a determination module configured to determine the quality evaluation score of the audio to be evaluated according to the reference phoneme-time-weight sequence and the phoneme distance-time sequence.
10. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-8 when executing the instructions.
11. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 8.
CN202011540097.7A 2020-12-23 2020-12-23 Audio quality assessment method and device Active CN112614510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011540097.7A CN112614510B (en) 2020-12-23 2020-12-23 Audio quality assessment method and device

Publications (2)

Publication Number Publication Date
CN112614510A true CN112614510A (en) 2021-04-06
CN112614510B CN112614510B (en) 2024-04-30

Family

ID=75244507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011540097.7A Active CN112614510B (en) 2020-12-23 2020-12-23 Audio quality assessment method and device

Country Status (1)

Country Link
CN (1) CN112614510B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334164A (en) * 2002-10-24 2004-11-25 Toshimasa Ishihara System for learning pronunciation and identification of english phonemes "l" and "r"
US20150302848A1 (en) * 2014-04-21 2015-10-22 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 A kind of speech analysis method, device and storage medium
CN109754784A (en) * 2017-11-02 2019-05-14 华为技术有限公司 The method of the method and speech recognition of training Filtering Model
CN107945788A (en) * 2017-11-27 2018-04-20 桂林电子科技大学 A kind of relevant Oral English Practice pronunciation error detection of text and quality score method
CN108257615A (en) * 2018-01-15 2018-07-06 北京物灵智能科技有限公司 A kind of user language appraisal procedure and system
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN109545243A (en) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluating method, device, electronic equipment and storage medium
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluating method, device, electronic equipment and storage medium
JP2020144213A (en) * 2019-03-06 2020-09-10 Kddi株式会社 Program, device and method for pronunciation evaluation using inter-model distance
CN110176249A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of appraisal procedure and device of spoken language pronunciation
CN110085261A (en) * 2019-05-16 2019-08-02 上海流利说信息技术有限公司 A kind of pronunciation correction method, apparatus, equipment and computer readable storage medium
CN110288977A (en) * 2019-06-29 2019-09-27 联想(北京)有限公司 A kind of data processing method, device and electronic equipment
CN111816210A (en) * 2020-06-23 2020-10-23 华为技术有限公司 Voice scoring method and device
CN111785299A (en) * 2020-08-13 2020-10-16 腾讯科技(深圳)有限公司 Voice evaluation method, device and equipment and computer storage medium
CN112017690A (en) * 2020-10-09 2020-12-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450429A (en) * 2021-07-26 2021-09-28 北京猿力未来科技有限公司 Track drawing method and device
CN113450429B (en) * 2021-07-26 2024-06-04 北京猿力未来科技有限公司 Track drawing method and device
CN117612566A (en) * 2023-11-16 2024-02-27 书行科技(北京)有限公司 Audio quality assessment method and related product
CN117612566B (en) * 2023-11-16 2024-05-28 书行科技(北京)有限公司 Audio quality assessment method and related product
CN117409778A (en) * 2023-12-14 2024-01-16 深圳市友杰智新科技有限公司 Decoding processing method, device, equipment and storage medium
CN117409778B (en) * 2023-12-14 2024-03-19 深圳市友杰智新科技有限公司 Decoding processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN111260761B (en) Method and device for generating mouth shape of animation character
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN112331180A (en) Spoken language evaluation method and device
CN112735371B (en) Method and device for generating speaker video based on text information
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
Borsos et al. Speechpainter: Text-conditioned speech inpainting
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN112614510B (en) Audio quality assessment method and device
CN109300339A (en) A kind of exercising method and system of Oral English Practice
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN114125506B (en) Voice auditing method and device
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Yousfi et al. Holy Qur'an speech recognition system Imaalah checking rule for warsh recitation
CN114708876A (en) Audio processing method and device, electronic equipment and storage medium
Vanderreydt et al. A Novel Channel estimate for noise robust speech recognition
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
Sabu et al. Improving the Noise Robustness of Prominence Detection for Children's Oral Reading Assessment
Mansouri et al. Human Laughter Generation using Hybrid Generative Models.
Pan et al. Mandarin vowel pronunciation quality evaluation by a novel formant classification method and its combination with traditional algorithms
CN112686041B (en) Pinyin labeling method and device
CN112786052B (en) Speech recognition method, electronic equipment and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant