CN117912450A - Voice quality detection method, related method, device, equipment and storage medium


Info

Publication number
CN117912450A
CN117912450A
Authority
CN
China
Prior art keywords
phoneme
target
pronunciation
voice
probability
Prior art date
Legal status
Pending
Application number
CN202410034643.1A
Other languages
Chinese (zh)
Inventor
杨康
李宝善
吴奎
张凯波
盛志超
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410034643.1A
Publication of CN117912450A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units


Abstract

The application discloses a voice quality detection method and a related method, device, equipment, and storage medium. The method includes the following steps: extracting acoustic features of a target voice; performing recognition based on the acoustic features to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice; obtaining, based on a reference dictionary, the phoneme pronunciation of each character in the recognition text, and obtaining a plurality of candidate pronunciation paths based on those phoneme pronunciations, where the phonemes on each candidate pronunciation path form a candidate phoneme sequence; and obtaining a detection value of the target voice with respect to at least one pronunciation quality index based on the phoneme probabilities of the audio frames and the candidate pronunciation paths. By means of this scheme, the accuracy of pronunciation quality detection for the target voice can be improved.

Description

Voice quality detection method, related method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech processing, and in particular, to a speech quality detection method, and related method, apparatus, device, and storage medium.
Background
With the development of speech recognition technology, speech can be automatically transcribed into text, which facilitates subsequent quality detection and analysis.
Illustratively, in an educational scenario, a user's spoken-language ability may be assessed through open question types such as free speaking. In the prior art, the quality of the target voice is generally determined from the recognition text transcribed from the user's target voice. However, during recognition of the target voice, auxiliary information such as context is consulted when generating the recognition text, so the recognition text cannot effectively reflect the pronunciation accuracy of the target voice, and the accuracy of pronunciation quality detection is therefore low. In view of this, how to improve the accuracy of pronunciation quality detection for a target voice is a problem to be solved.
Disclosure of Invention
The present application mainly addresses the technical problem of providing a voice quality detection method and a related method, device, equipment, and storage medium, which can improve the accuracy of pronunciation quality detection for a target voice.
In order to solve the above technical problem, a first aspect of the present application provides a voice quality detection method, including: extracting acoustic features of a target voice; performing recognition based on the acoustic features to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice; obtaining, based on a reference dictionary, the phoneme pronunciation of each character in the recognition text, and obtaining a plurality of candidate pronunciation paths based on those phoneme pronunciations, where the phonemes on each candidate pronunciation path form a candidate phoneme sequence; and obtaining a detection value of the target voice with respect to at least one pronunciation quality index based on the phoneme probabilities of the audio frames and the candidate pronunciation paths.
In order to solve the above technical problem, a second aspect of the present application provides a free-speaking evaluation method, including: obtaining a target voice of a target object speaking freely, and detecting the target voice to obtain a detection value of the target voice with respect to at least one pronunciation quality index, where the detection value is obtained by the voice quality detection method described in the first aspect; and obtaining an evaluation value of the target object's free speaking based at least on the detection value of the target voice with respect to the at least one pronunciation quality index.
In order to solve the above technical problem, a third aspect of the present application provides a voice quality detection device, including an extraction module, a recognition module, an acquisition module, and a generation module. The extraction module is configured to extract acoustic features of a target voice; the recognition module is configured to perform recognition based on the acoustic features to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice; the acquisition module is configured to obtain the phoneme pronunciation of each character in the recognition text based on a reference dictionary and to obtain a plurality of candidate pronunciation paths based on those phoneme pronunciations, where the phonemes on each candidate pronunciation path form a candidate phoneme sequence; and the generation module is configured to obtain a detection value of the target voice with respect to at least one pronunciation quality index based on the phoneme probabilities of the audio frames and the candidate pronunciation paths.
In order to solve the above technical problem, a fourth aspect of the present application provides a free-speaking evaluation device, including a detection module and an evaluation module. The detection module is configured to obtain a target voice of a target object speaking freely and detect it to obtain a detection value of the target voice with respect to at least one pronunciation quality index, the detection value being obtained by the voice quality detection device described in the third aspect; and the evaluation module is configured to obtain an evaluation value of the target object's free speaking based at least on that detection value.
In order to solve the above technical problem, a fifth aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, where the memory stores program instructions and the processor is configured to execute the program instructions to implement the voice quality detection method described in the first aspect, or to implement the free-speaking evaluation method described in the second aspect.
In order to solve the above technical problem, a sixth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being used to implement the voice quality detection method described in the first aspect, or to implement the free-speaking evaluation method described in the second aspect.
According to the above scheme, after the target voice is obtained, its acoustic features are extracted, and recognition is performed on them to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice. The phoneme pronunciation of each character in the recognition text is obtained from a reference dictionary, and a plurality of candidate pronunciation paths are obtained from those phoneme pronunciations, the phonemes on each candidate pronunciation path forming a candidate phoneme sequence. A detection value of the target voice with respect to at least one pronunciation quality index is then obtained based on the phoneme probabilities of the audio frames and the candidate pronunciation paths, thereby improving the accuracy of pronunciation quality detection for the target voice.
Drawings
FIG. 1 is a flow chart of an embodiment of the voice quality detection method of the present application;
FIG. 2 is a schematic diagram of acoustic feature recognition in an embodiment of the voice quality detection method of the present application;
FIG. 3 is a schematic diagram of full-probability-set prediction according to an embodiment of the present application;
FIG. 4 is a flow chart of an embodiment of the free-speaking evaluation method of the present application;
FIG. 5 is a flow chart of another embodiment of the free-speaking evaluation method of the present application;
FIG. 6 is a schematic frame diagram of an embodiment of the voice quality detection apparatus of the present application;
FIG. 7 is a schematic frame diagram of an embodiment of the free-speaking evaluation apparatus of the present application;
FIG. 8 is a schematic frame diagram of an embodiment of an electronic device of the present application;
FIG. 9 is a schematic frame diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, covering three cases: for example, "A and/or B" may mean that A exists alone, that A and B both exist, or that B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, "a plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flow chart of a voice quality detection method according to an embodiment of the application. Specifically, the method may include the steps of:
step S10: and extracting the acoustic characteristics of the target voice.
In the embodiments of the present disclosure, the language of the target voice is not limited; it may be, for example, Chinese or English, and further examples are not given here.
In one implementation scenario, after the target voice is collected and before its acoustic features are extracted, the target voice may be preprocessed to remove interference, for example by filtering and noise reduction; acoustic features are then extracted from the processed voice, improving the accuracy of pronunciation quality detection.
In one implementation scenario, methods for extracting the acoustic features of the target speech include, but are not limited to, MFCC (Mel-frequency cepstral coefficients), LPCC (linear prediction cepstral coefficients), and the like; reference may be made to the technical details of acoustic feature extraction, which are not repeated here for brevity.
In a specific implementation scenario, because the target speech has an indefinite length, it must first be split into a plurality of small fixed-length segments, i.e., framed, before its acoustic features are extracted. Specifically, the length of each frame may be 20 ms, though this is not limiting; the length is chosen so that a frame contains enough signal periods while the phonemes within it do not change too sharply. In addition, to avoid losing speech information at the boundaries of the time window, consecutive frames must partially overlap: when the time window is shifted along the signal, the offset is half the frame length, so each step moves the window forward by roughly half a frame, and the shifted position is taken as the time window of the next frame.
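The framing scheme described above (fixed-length frames with a half-frame hop) can be sketched as follows; the function name and the 16 kHz example are illustrative assumptions, not part of the patent.

```python
def frame_signal(signal, sample_rate, frame_ms=20.0):
    """Split a 1-D signal into fixed-length frames with 50% overlap.

    Each frame is frame_ms long (20 ms by default) and the hop is
    half a frame, matching the windowing scheme described above.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop = frame_len // 2                            # half-frame shift
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

# 1 s of audio at 16 kHz -> 20 ms frames (320 samples), 10 ms hop
sig = [0.0] * 16000
frames = frame_signal(sig, 16000)
```

With a 10 ms hop, a 1-second signal yields 99 overlapping frames rather than 50 disjoint ones, which is what prevents phoneme information at frame boundaries from being lost.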
Step S20: and identifying based on the acoustic features to obtain the phoneme probability of each audio frame in the target voice and the identification text of the target voice.
In one implementation, recognition is performed based on the acoustic features to obtain the phoneme probability of each audio frame in the target speech, where the phoneme probabilities may be used to characterize the actual pronunciation of the target speech. Phonemes are the smallest phonetic units, divided according to the natural properties of speech. From an acoustic standpoint, a phoneme is the smallest unit of speech distinguished by sound quality; from a physiological standpoint, one pronunciation action forms one phoneme. For example, "ma" contains the two pronunciation actions "m" and "a", and hence two phonemes. Sounds produced by the same pronunciation action belong to the same phoneme, and sounds produced by different pronunciation actions are different phonemes.
In a specific implementation scenario, different phoneme systems include different numbers of phonemes; for example, the 48 phonemes used in linguistics to represent speech sounds may serve as the phoneme system of the present application, describing all speech sounds in human language, including vowel phonemes, consonant phonemes, and combined phonemes, which are not enumerated here.
In a specific implementation scenario, the phoneme probability of each audio frame under the corresponding phoneme system is obtained. For example, if the phoneme system includes the five phonemes "s1", "s2", "s3", "s4", and "s5", then based on the acoustic features the phoneme probabilities of a given audio frame might be "P(s1)=0.45", "P(s2)=0.25", "P(s3)=0.15", "P(s4)=0.10", and "P(s5)=0.05", and this set of probabilities is taken as the phoneme probability of that audio frame.
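A per-frame phoneme probability set of the kind just described is typically produced by applying a softmax to the raw frame scores, so that the probabilities over the phoneme system sum to 1. A minimal sketch, assuming hypothetical logit values:

```python
import math

def phoneme_posteriors(logits, phonemes):
    """Convert one frame's raw scores into a phoneme probability set
    via a numerically stable softmax; the result sums to 1 over the
    phoneme system."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {p: e / total for p, e in zip(phonemes, exps)}

# Hypothetical logits for one audio frame over a 5-phoneme system
probs = phoneme_posteriors([2.0, 0.5, -1.0, 0.0, 0.2],
                           ["s1", "s2", "s3", "s4", "s5"])
```

The phoneme with the largest posterior (here "s1") is the frame's most likely pronunciation, which is the quantity the later matching and full-probability-set steps build on.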
In one implementation scenario, the recognition text of the target speech is obtained based on the acoustic features. As one possible implementation, an audio recognition model may be trained in advance; the model may include, but is not limited to, a network of Encoder-Decoder architecture. The target speech is input into the audio recognition model, and its output is used as the recognition text. To ensure the recognition accuracy of the audio recognition model as far as possible, sample voices may be collected and annotated with their true recognition texts; each sample voice is then processed by the audio recognition model to obtain a predicted recognition text, and the network parameters are adjusted based on the difference between the annotated and predicted recognition texts until training converges, after which the converged model processes the target voice to obtain its recognition text. For the specific processing of the audio recognition model, reference may be made to the technical details of Encoder-Decoder networks, which are not repeated here. Recognizing the target voice with a trained audio recognition model improves the accuracy of the obtained recognition text as far as possible, and hence the accuracy of pronunciation quality detection.
Referring to fig. 2, fig. 2 is a schematic diagram of acoustic feature recognition in an embodiment of the voice quality detection method of the present application. As shown in fig. 2, the phoneme probability of each audio frame in the target speech and the recognition text of the target speech are recognized by an audio recognition model comprising an encoder, a decoder, and a CTC (Connectionist Temporal Classification) module. The acoustic features of the target speech are input to the encoder for encoding, the encoder output is fed to both the CTC module and the decoder, the CTC module aligns the acoustic features with phoneme probabilities to obtain the phoneme probability of each audio frame in the target speech, and the decoder decodes its input to obtain the recognition text of the target speech.
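The CTC alignment mentioned above maps a frame-level label sequence to a phoneme sequence. A minimal sketch of greedy (best-path) CTC decoding, which is one common way to read off such an alignment; the blank symbol name is an illustrative assumption:

```python
def ctc_collapse(frame_labels, blank="<b>"):
    """Greedy CTC decoding: take the per-frame argmax labels, merge
    consecutive repeats, then drop blanks, yielding the phoneme
    sequence implied by the frame-level probabilities."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Eight frames whose argmax labels spell out "m" then "a"
seq = ctc_collapse(["<b>", "m", "m", "<b>", "a", "a", "a", "<b>"])
# seq == ["m", "a"]
```

Note how the runs of repeated labels also tell us which frames belong to each phoneme, which is exactly the frame information used in the later steps.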
Step S30: based on the reference dictionary, obtaining the phoneme pronunciation of each character in the recognition text, and based on the phoneme pronunciation of each character in the recognition text, obtaining a plurality of candidate pronunciation paths.
In one implementation, the reference dictionary includes the pronunciations of a plurality of characters; for example, in a Chinese system phonetic transcriptions correspond to Chinese characters, and in an English system phonetic symbols correspond to words. The reference dictionary can be used to look up the pronunciation of each character in the recognition text, yielding the phoneme pronunciation of each character. Based on these phoneme pronunciations, a plurality of candidate pronunciation paths can be obtained, and the phonemes on each candidate pronunciation path form a candidate phoneme sequence.
In a specific implementation scenario, the recognition text is divided into a plurality of sub-texts based on punctuation, semantic information, and the like, and each sub-text is detected separately, reducing the complexity of data processing. As one possible implementation, the recognition text is input into a semantic segmentation model, which segments the full text based on its content semantics; the model's output is taken as the sub-texts of the recognition text. Obtaining the sub-texts with a semantic segmentation model improves generation efficiency. The model architecture of the semantic segmentation model is not limited by the present application; BERT (Bidirectional Encoder Representations from Transformers) is one example.
In a specific implementation scenario, a single character in the recognition text may have multiple pronunciations, i.e., multiple phoneme pronunciations, so several candidate pronunciation paths can be obtained for the recognition text. For example, suppose the recognition text contains "character A", "character B", "character C", and "character D", where the phoneme pronunciation of "character A" is "A", that of "character B" is "B1" or "B2", that of "character C" is "C", and that of "character D" is "D1" or "D2". Then four candidate pronunciation paths are obtained: path 1 "A B1 C D1", path 2 "A B2 C D1", path 3 "A B1 C D2", and path 4 "A B2 C D2".
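The enumeration in the example above is the Cartesian product of each character's possible pronunciations. A minimal sketch (the function name is an assumption for illustration):

```python
from itertools import product

def candidate_paths(char_pronunciations):
    """Enumerate every combination of per-character phoneme
    pronunciations; each combination is one candidate pronunciation
    path for the recognition text."""
    return [" ".join(choice) for choice in product(*char_pronunciations)]

# "character B" and "character D" each have two possible pronunciations
paths = candidate_paths([["A"], ["B1", "B2"], ["C"], ["D1", "D2"]])
```

Since two characters each contribute two alternatives, this yields the 2 x 2 = 4 paths listed in the text; in general the number of paths grows multiplicatively with the number of polyphonic characters.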
In a specific implementation scenario, phoneme skipping may occur in the target speech. For example, if the phoneme pronunciation of character A in the recognition text is "a b c d" according to the reference dictionary, character A may actually be read in the target speech as "a b d", "a b c", "a b d c", "b d c a", and so on, and these variants likewise form several candidate pronunciation paths.
In one specific implementation, the phonemes on a candidate pronunciation path form a candidate phoneme sequence; for example, the candidate pronunciation path "a b c d" corresponds to the candidate phoneme sequence {a, b, c, d}.
Step S40: and obtaining a detection value of the target voice about at least one pronunciation quality index based on the phoneme probability of the audio frame and the candidate pronunciation paths.
In one implementation scenario, the phoneme pronunciation of each character of the recognition text in the reference dictionary may be used to represent the standard pronunciation of the target speech, while the phoneme probability of each audio frame in the target speech may be used to represent its actual pronunciation.
In a specific implementation scenario, based on the phoneme probabilities of the audio frames, a candidate phoneme sequence satisfying a preset condition is selected from the candidate phoneme sequences as the target phoneme sequence of the recognition text. Frame information is then acquired for each phoneme in the target phoneme sequence, comprising the frame numbers and phoneme probabilities of the audio frames in the target voice corresponding to that phoneme, and a detection value of the target voice with respect to at least one pronunciation quality index is obtained based at least on the target phoneme sequence and the frame information.
In a specific implementation scenario, the degree of match between each candidate phoneme sequence and the target speech is calculated from the phoneme probabilities of the audio frames, and the preset condition is that this degree of match is highest; the best-matching candidate is taken as the target phoneme sequence.
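The patent does not specify how the degree of match is computed; one natural realization is a Viterbi-style forced alignment, scoring each candidate sequence by the best monotonic assignment of frames to phonemes under the frame-level probabilities. A sketch under that assumption:

```python
import math

def path_score(frames, phoneme_seq):
    """Log-probability of the best monotonic alignment of the frames
    to the candidate phoneme sequence: each frame either stays on the
    current phoneme or advances to the next one (simple Viterbi DP)."""
    T, N = len(frames), len(phoneme_seq)
    dp = [[float("-inf")] * N for _ in range(T)]
    dp[0][0] = math.log(frames[0].get(phoneme_seq[0], 1e-12))
    for t in range(1, T):
        for j in range(N):
            best = dp[t - 1][j]                 # stay on phoneme j
            if j > 0:
                best = max(best, dp[t - 1][j - 1])  # advance from j-1
            if best > float("-inf"):
                dp[t][j] = best + math.log(frames[t].get(phoneme_seq[j], 1e-12))
    return dp[T - 1][N - 1]

# Three frames of (hypothetical) phoneme posteriors
frames = [{"a": 0.9, "b": 0.1},
          {"a": 0.8, "b": 0.2},
          {"a": 0.3, "b": 0.7}]
candidates = [["a", "b"], ["b", "a"]]
best = max(candidates, key=lambda s: path_score(frames, s))
```

Here the frames clearly sound like "a" then "b", so the candidate ["a", "b"] wins; the winning candidate plays the role of the target phoneme sequence, and the DP backtrace (omitted here) would supply the per-phoneme frame information.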
In a specific implementation scenario, multiple consecutive audio frames may represent the same phoneme; for example, the first to third frames may represent "phoneme 1" and the fourth to seventh frames "phoneme 2". It will be appreciated that multiple phonemes may constitute a character or word, and multiple characters or words may constitute a complete sentence.
In a specific implementation scenario, the at least one pronunciation quality index includes pronunciation accuracy. The actual pronunciation of each phoneme of the target phoneme sequence in the target speech is determined from the target phoneme sequence and the frame information, the standard pronunciation of each phoneme is obtained from the reference dictionary, and the detection value of the target speech with respect to pronunciation accuracy is obtained from the difference between the actual and standard pronunciations. Under this scheme, even when no reference text for the target voice is available, the phoneme pronunciations of the recognition text in the reference dictionary represent the standard pronunciation of the target voice and the phoneme probabilities of the audio frames represent its actual pronunciation; together they provide reference information that is as accurate as possible for obtaining the detection value, improving the accuracy of pronunciation quality detection.
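The "difference between actual and standard pronunciation" can be made concrete as a phoneme-level edit distance; the normalization into a [0, 1] accuracy score below is an illustrative assumption, not a formula from the patent.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences
    (insertions, deletions, substitutions all cost 1)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def accuracy_score(actual, standard):
    """Pronunciation accuracy as 1 minus the edit distance
    normalized by the length of the standard pronunciation."""
    if not standard:
        return 1.0
    d = edit_distance(actual, standard)
    return max(0.0, 1.0 - d / len(standard))

# Speaker skipped the phoneme "c" of the standard pronunciation
score = accuracy_score(["a", "b", "d"], ["a", "b", "c", "d"])
```

One skipped phoneme out of four gives an accuracy of 0.75, so the same machinery naturally penalizes the phoneme-skipping cases discussed earlier.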
In a specific implementation scenario, based on the frame information of the phonemes in the target phoneme sequence, a full-probability set is obtained for each phoneme under the phoneme system, i.e., the probability that the phoneme is predicted as each phoneme of the system; the actual pronunciation of the phoneme in the target speech is then obtained from this full-probability set. Representing the actual pronunciation of a phoneme by its full-probability set improves the accuracy of the actual pronunciation and makes it more convenient to correct the target speech.
Referring to fig. 3, fig. 3 is a schematic diagram of an embodiment of full-probability-set prediction in the voice quality detection method of the present application. As shown in fig. 3, the full-probability set is predicted by a pronunciation probability model comprising an encoder and a decoder. The encoder encodes the acoustic features of the target speech to obtain the phoneme probability of each audio frame; the decoder, given these frame-level phoneme probabilities and the frame information of the phonemes in the target phoneme sequence, outputs for each phoneme the full-probability set of being predicted as each phoneme in the phoneme system. For example, given the encoder output and the frame information p1, p2, ..., pn of the n phonemes in the target phoneme sequence, the decoder outputs a full-probability set over the phoneme system, e.g., the probabilities prob(p1), prob(p2), ..., prob(p48) over the 48 English phonemes, where prob(p1) + prob(p2) + ... + prob(p48) = 1.
In a specific implementation scenario, after the target phoneme's full-probability set over the phoneme system is predicted, it is judged whether the phoneme with the largest probability matches the target phoneme. If it matches, the pronunciation in the target speech is considered correct; conversely, if it does not match, the pronunciation is considered wrong.
In a specific implementation scenario, after the target phoneme's full-probability set over the phoneme system is predicted, it is judged whether the phoneme with the largest probability matches the target phoneme, and when it does not match, the user is prompted that a certain phoneme was misread as another phoneme.
In a specific implementation scenario, after the target phoneme's full-probability set is predicted, whether the target phoneme is pronounced in a standard way is judged from its probability value, for example by setting quality intervals: a probability greater than 0.8 indicates good pronunciation quality, a probability greater than 0.65 but not greater than 0.8 indicates fair quality, and a probability greater than 0.5 but not greater than 0.65 indicates poor quality; further intervals are not enumerated here.
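The quality intervals above map directly to a small threshold function; the band labels ("good", "fair", "poor", "incorrect") are illustrative names, since the patent only describes the intervals.

```python
def quality_band(prob):
    """Map a phoneme's predicted probability to a quality label
    using the example thresholds above (0.8 / 0.65 / 0.5)."""
    if prob > 0.8:
        return "good"
    if prob > 0.65:
        return "fair"
    if prob > 0.5:
        return "poor"
    return "incorrect"
```

For instance, a phoneme predicted with probability 0.7 falls in the (0.65, 0.8] interval and is reported as fair quality.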
In a specific implementation scenario, the at least one pronunciation quality index includes speech fluency. Before the detection value of the target voice with respect to the at least one pronunciation quality index is obtained from the phoneme probabilities and the candidate pronunciation paths, the number of words in the recognition text is counted; the number of frames covered by the target phoneme sequence is then obtained from the frame information of its phonemes, and the detection value of the target voice with respect to speech fluency is obtained from the frame count and the word count. Deriving speaking-rate information from the frame count and word count in this way improves the accuracy of pronunciation quality detection.
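The frame count and word count determine a speaking rate once the frame shift is fixed; the conversion below and the linear fluency mapping around a target rate are assumptions for illustration, as the patent does not give a formula.

```python
def speech_rate(num_words, num_frames, frame_shift_ms=10.0):
    """Words per second implied by the aligned frame count; the frame
    shift (10 ms here, i.e. half of a 20 ms frame) converts frames
    into a duration."""
    duration_s = num_frames * frame_shift_ms / 1000.0
    return num_words / duration_s if duration_s else 0.0

def fluency_score(num_words, num_frames, target_rate=2.5):
    """Toy fluency detection value in [0, 1]: 1 at the assumed target
    speaking rate, falling off linearly as the rate deviates."""
    rate = speech_rate(num_words, num_frames)
    return max(0.0, 1.0 - abs(rate - target_rate) / target_rate)

# 25 words over 1000 frames (10 s at a 10 ms shift) -> 2.5 words/s
score = fluency_score(25, 1000)
```

Speaking much faster or slower than the target rate pulls the score toward 0, which matches the intuition that both rushed and halting speech indicate lower fluency.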
According to the above scheme, after the target voice is obtained, its acoustic features are extracted, and recognition is performed on them to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice. The phoneme pronunciation of each character in the recognition text is obtained from the reference dictionary, and a plurality of candidate pronunciation paths are obtained from those phoneme pronunciations, the phonemes on each candidate pronunciation path forming a candidate phoneme sequence. The detection value of the target voice with respect to at least one pronunciation quality index is then obtained based on the phoneme probabilities of the audio frames and the candidate pronunciation paths, thereby improving the accuracy of pronunciation quality detection for the target voice.
Referring to fig. 4, fig. 4 is a flow chart of an embodiment of the free-speaking evaluation method of the present application.
Specifically, the method may include the steps of:
step S11: and acquiring target voice of the target object for free speaking, and detecting to obtain a detection value of the target voice about at least one pronunciation quality index.
In the embodiment of the present disclosure, the obtaining of the detection value of the target voice about the at least one pronunciation quality index may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
In one implementation scenario, free speaking is a mode of spoken-language assessment. Unlike read-aloud tasks, the answers are open-ended, there is no standard answer, and evaluation is difficult, so free speaking can better reflect the comprehensiveness of the assessment and the subject's spoken-language ability.
Step S21: and obtaining an evaluation value of the target object for free speaking based on at least the detection value of the target voice about at least one pronunciation quality index.
In one implementation scenario, the acoustic features of the target speech are encoded by an encoder to obtain the phoneme probability of each audio frame in the target speech, and the phoneme probability of each audio frame together with the frame information of the phonemes in the target phoneme sequence is decoded by a decoder to obtain, for each phoneme, a full probability set over the phoneme system. The score of each phoneme is obtained from its probability value, the score of a syllable from the scores of its phonemes, the score of a word from the scores of its syllables, and thereby the score of the text in the target speech; the evaluation value of the target object for free speaking is then obtained from the score of the text.
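The phoneme-to-text aggregation described above can be sketched as below, assuming each phoneme's score is the probability of its intended phoneme taken from the full probability set. Mean aggregation at every level is an illustrative choice; a real system might weight by duration or use a different pooling:

```python
from statistics import mean

def score_text(words):
    """Aggregate phoneme-level probabilities up to a text-level score.

    `words` is a list of words; each word is a list of syllables; each
    syllable is a list of phoneme scores (the intended phoneme's
    probability taken from the full probability set).
    """
    # phoneme scores -> syllable score -> word score -> text score
    word_scores = [mean(mean(syllable) for syllable in word)
                   for word in words]
    return mean(word_scores)
```

For example, a one-syllable word with phoneme scores 0.8 and 0.6 scores 0.7, and together with a second word scoring 1.0 the text scores 0.85.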
In one implementation scenario, after the detection value of the target voice about voice fluency is obtained, the evaluation value of the target object for free speaking can be obtained. It can be understood that the higher the detection value of the target voice about voice fluency, the more fluently the target object speaks freely and the higher the evaluation value; conversely, the lower the detection value, the less fluently the target object speaks and the lower the evaluation value.
In one implementation scenario, the free-speaking task provides reference texts, such as the free-speaking question, a reference answer, and the like. Before the evaluation value of the target object for free speaking is obtained based at least on the detection value of the target voice about at least one pronunciation quality index, the recognition text of the target voice is acquired and a plurality of key sub-texts are extracted from the reference text. A correlation value representing the degree of correlation between the target voice and the free-speaking topic is obtained from the first phoneme sequence of the recognition text and the second phoneme sequences of the key sub-texts: the higher the correlation value, the more relevant the target voice is to the free-speaking topic, and conversely the less relevant. The evaluation value of the target object for free speaking is then determined from the correlation value and the detection value of the at least one pronunciation quality index. Because the similarity between the recognition text and the reference text is judged at the phoneme-sequence level, the probability that synonyms in the recognition text lead to a poor correlation detection result is reduced, so the accuracy of detecting the quality of the target voice of the target object in free speaking can be improved.
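One way such a correlation value could be realized is approximate matching of each key sub-text's phoneme sequence inside the recognition phoneme sequence. The sliding-window edit-distance scheme below is an assumption for illustration, not the patent's prescribed algorithm:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (one-row DP)."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (a[i - 1] != b[j - 1]))
    return d[len(b)]

def topic_correlation(recog, key_seqs):
    """Mean best similarity of each key sub-text's phoneme sequence
    (second sequences) within the recognition phoneme sequence (first)."""
    def best_sim(key):
        w = len(key)
        if w == 0 or w > len(recog):
            return 0.0
        dists = (edit_distance(recog[i:i + w], key)
                 for i in range(len(recog) - w + 1))
        return 1.0 - min(dists) / w
    sims = [best_sim(k) for k in key_seqs]
    return sum(sims) / len(sims) if sims else 0.0
```

Comparing phoneme sequences rather than word strings tolerates recognition-text variants that sound alike, which is the motivation the text gives for this design.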
In one implementation scenario, a grammar score is derived from the recognition text. It can be appreciated that the higher the grammar score, the better the grammatical logic of the recognition text; conversely, the lower the grammar score, the worse the grammatical logic.
In a specific implementation scenario, an overall evaluation of the target object's free speaking is given comprehensively based on detection values such as pronunciation accuracy, spoken-language fluency, off-topic detection results, and grammar scores. For example, whether the answer is off-topic is judged first; if it is, an off-topic prompt is reported directly; otherwise, the pronunciation accuracy, spoken-language fluency, and grammar score are weighted according to the focus of the assessment to give a final score, together with a detail score for each dimension indicating the deficiencies in the target voice.
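The off-topic gate followed by weighted scoring can be sketched as follows. The off-topic threshold and the weights are illustrative assumptions; the text only says the weighting follows the focus of the assessment:

```python
def overall_evaluation(correlation, accuracy, fluency, grammar,
                       off_topic_threshold=0.3,
                       weights=(0.5, 0.3, 0.2)):
    """First judge off-topic from the correlation value; otherwise
    weight accuracy, fluency and grammar into a final score and
    report per-dimension detail scores."""
    if correlation < off_topic_threshold:
        return {"off_topic": True, "score": 0.0}
    w_acc, w_flu, w_gram = weights
    score = w_acc * accuracy + w_flu * fluency + w_gram * grammar
    return {"off_topic": False, "score": score,
            "details": {"accuracy": accuracy, "fluency": fluency,
                        "grammar": grammar}}
```

The detail scores in the returned dictionary correspond to the per-dimension feedback the text mentions for indicating deficiencies in the target voice.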
According to the scheme, after the target speech of the target object for free speaking is obtained, the phoneme pronunciation, in the reference dictionary, of each character in the recognition text can represent the standard pronunciation of the target speech, while the phoneme probability of each audio frame in the target speech can represent its actual pronunciation. Based on the standard and actual pronunciation of the target speech, reference information that is as accurate as possible can be provided for obtaining the detection value of the pronunciation quality index, which in turn provides auxiliary information that is as effective as possible for determining the evaluation value of the target object for free speaking, so the accuracy of the free-speaking evaluation of the target object can be improved.
Referring to fig. 5, fig. 5 is a flow chart of another embodiment of the free-speaking evaluation method of the present application. Specifically, the method may include the steps of:
step S10: and extracting the acoustic characteristics of the target voice.
In the embodiment of the present disclosure, step S10 may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
Step S20: and identifying based on the acoustic features to obtain the phoneme probability of each audio frame in the target voice and the identification text of the target voice.
In the embodiment of the present disclosure, step S20 may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
Step S30: based on the reference dictionary, obtaining the phoneme pronunciation of each character in the recognition text, and based on the phoneme pronunciation of each character in the recognition text, obtaining a plurality of candidate pronunciation paths.
In the embodiment of the present disclosure, step S30 may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
Step S12: and acquiring a recognition text of the target voice, and extracting a plurality of key sub-texts in the reference text of the free words.
In the embodiment of the present disclosure, step S12 may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
Step S22: and obtaining a correlation value representing the correlation degree of the target voice and the free-speaking subject based on the first phoneme sequence of the recognition text and the second phoneme sequences of the key sub-texts.
In the embodiment of the present disclosure, step S22 may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
Step S21: and determining an evaluation value of the target object for free speaking.
In the embodiment of the present disclosure, step S21 may refer to specific steps in the above disclosed embodiment, and for brevity, will not be described herein again.
According to the scheme, after the target speech of the target object for free speaking is obtained, the phoneme pronunciation, in the reference dictionary, of each character in the recognition text can represent the standard pronunciation of the target speech, while the phoneme probability of each audio frame in the target speech can represent its actual pronunciation. Based on the standard and actual pronunciation of the target speech, reference information that is as accurate as possible can be provided for obtaining the detection value of the pronunciation quality index, which in turn provides auxiliary information that is as effective as possible for determining the evaluation value of the target object for free speaking. Moreover, the probability that synonyms in the recognition text lead to a poor correlation detection result is reduced, so the accuracy of the free-speaking evaluation of the target object can be improved.
Referring to fig. 6, fig. 6 is a schematic diagram of a voice quality detecting apparatus 60 according to an embodiment of the application. As shown in fig. 6, the voice quality detecting apparatus 60 includes an extracting module 61, a recognizing module 62, an acquiring module 63, and a generating module 64, the extracting module 61 being configured to extract acoustic features of a target voice; the recognition module 62 is configured to perform recognition based on the acoustic feature, so as to obtain a phoneme probability of each audio frame in the target speech and a recognition text of the target speech; the obtaining module 63 is configured to obtain a phoneme pronunciation of each character in the recognition text based on the reference dictionary, and obtain a plurality of candidate pronunciation paths based on the phoneme pronunciation of each character in the recognition text; wherein each phoneme on the candidate pronunciation paths forms a candidate phoneme sequence; the generating module 64 is configured to obtain a detection value of the target speech with respect to at least one pronunciation quality indicator based on the phoneme probability of the audio frame and the candidate pronunciation paths.
Therefore, after the voice quality detection device 60 obtains the target voice, it extracts the acoustic features of the target voice and performs recognition on them, obtaining the phoneme probability of each audio frame in the target voice and the recognition text of the target voice; it obtains the phoneme pronunciation of each character in the recognition text from the reference dictionary and derives a plurality of candidate pronunciation paths from these phoneme pronunciations, the phonemes on each candidate pronunciation path forming a candidate phoneme sequence; and it obtains the detection value of the target voice about at least one pronunciation quality index from the phoneme probabilities of the audio frames and the plurality of candidate pronunciation paths.
In some disclosed embodiments, the generating module 64 further includes a detection value generating module (not shown) for selecting a candidate phoneme sequence as a target phoneme sequence for the recognition text based on the phoneme probabilities of the audio frames; acquiring frame information of each phoneme in a target phoneme sequence; the frame information of the phonemes comprises the frame number and the phoneme probability of the phonemes corresponding to the audio frames in the target voice; based at least on the target phoneme sequence and the frame information, a detection value of the target speech with respect to at least one pronunciation quality indicator is obtained.
In some disclosed embodiments, the detection value generation module (not shown) further includes a pronunciation accuracy generation module (not shown) configured to determine an actual pronunciation of each phoneme in the target phoneme sequence in the target speech based on the target phoneme sequence and the frame information, and obtain a standard pronunciation of each phoneme in the target phoneme sequence in the reference dictionary; and obtaining a detection value of the target voice about the pronunciation accuracy based on the difference between the actual pronunciation and the standard pronunciation of each phoneme in the target phoneme sequence.
In some disclosed embodiments, the pronunciation accuracy generation module (not shown) further includes an actual pronunciation generation module (not shown) configured to obtain, based on the frame information of the phonemes in the target phoneme sequence, a full probability set comprising the probabilities of the phoneme being predicted as each phoneme in the phoneme system; and to obtain, based on the full probability set of the phoneme, the actual pronunciation of the phoneme in the target speech.
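A compact sketch of these two modules, assuming the full probability set is a mapping from each phoneme in the system to its predicted probability. Argmax selection of the actual pronunciation and exact matching against the standard pronunciation are illustrative choices:

```python
def actual_pronunciations(full_prob_sets, phoneme_system):
    """Take each phoneme's actual pronunciation as the phoneme-system
    entry with the highest predicted probability (argmax over the
    full probability set)."""
    return [max(phoneme_system, key=probs.get)
            for probs in full_prob_sets]

def pronunciation_accuracy(actual, standard):
    """Detection value: fraction of phonemes whose actual pronunciation
    matches the reference-dictionary standard pronunciation."""
    hits = sum(a == s for a, s in zip(actual, standard))
    return hits / len(standard) if standard else 0.0
```

A finer-grained variant could score each phoneme by the probability assigned to its standard pronunciation rather than by a hard match.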
In some disclosed embodiments, the actual pronunciation generating module (not shown) further includes a model prediction module (not shown), the full probability set is predicted by a pronunciation probability model, the pronunciation probability model includes an encoder and a decoder, and the encoder is used for encoding acoustic features of the target speech to obtain phoneme probabilities of each audio frame in the target speech; and decoding the phoneme probability of each audio frame in the target speech and the frame information of the phonemes in the target phoneme sequence based on the decoder to obtain a full-probability set of the phonemes predicted as each phoneme in the phoneme system.
In some disclosed embodiments, the detection value generation module (not shown) further includes a voice fluency generation module (not shown) for statistically identifying the number of words in the text; obtaining the frame number of the target phoneme sequence based on the frame information of each phoneme in the target phoneme sequence; based on the number of frames and the number of words, a detection value of the target voice about the voice fluency is obtained.
Referring to fig. 7, fig. 7 is a schematic diagram of a frame of an embodiment of the free-speaking evaluation device 70 of the present application. As shown in fig. 7, the free-speaking evaluation device 70 includes a detection module 71 and an evaluation module 72; the detection module 71 is configured to acquire target speech of a target object speaking freely and detect it, obtaining a detection value of the target speech about at least one pronunciation quality index, the detection value being obtained by the voice quality detection device 60 of any of the above embodiments; the evaluation module 72 is configured to obtain an evaluation value of the target object for free speaking based at least on the detection value of the target speech about the at least one pronunciation quality index.
In the above scheme, after the free-speaking evaluation device 70 acquires the target speech for free speaking, the phoneme pronunciation, in the reference dictionary, of each character in the recognition text can represent the standard pronunciation of the target speech, and the phoneme probability of each audio frame in the target speech can represent its actual pronunciation.
In some disclosed embodiments, the free-speaking evaluation device 70 further includes an evaluation sub-module (not shown) for acquiring the recognition text of the target voice and extracting several key sub-texts from the reference text of the free-speaking task; obtaining, based on the first phoneme sequence of the recognition text and the second phoneme sequences of the key sub-texts, a correlation value representing the degree of correlation between the target voice and the free-speaking topic; and determining an evaluation value of the target object for free speaking based on the correlation value and the detection value of the at least one pronunciation quality index.
Referring to fig. 8, fig. 8 is a schematic diagram of a frame of an electronic device 80 according to an embodiment of the application. As shown in fig. 8, the electronic device 80 includes a memory 81 and a processor 82 coupled to each other; the memory 81 stores program instructions, and the processor 82 is configured to execute the program instructions to implement the steps of any of the above voice quality detection method embodiments or of any of the above free-speaking evaluation method embodiments. Specifically, the electronic device 80 may include, but is not limited to, a server, a desktop computer, a notebook computer, a tablet computer, a smart phone, and the like. In particular, the processor 82 is configured to control itself and the memory 81 to implement the steps of any of the above voice quality detection method embodiments or free-speaking evaluation method embodiments. The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor 82 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may be jointly implemented by integrated circuit chips.
Therefore, after the electronic device 80 obtains the target voice, it extracts the acoustic features of the target voice and performs recognition on them, obtaining the phoneme probability of each audio frame in the target voice and the recognition text of the target voice; it obtains the phoneme pronunciation of each character in the recognition text from the reference dictionary and derives a plurality of candidate pronunciation paths from these phoneme pronunciations, the phonemes on each candidate pronunciation path forming a candidate phoneme sequence; and it obtains the detection value of the target voice about at least one pronunciation quality index from the phoneme probabilities of the audio frames and the plurality of candidate pronunciation paths.
Referring to FIG. 9, FIG. 9 is a schematic diagram of a computer-readable storage medium 90 according to an embodiment of the application. The computer-readable storage medium 90 stores program instructions 91 executable by a processor, the program instructions 91 being for implementing the steps of any of the above voice quality detection method embodiments or of any of the free-speaking evaluation method embodiments.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
If the technical solution of the application involves personal information, a product applying it clearly informs the user of the personal-information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying it obtains the individual's separate consent before processing, and also satisfies the requirement of "explicit consent". For example, a clear and prominent sign may be set at a personal-information collection device such as a camera to inform that the collection range is being entered and that personal information will be collected; if the individual voluntarily enters the range, this is regarded as consent to collection. Alternatively, on the device processing the personal information, with clear identification/information informing of the processing rules, personal authorization may be obtained through a popup, or the individual may be asked to upload the personal information. The processing rules may include information such as the personal-information processor, the processing purpose, the processing method, and the types of personal information processed.

Claims (12)

1. A method for detecting speech quality, comprising:
extracting acoustic features of target voice;
Based on the acoustic characteristics, recognizing to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice;
based on a reference dictionary, obtaining the phoneme pronunciation of each character in the recognition text, and based on the phoneme pronunciation of each character in the recognition text, obtaining a plurality of candidate pronunciation paths; wherein each phoneme on the candidate pronunciation paths forms a candidate phoneme sequence;
And obtaining a detection value of the target voice about at least one pronunciation quality index based on the phoneme probability of the audio frame and the candidate pronunciation paths.
2. The method of claim 1, wherein the obtaining a detection value of the target speech with respect to at least one pronunciation quality indicator based on the phoneme probability of the audio frame and the plurality of candidate pronunciation paths comprises:
Selecting a candidate phoneme sequence as a target phoneme sequence of the recognition text based on the phoneme probability of the audio frame;
acquiring frame information of each phoneme in the target phoneme sequence; wherein the frame information of the phonemes includes a number of frames of the phonemes in the target speech corresponding to the audio frames and a phoneme probability;
And obtaining a detection value of the target voice about at least one pronunciation quality index at least based on the target phoneme sequence and the frame information.
3. The method of claim 2, wherein the at least one pronunciation quality indicator comprises a pronunciation accuracy, wherein the deriving the detection value of the target speech with respect to the at least one pronunciation quality indicator based at least on the target phoneme sequence and the frame information comprises:
Determining the actual pronunciation of each phoneme in the target phoneme sequence in the target voice based on the target phoneme sequence and the frame information, and acquiring the standard pronunciation of each phoneme in the target phoneme sequence in the reference dictionary;
And obtaining a detection value of the target voice about the pronunciation accuracy based on the difference between the actual pronunciation and the standard pronunciation of each phoneme in the target phoneme sequence.
4. A method according to claim 3, wherein said determining the actual pronunciation of each phoneme in said target phoneme sequence in said target speech based on said target phoneme sequence and said frame information comprises:
Based on the frame information of the phonemes in the target phoneme sequence, obtaining a full probability set comprising probabilities of the phonemes being predicted as each phoneme in a phoneme system;
And obtaining the actual pronunciation of the phoneme in the target voice based on the full probability set of the phoneme.
5. A method as defined in claim 4, wherein the full probability set is predicted by a pronunciation probability model, the pronunciation probability model including an encoder and a decoder, and the obtaining the full probability set comprising the probabilities of the phonemes being predicted as each phoneme in the phoneme system based on the frame information of the phonemes in the target phoneme sequence comprises:
encoding the acoustic features of the target speech based on the encoder to obtain phoneme probabilities of each audio frame in the target speech;
And decoding the phoneme probability of each audio frame in the target voice and the frame information of the phonemes in the target phoneme sequence based on the decoder to obtain a full-probability set of each phoneme in the phoneme system predicted by the phonemes.
6. The method of claim 2, wherein the at least one pronunciation quality indicator comprises a voice fluency, the method further comprising, prior to the deriving the detection value of the target speech with respect to the at least one pronunciation quality indicator based on the phoneme probability of the audio frame and the plurality of candidate pronunciation paths:
counting the number of words in the identification text;
the obtaining a detection value of the target voice about at least one pronunciation quality index based on the phoneme probability of the audio frame and the candidate pronunciation paths includes:
Obtaining the frame number of the target phoneme sequence based on the frame information of each phoneme in the target phoneme sequence;
and obtaining a detection value of the target voice about the voice fluency based on the frame number and the word number.
7. A method of freeform evaluation, comprising:
Acquiring target voice of a target object for free speaking, and detecting to obtain a detection value of the target voice about at least one pronunciation quality index; wherein the detection value is obtained by the voice quality detection method according to any one of claims 1 to 6;
And obtaining an evaluation value of the target object for free speaking based on at least the detection value of the target voice about at least one pronunciation quality index.
8. The method of claim 7, wherein prior to said deriving an evaluation of the target subject for free speaking based at least on the detected value of the target speech for at least one pronunciation quality indicator, the method further comprises:
Acquiring a recognition text of the target voice, and extracting a plurality of key sub-texts in the reference text of the free words;
Obtaining a correlation value representing the correlation degree of the target voice and the free-speaking subject based on the first phoneme sequence of the recognition text and the second phoneme sequences of the key sub-texts;
the obtaining an evaluation value of the target object for free speaking based on at least the detection value of the target voice about at least one pronunciation quality index includes:
And determining an evaluation value of the target object for free speaking based on the correlation value and the detection value of the at least one pronunciation quality index.
9. A voice quality detection apparatus, comprising:
the extraction module is used for extracting the acoustic characteristics of the target voice;
the recognition module is used for recognizing based on the acoustic characteristics to obtain the phoneme probability of each audio frame in the target voice and the recognition text of the target voice;
The acquisition module is used for acquiring the phoneme pronunciation of each character in the recognition text based on the reference dictionary and acquiring a plurality of candidate pronunciation paths based on the phoneme pronunciation of each character in the recognition text; wherein each phoneme on the candidate pronunciation paths forms a candidate phoneme sequence;
And the generation module is used for obtaining a detection value of the target voice about at least one pronunciation quality index based on the phoneme probability of the audio frame and the candidate pronunciation paths.
10. A freeform evaluation apparatus, comprising:
The detection module is used for acquiring target voice of the target object for free speaking and detecting to obtain a detection value of the target voice about at least one pronunciation quality index; wherein the detection value is obtained by the voice quality detection apparatus of claim 9;
And the evaluation module is used for obtaining an evaluation value of the target object for free speaking at least based on the detection value of the target voice about at least one pronunciation quality index.
11. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the method of detecting speech quality according to any one of claims 1 to 6 or to implement the method of evaluating free-speech according to any one of claims 7 to 8.
12. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the speech quality detection method of any of claims 1 to 6, or implement the free-speaking evaluation method of any of claims 7 to 8.
CN202410034643.1A 2024-01-09 2024-01-09 Voice quality detection method, related method, device, equipment and storage medium Pending CN117912450A (en)

Publications (1)

Publication Number Publication Date
CN117912450A 2024-04-19

Family

ID=90695726


Similar Documents

Publication Publication Date Title
CN105845134B (en) Spoken language evaluation method and system for freely reading question types
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
US6618702B1 (en) Method of and device for phone-based speaker recognition
Wang et al. An acoustic measure for word prominence in spontaneous speech
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN108766415B (en) Voice evaluation method
CN106782603B (en) Intelligent voice evaluation method and system
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
CN112927679A (en) Method for adding punctuation marks in voice recognition and voice recognition device
CN112086108B (en) Cognitive disorder prediction method, electronic device and storage device
CN110853669B (en) Audio identification method, device and equipment
CN114783464A (en) Cognitive detection method and related device, electronic equipment and storage medium
CN109697975B (en) Voice evaluation method and device
Mary et al. Searching speech databases: features, techniques and evaluation measures
Meinedo et al. Age and gender detection in the I-DASH project
CN115240655A (en) Chinese voice recognition system and method based on deep learning
Larabi-Marie-Sainte et al. A new framework for Arabic recitation using speech recognition and the Jaro Winkler algorithm
AU2020103587A4 (en) A system and a method for cross-linguistic automatic speech recognition
KR20180033875A (en) Method for translating speech signal and electronic device thereof
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN116052655A (en) Audio processing method, device, electronic equipment and readable storage medium
CN113053409B (en) Audio evaluation method and device
CN112687296B (en) Audio disfluency identification method, device, equipment and readable storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium
CN113593523B (en) Speech detection method and device based on artificial intelligence and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination