CN114420159A - Audio evaluation method and device, and non-transitory storage medium - Google Patents

Audio evaluation method and device, and non-transitory storage medium

Info

Publication number
CN114420159A
CN114420159A
Authority
CN
China
Prior art keywords
text
score
model
audio
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011083263.5A
Other languages
Chinese (zh)
Inventor
杨晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Shengtong Information Technology Co ltd
Original Assignee
Suzhou Shengtong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Shengtong Information Technology Co ltd
Priority to CN202011083263.5A
Publication of CN114420159A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/142 — Hidden Markov Models [HMMs] (speech classification or search using statistical models)
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L2015/0631 — Creating reference templates; Clustering
    • G10L2015/0633 — Creating reference templates; Clustering using lexical or orthographic knowledge sources

Abstract

An audio evaluation method and device, and a non-transitory storage medium are provided. The audio evaluation method comprises the following steps: acquiring audio data and a reference text including a first text; performing a first decoding operation on the audio data based on a first pronunciation dictionary model and a first language model to obtain a first decoded text; determining a first score according to the first decoded text and the reference text; determining a second pronunciation dictionary model according to the reference text and the first pronunciation dictionary model; determining a second language model according to the reference text, wherein the second language model is obtained by training based on the reference text; performing a second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain a second decoded text and a corresponding relation between the audio data and the second decoded text; determining a second score according to the first text, the second decoded text, and the corresponding relation; and determining a final score of the audio data according to the first score and the second score.

Description

Audio evaluation method and device, and non-transitory storage medium
Technical Field
Embodiments of the present disclosure relate to an audio evaluation method, an audio evaluation device, and a non-transitory storage medium.
Background
Spoken language communication is an important mode of interpersonal communication and plays an important role in everyday life. With continued social and economic development, people place increasingly high demands on the efficiency of spoken language learning and on the objectivity, fairness, and scalability of spoken language assessment. Traditional manual assessment of spoken proficiency cannot guarantee uniform scoring standards because of individual differences among evaluators; moreover, because it requires substantial manpower and material and financial resources, manual assessment is not suitable for large-scale spoken language tests.
With the continued maturation of speech technology, its applications in various fields have become increasingly widespread. Spoken language evaluation is one of the earliest application fields of speech technology, and a growing number of spoken-language teachers and learners are turning to spoken language evaluation technology to teach and learn spoken languages.
Disclosure of Invention
At least one embodiment of the present disclosure provides an audio evaluation method, including: acquiring audio data and a reference text, wherein the reference text includes a first text; performing a first decoding operation on the audio data based on a first pronunciation dictionary model and a first language model to obtain a first decoded text; determining a first score according to the first decoded text and the reference text; determining a second pronunciation dictionary model according to the reference text and the first pronunciation dictionary model; determining a second language model according to the reference text, wherein the second language model is obtained by training based on the reference text; performing a second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain a second decoded text and a corresponding relation between the audio data and the second decoded text; determining a second score according to the first text, the second decoded text, and the corresponding relation between the audio data and the second decoded text; and determining a final score of the audio data according to the first score and the second score.
For example, in an audio evaluation method provided by some embodiments of the present disclosure, determining the second pronunciation dictionary model according to the reference text and the first pronunciation dictionary model includes: in response to any word in the reference text not appearing in the first pronunciation dictionary model, generating a pronunciation of the any word based on the any word, and adding the any word and the pronunciation of the any word to the first pronunciation dictionary model to obtain the second pronunciation dictionary model; and in response to all words in the reference text appearing in the first pronunciation dictionary model, treating the first pronunciation dictionary model as the second pronunciation dictionary model.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, generating the pronunciation of any word based on the any word includes: processing the any word using a grapheme-to-phoneme conversion model to generate a pronunciation of the any word.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, performing the first decoding operation on the audio data based on the first pronunciation dictionary model and the first language model to obtain the first decoded text includes: constructing a first weighted finite state transducer decoding graph based on an acoustic model, a context-dependent phone model, the first pronunciation dictionary model, and the first language model; and performing the first decoding operation on the audio data by using a Viterbi algorithm based on the first weighted finite state transducer decoding graph to obtain the first decoded text.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, performing the second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain the second decoded text and the corresponding relation between the audio data and the second decoded text includes: constructing a second weighted finite state transducer decoding graph based on the acoustic model, the context-dependent phone model, the second pronunciation dictionary model, and the second language model; and performing the second decoding operation on the audio data by using a Viterbi algorithm based on the second weighted finite state transducer decoding graph to obtain the second decoded text and the corresponding relation between the audio data and the second decoded text.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, the acoustic model includes a chain model based on a time-delay neural network, or a Gaussian mixture model-hidden Markov model.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, the second language model is a unigram language model.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, determining the first score according to the first decoded text and the reference text includes: determining a degree of overlap and a longest common subsequence between the first decoded text and the reference text; and determining the first score based on the degree of overlap and the longest common subsequence.
For example, in the audio evaluation method provided in some embodiments of the present disclosure, determining the second score according to the first text, the second decoded text, and the correspondence between the audio data and the second decoded text includes: determining a second text corresponding to the first text in the second decoded text; determining an audio segment corresponding to the second text in the audio data based on the corresponding relation between the audio data and the second decoded text; and determining the second score based on the first text and the audio segment corresponding to the second text.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, the first text includes at least one text segment, and determining the second score based on the first text and the audio segment corresponding to the second text includes: determining an audio sub-segment corresponding to each word in each of the at least one text segment based on the first text and the audio segment corresponding to the second text; determining a word score of each word according to the audio sub-segment corresponding to each word in each text segment based on a pronunciation accuracy algorithm, and taking the average value of the word scores of all the words in each text segment as the segment score of that text segment; and determining the second score according to the segment score of the at least one text segment.
For example, in an audio evaluation method provided by some embodiments of the present disclosure, determining the final score of the audio data according to the first score and the second score includes: acquiring a first weight corresponding to the first score and a second weight corresponding to the second score; and determining the final score according to the first score, the first weight, the second score and the second weight, wherein the final score is represented as:
Score_Final = W1 * Score_1 + W2 * Score_2,
wherein Score_Final represents the final score, Score_1 represents the first score, Score_2 represents the second score, W1 represents the first weight, W2 represents the second weight, and W1 + W2 = 1.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, the value range of the first weight W1 is [0.3, 0.5].
For example, in some embodiments of the present disclosure, an audio evaluation method is provided in which the first text includes at least one of a number, a symbol unit, and a foreign word.
For example, in the audio evaluation method provided by some embodiments of the present disclosure, the audio data includes voice data answering a test question, the reference text includes at least one reference answer text corresponding to the test question, and each reference answer text includes the first text.
At least one embodiment of the present disclosure further provides an audio evaluating apparatus, including: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, perform the audio evaluation method provided by any embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides a non-transitory storage medium that stores non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the audio evaluation method provided by any embodiment of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
Fig. 1 is a flowchart of an audio evaluation method according to at least one embodiment of the present disclosure;
fig. 2 is a schematic block diagram of an audio evaluation device according to at least one embodiment of the present disclosure; and
fig. 3 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The present disclosure is illustrated by the following specific examples. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components have been omitted from the present disclosure. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is represented by the same or similar reference numeral in each drawing.
The semi-open question type is a common question type in spoken language evaluation; it refers to a test item in which the test system presents prompt content such as images, videos, or short texts, and requires the user to answer related questions or retell the presented content according to the prompt content. For example, in spoken language evaluation of the semi-open question type, a user's voice is converted into text by a speech recognition engine based on Automatic Speech Recognition (ASR) technology, and the text is then semantically matched against a preset reference answer to obtain a final score for the user's voice.
For example, a general speech recognition engine may include a general acoustic model (acoustic model), a general context-dependent phone model, a general pronunciation dictionary (lexicon) model, and a general language model (language model). For example, the acoustic model may be used to map a sound signal (e.g., audio data) to acoustic units (e.g., context-dependent phones, such as triphones). For example, the speech recognition engine may split the audio data into frames and extract feature information for each frame, and then input the extracted feature information into the acoustic model. For example, the acoustic model may process the above feature information (including determining the Hidden Markov Model (HMM) state corresponding to each frame, determining the transition process, etc.) to obtain a sequence of context-dependent phones (e.g., a triphone sequence). For example, the context-dependent phone model may process the above sequence of context-dependent phones to obtain a phoneme sequence (e.g., monophones). For example, the pronunciation dictionary model, which includes the words that the speech recognition engine can process (referred to as the vocabulary) together with their pronunciations, can be used to convert the phoneme sequence into a word sequence. For example, the language model may be used to filter word sequences to obtain grammatical sentences (e.g., word strings made up of multiple words).
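As a minimal illustration of the data flow just described, the Python sketch below strings the four general models together; all callables are hypothetical placeholders introduced purely for illustration and do not correspond to the API of any real speech recognition engine.

    def decode(feature_frames, acoustic_model, cd_phone_model, lexicon_model, language_model):
        # 1. acoustic model: per-frame features -> context-dependent phone (e.g. triphone) sequence
        triphones = acoustic_model(feature_frames)
        # 2. context-dependent phone model: triphone sequence -> monophone (phoneme) sequence
        phonemes = cd_phone_model(triphones)
        # 3. pronunciation dictionary model: phoneme sequence -> candidate word sequences
        word_candidates = lexicon_model(phonemes)
        # 4. language model: keep a grammatical word string as the decoded text
        return language_model(word_candidates)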
In the evaluation of the semi-open question type, when a special word exists in the reference answer, the general speech recognition engine may fail to recognize that word even if the user answers it correctly, resulting in a low score (especially when the special word is a keyword of the reference answer). For example, special words include, but are not limited to, unknown words (also called out-of-vocabulary or OOV words), uncommon words, and the like. For example, unknown words are words that do not appear in the vocabulary used by the speech recognition engine, and uncommon words are words that appear only rarely in the corpus used to build the language model.
Take, for example, the test question "What do people usually eat on the Dragon Boat Festival?", with the reference answer "Zongzi" or "People usually eat zongzi" (where the keyword is "zongzi"). On the one hand, since the vocabulary in a general speech recognition engine cannot cover all words, the word "zongzi" (a foreign word with respect to English) is likely to be an unknown word (i.e., absent from the vocabulary), so the general speech recognition engine cannot recognize it (for example, it may erroneously recognize it as "zone z"); on the other hand, even if the word "zongzi" exists in the vocabulary, the general speech recognition engine may still fail to recognize it due to factors such as the general language model (for example, the word "zongzi" is likely to be an uncommon word). Therefore, in semi-open-type spoken language evaluation based on a general speech recognition engine, if the reference answer includes a keyword that the engine cannot recognize, the score may be low even if the user answers the keyword correctly.
At least one embodiment of the present disclosure provides an audio evaluation method. The audio evaluation method includes the following steps: acquiring audio data and a reference text, wherein the reference text includes a first text; performing a first decoding operation on the audio data based on a first pronunciation dictionary model and a first language model to obtain a first decoded text; determining a first score according to the first decoded text and the reference text; determining a second pronunciation dictionary model according to the reference text and the first pronunciation dictionary model; determining a second language model according to the reference text, wherein the second language model is obtained by training based on the reference text; performing a second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain a second decoded text and a corresponding relation between the audio data and the second decoded text; determining a second score according to the first text, the second decoded text, and the corresponding relation between the audio data and the second decoded text; and determining a final score of the audio data according to the first score and the second score.
Some embodiments of the present disclosure also provide an audio evaluation device and a non-transitory storage medium corresponding to the audio evaluation method described above.
The audio evaluation method provided by embodiments of the present disclosure can determine a first score of the audio data based on the first pronunciation dictionary model and the first language model in combination with the reference text, determine a second pronunciation dictionary model and a second language model according to the reference text (for example, a reference answer text), and determine a second score of the audio data for the first text in the reference text (for example, a keyword in the reference answer text). The final score of the audio data is then determined by combining the first score and the second score. This avoids the problem that the final score may be low if the first score were used directly as the final score (because the first text may fail to be recognized in the process of determining the first score), and thus provides a more objective, more reasonable, and more accurate evaluation result for the evaluation of the audio data (for example, semi-open-type spoken language evaluation), with high practicability.
Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart of an audio evaluation method according to at least one embodiment of the present disclosure. For example, the audio evaluation method may be applied to a computing device, and the computing device includes any electronic device having a computing function, such as a smartphone, a notebook computer, a tablet computer, a desktop computer, or a server, which is not limited by the embodiments of the present disclosure. For example, the computing device has a central processing unit (CPU) or a graphics processing unit (GPU), and the computing device further includes a memory. The memory is, for example, a non-volatile memory (e.g., a read-only memory (ROM)) on which the code of an operating system is stored. For example, the memory further stores code or instructions, and by executing the code or instructions, the audio evaluation method provided by the embodiments of the present disclosure can be implemented.
For example, as shown in fig. 1, the audio evaluation method includes steps S10 to S80.
Step S10: audio data and reference text are obtained, wherein the reference text comprises first text.
For example, in some embodiments, the audio evaluation method shown in fig. 1 may be applied to spoken language evaluation of the semi-open question type, but is not limited thereto. It should be noted that, for convenience and brevity of description, the present disclosure describes the audio evaluation method shown in fig. 1 as applied to spoken language evaluation of the semi-open question type, but this should not be construed as limiting the present disclosure.
For example, the audio data may include voice data for answering a test question, e.g., audio data for a user answering a test question of the semi-open question type; for example, the reference text may include at least one reference answer text corresponding to the test question, and for example, each reference answer text may include the first text. For example, the first text may include one or more keywords (or key phrases) in the reference answer text.
For example, in one specific example, the test question is "What do people usually eat on the Dragon Boat Festival?", the audio data may include voice data of the user answering the test question, the reference text may include the reference answer text "Zongzi" and/or the reference answer text "People usually eat zongzi", and the first text may include the keyword "zongzi".
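A possible in-memory representation of such a test item is sketched below; the field names and file path are illustrative assumptions and are not mandated by the method.

    # Illustrative representation of one semi-open test item (all field names are assumptions)
    test_item = {
        "question": "What do people usually eat on the Dragon Boat Festival?",
        "audio_data": "answer_recording.wav",       # the user's spoken answer
        "reference_texts": [                        # reference answer texts
            "Zongzi",
            "People usually eat zongzi",
        ],
        "first_text": ["zongzi"],                   # keyword(s) scored separately in steps S40-S70
    }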
For example, in some embodiments, the audio evaluation method shown in fig. 1 may be performed locally, e.g., by a client. In this case, the reference text may be a reference answer text pre-stored in the client, or may be a reference answer text received by the client from a server, which is not limited by the embodiments of the present disclosure; the audio data may include, but is not limited to, speech captured by an audio capture device of the client. For example, the audio data and the reference text may also be obtained by the client from the network.
For example, the client includes, but is not limited to, a smartphone, a tablet, a Personal computer, a Personal Digital Assistant (PDA), a wearable device, a head-mounted display device, etc., and for example, the audio capture device includes, but is not limited to, a microphone built in or external to the client. For example, the audio data may be recorded in real time or pre-recorded, and the embodiment of the disclosure is not limited thereto.
For example, in other embodiments, the audio evaluation method shown in fig. 1 may also be performed remotely, e.g., by a server. In this case, the server can receive the audio data uploaded by the user through the client, then perform an audio evaluation process based on a reference text pre-stored in the server, and return an evaluation result to the client for the user to refer to; of course, the reference text may not be stored in the server in advance, but uploaded to the server by the user through the client together with the audio data.
For example, the language of the reference text may be one of English, French, German, Spanish, Chinese, Japanese, Korean, and so on, and embodiments of the present disclosure include, but are not limited to, these. For example, in the case where the language of the reference text is Chinese, Japanese, Korean, or the like, "a word" in the embodiments of the present disclosure may be correspondingly understood as "a character" (e.g., a Chinese character). For example, the language used in the audio data is generally consistent with the language of the reference text, and embodiments of the present disclosure include, but are not limited to, this.
For example, the reference text is typically a sentence or phrase, e.g., each sentence or phrase comprising a number of words. Embodiments of the present disclosure include, but are not limited to, this.
Step S20: and performing a first decoding operation on the audio data based on the first pronunciation dictionary model and the first language model to obtain a first decoded text.
For example, in some embodiments, step S20 may include the following steps S21 through S22.
Step S21: constructing a first weighted finite state transducer decoding graph based on the acoustic model, the context-dependent phone model, the first pronunciation dictionary model, and the first language model;
step S22: performing the first decoding operation on the audio data using a Viterbi algorithm based on the first weighted finite state transducer decoding graph to obtain the first decoded text.
For example, the acoustic model, the context-dependent phone model, the first pronunciation dictionary model, and the first language model in step S21 may respectively adopt a general acoustic model, a general context-dependent phone model, a general pronunciation dictionary model, and a general language model (i.e., a general speech recognition engine may be adopted to perform the first decoding operation in step S20), which is not limited by the embodiments of the present disclosure. The specific details of each of the four general models can refer to the foregoing related description and related technologies in the field of natural language processing, and are not described herein again.
For example, the acoustic model may be modeled using a Gaussian Mixture Model (GMM)-Hidden Markov Model (HMM) architecture (GMM-HMM), a Deep Neural Network (DNN)-Hidden Markov Model architecture (DNN-HMM), or a Time-Delay Neural Network (TDNN) based chain model architecture. It should be noted that the embodiments of the present disclosure include but are not limited to these.
For example, the context-dependent phone model may be constructed using a clustering algorithm (e.g., a data-driven clustering algorithm or a decision-tree-based clustering algorithm), and embodiments of the present disclosure include but are not limited to these. For example, the specific technical details of constructing the context-dependent phone model using a clustering algorithm can be found in the related art in the field of natural language processing, and are not described here again.
For example, a first pronunciation dictionary model (e.g., a generic pronunciation dictionary model) can include a vocabulary of words and their pronunciations that can be processed by a generic speech recognition engine. For example, specific technical details for constructing the first pronunciation dictionary model may refer to related technologies in the field of natural language processing, and are not described herein again.
For example, the first language model (e.g., a general language model) may be a statistics-based language model or a neural-network-based language model, and embodiments of the present disclosure include, but are not limited to, these. For example, the statistics-based language model includes an N-gram language model, such as, but not limited to, the commonly used trigram (N = 3) and bigram (N = 2) language models. For example, the N-gram language model is a probability-based discriminative model; the input to the N-gram language model is an ordered sequence of N words, and the output of the N-gram language model is the joint probability of the N words. For example, the specific technical details of the N-gram language model can be found in the related art in the field of natural language processing, and are not described here again.
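As a brief illustration of the joint probability just mentioned, a trigram (N = 3) model scores a three-word sequence by the chain rule, with each conditional probability estimated from the training corpus:

    P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)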
For example, in a Neural Network-based language model, the Neural Network includes, but is not limited to, a Recurrent Neural Network (RNN), a Long-Short Term Memory (LSTM) Network, or a Bi-directional Long-Short Term Memory (Bi-LSTM) Network. For example, the specific technical details of the neural network-based language model can refer to the related art in the field of natural language processing, and are not described herein again.
For example, the Weighted Finite State Transducer (WFST) provides a unified form for representing the different knowledge sources of current large-vocabulary continuous speech recognition (LVCSR) systems, and the WFSTs representing different knowledge sources can be integrated into one WFST through composition operations. For example, in some embodiments, the acoustic model, the context-dependent phone model, the first pronunciation dictionary model, and the first language model may each be compiled into a WFST, which may be referred to as H.fst, C.fst, L.fst (e.g., L.fst_1, to distinguish it from the WFST corresponding to the subsequent second pronunciation dictionary model), and G.fst (e.g., G.fst_1, to distinguish it from the WFST corresponding to the subsequent second language model), and these may be further integrated through composition into one WFST in HCLG format (i.e., including H.fst, C.fst, L.fst_1, and G.fst_1), referred to as the "first WFST" for short. For example, the decoding graph represented by the first WFST (i.e., the first weighted finite state transducer decoding graph) may be made equivalent but more compact and efficient by various optimization operations that remove its redundant portions, so as to speed up the decoding process. For example, the specific technical details of compiling the acoustic model, the context-dependent phone model, the first pronunciation dictionary model, and the first language model into WFSTs and integrating them into one WFST in HCLG format can be found in the related art in the field of natural language processing, and are not described here again.
For example, the Viterbi algorithm performs decoding based on the idea of token passing. For example, based on the Viterbi algorithm, an optimal decoding path can be found in the first weighted finite state transducer decoding graph, so that each word in the first decoded text is determined, i.e., the first decoded text is obtained. It should be appreciated that the time boundaries of each word in the first decoded text can be obtained during the back-tracing process after the Viterbi algorithm ends, so that the audio sub-segment corresponding to each word in the first decoded text can be determined according to the time boundaries (e.g., including the start time and the end time) of each word. For example, the specific technical details of the Viterbi algorithm can be found in the related art in the field of natural language processing, and are not described here again.
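The search itself can be illustrated independently of WFSTs. The toy Python sketch below finds the lowest-cost path through a small weighted word graph by iterative token passing; it is a didactic stand-in under simplified assumptions, not the real decoder, which operates on the optimized HCLG graph with per-frame acoustic costs.

    def best_path(arcs, start, final):
        """arcs: dict mapping state -> list of (next_state, word, cost)."""
        tokens = {start: (0.0, [])}          # state -> (accumulated cost, word sequence)
        changed = True
        while changed:                       # keep relaxing arcs until no token improves
            changed = False
            for state, (cost, words) in list(tokens.items()):
                for nxt, word, arc_cost in arcs.get(state, []):
                    candidate = (cost + arc_cost, words + [word])
                    if nxt not in tokens or candidate[0] < tokens[nxt][0]:
                        tokens[nxt] = candidate
                        changed = True
        return tokens[final]

    toy_graph = {0: [(1, "zongzi", 0.4), (1, "zone", 1.2)], 1: [(2, "</s>", 0.1)]}
    print(best_path(toy_graph, start=0, final=2))    # (0.5, ['zongzi', '</s>'])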
Step S30: a first score is determined based on the first decoded text and the reference text.
For example, in some embodiments, a common text evaluation method may be employed to evaluate the first decoded text against the reference text to determine the first score. For example, common text evaluation methods include, but are not limited to, Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE).
For example, BLEU may determine the degree of agreement between the first decoded text and the reference text by calculating the degree of overlap (e.g., N-gram overlap) between the first decoded text and the reference text. For example, in some embodiments, the degree of overlap between the first decoded text and the reference text may include a unigram (N = 1) degree of overlap, a bigram (N = 2) degree of overlap, a trigram (N = 3) degree of overlap, and so on; for example, the degree of overlap between the first decoded text and the reference text may be obtained by taking a weighted average of several N-gram degrees of overlap (e.g., N = 1, 2, 3), and the first score may then be determined based on the degree of overlap between the first decoded text and the reference text.
For example, ROUGE may include ROUGE-L (where L refers to the longest common subsequence, LCS), and the like. For example, the ROUGE-L score can be obtained by determining the longest common subsequence between the first decoded text and the reference text, and the ROUGE-L score can be converted into the first score.
For example, in some embodiments, the first score may be determined by combining multiple text evaluation methods and considering multiple text evaluation indicators (e.g., the above-described degree of overlap between the first decoded text and the reference text, the ROUGE-L score, and so on); for example, a score determined from the degree of overlap between the first decoded text and the reference text and a score determined from the ROUGE-L score may be weighted-averaged to determine the first score. For example, the specific technical details of the various text evaluation methods and evaluation indicators can be found in the related art in the field of natural language processing (e.g., the related art on evaluation indicators for text generation), and are not described here again.
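A minimal, self-contained sketch of two of the indicators mentioned above, unigram overlap and the longest common subsequence used by ROUGE-L, is given below; a production implementation would additionally apply clipping, higher-order n-grams, brevity penalties, and weighting.

    def unigram_overlap(decoded, reference):
        # fraction of decoded words that also appear in the reference (a simplified 1-gram precision)
        dec, ref = decoded.lower().split(), reference.lower().split()
        return sum(1 for w in dec if w in ref) / len(dec) if dec else 0.0

    def lcs_length(decoded, reference):
        # dynamic-programming longest common subsequence, as used by ROUGE-L
        dec, ref = decoded.lower().split(), reference.lower().split()
        dp = [[0] * (len(ref) + 1) for _ in range(len(dec) + 1)]
        for i, dw in enumerate(dec, 1):
            for j, rw in enumerate(ref, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if dw == rw else max(dp[i - 1][j], dp[i][j - 1])
        return dp[-1][-1]

    decoded_text = "people usually eat zone z"
    reference_text = "people usually eat zongzi"
    print(unigram_overlap(decoded_text, reference_text))   # 0.6
    print(lcs_length(decoded_text, reference_text))        # 3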
It should be understood that embodiments of the present disclosure are not limited to the determination method of the first score; in practical applications, the operation of determining the first score in step S30 may be performed by a commonly used scoring method or any reasonable scoring method.
Step S40: and determining a second pronunciation dictionary model according to the reference text and the first pronunciation dictionary model.
For example, in some embodiments, step S40 may include the following steps S41 through S42.
Step S41: in response to any word in the reference text not appearing in the first pronunciation dictionary model, generating a pronunciation of the any word based on the any word, and adding the any word and the pronunciation of the any word to the first pronunciation dictionary model to obtain a second pronunciation dictionary model;
step S42: and in response to all words in the reference text appearing in the first pronunciation dictionary model, using the first pronunciation dictionary model as a second pronunciation dictionary model.
It should be noted that, since the vocabulary in the first pronunciation dictionary model (e.g., the universal pronunciation dictionary model) may not include all numbers (e.g., numbers representing time, year, telephone, age, etc.) and combinations of numbers and words, may not include all foreign words (especially, newly appearing foreign words), and may not include all symbol units (e.g., currency units, temperature units, etc.), any word (i.e., unknown word) that does not appear in the first pronunciation dictionary model is typically a number, a foreign word, a symbol unit, etc.
For example, in some embodiments, a first text in the reference text (e.g., a keyword in the reference answer text) may include at least one of a number, a symbol unit, and a foreign word. In this case, if the first score obtained based on steps S20 to S30 is directly used as the final score of the audio data, the final score may be low because there may be a problem in determining the first score that the first text cannot be recognized. Therefore, the audio evaluation method provided by the embodiment of the disclosure further includes steps S40 to S70 dedicated to scoring the first text in the reference text and step S80 for determining the final score by combining the first score and the second score (for specific details, reference may be made to the corresponding description of steps S40 to S80 in the disclosure), so as to evaluate the audio data more objectively, reasonably and accurately.
It should be understood that, in the embodiments of the present disclosure, whether a word is an unknown word is relative to the first pronunciation dictionary model. It should also be understood that, in the embodiments of the present disclosure, whether a word is a foreign word is relative to the language of the reference text. For example, in the case where the language of the reference text is English, words such as "zongzi" and "baozi" (steamed stuffed bun) are foreign words.
For example, in some embodiments, the any word (i.e., the unknown word) may be processed using a Grapheme-to-Phoneme (G2P) conversion model to generate a pronunciation for the any word. For example, in some examples, the G2P transformation model may be implemented using RNN and LSTM, embodiments of the present disclosure including, but not limited to, this. For example, specific technical details of the G2P transformation model can refer to the related art in the field of natural language processing, and are not described herein again.
For example, in one specific example, take the test question "What do people usually eat on the Dragon Boat Festival?" with reference answer texts "Zongzi" and "People usually eat zongzi" (i.e., the reference text includes two reference answer texts). If the word "zongzi" does not appear in the first pronunciation dictionary model (i.e., there is an unknown word), the pronunciation of the word "zongzi" can be generated by the G2P conversion model, and the word "zongzi" and its pronunciation are added to the first pronunciation dictionary model to obtain the second pronunciation dictionary model; if all the words "zongzi", "people", "usually", and "eat" in the reference text appear in the first pronunciation dictionary model (i.e., there is no unknown word), the first pronunciation dictionary model can be used directly as the second pronunciation dictionary model.
It should be understood that, when the second pronunciation dictionary model is determined based on the step S40, the steps S41 and S42 are alternatively performed according to the determination condition.
It should be appreciated that step S40 may ensure that the second pronunciation dictionary model can be used to process unknown words.
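A hedged Python sketch of steps S41 and S42 follows. The grapheme-to-phoneme step is a trivial letter-spelling placeholder standing in for a trained G2P model, and the dictionary layout (each word mapped to a list of phonemes) is an assumption made for illustration only.

    def placeholder_g2p(word):
        # stand-in for a trained grapheme-to-phoneme model (e.g. an RNN/LSTM G2P);
        # here every letter is simply treated as one "phoneme", for illustration only
        return list(word.lower())

    def build_second_lexicon(reference_text, first_lexicon):
        # first_lexicon: dict mapping each in-vocabulary word to its pronunciation (phoneme list)
        second_lexicon = dict(first_lexicon)
        for word in reference_text.lower().split():
            if word not in second_lexicon:              # step S41: unknown word found
                second_lexicon[word] = placeholder_g2p(word)
        return second_lexicon                           # step S42: unchanged when no unknown word exists

    first_lexicon = {                                   # illustrative ARPAbet-like entries
        "people": ["p", "iy", "p", "ah", "l"],
        "usually": ["y", "uw", "zh", "ah", "l", "iy"],
        "eat": ["iy", "t"],
    }
    second_lexicon = build_second_lexicon("People usually eat zongzi", first_lexicon)
    print(second_lexicon["zongzi"])                     # ['z', 'o', 'n', 'g', 'z', 'i']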
Step S50: and determining a second language model according to the reference text, wherein the second language model is trained on the reference text.
For example, in some embodiments, the reference text may be used as the training corpus, an N-gram language model may be obtained by training on it, and this N-gram language model may be used as the second language model. For example, the N-gram language model includes, but is not limited to, a unigram (N = 1) language model, a bigram (N = 2) language model, a trigram (N = 3) language model, and so on. Thus, in some embodiments, the second language model may be a unigram language model. For example, when the audio evaluation method shown in fig. 1 is applied to spoken language evaluation of the semi-open question type, since the reference text (e.g., the reference answer text) includes relatively few words, training the second language model as a unigram language model can improve the accuracy of the second decoding operation in the subsequent step S60.
It should be noted that, unlike the first language model (e.g., the general language model), the second language model corresponds to the reference text; that is, different reference texts correspond to different second language models. In the embodiments of the present disclosure, the reference text may be any text (for example, the reference text changes with the test question), and therefore, in each execution of the audio evaluation method, the second language model generally needs to be trained based on the reference text obtained in that execution (that is, information from previous reference texts does not affect the current second language model), so as to ensure that each word in the second decoded text obtained by the subsequent second decoding operation comes from the reference text. It should be appreciated that step S50 can ensure that the second language model is able to handle unknown words and uncommon words.
For example, in some embodiments, the reference text is known prior to performing the audio evaluation method shown in FIG. 1; in this case, the second language model may be obtained by training in advance (i.e. before the audio evaluating method shown in fig. 1 is executed) based on the reference text, so that, when the operation of determining the second language model in step S50 is executed, only the second language model corresponding to the reference text needs to be directly selected.
For example, in other embodiments, the reference text is unknown prior to performing the audio evaluation method shown in FIG. 1; in this case, when the operation of determining the second language model in step S50 is performed, the second language model may be trained in real time based on the reference text.
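A minimal count-based sketch of step S50 is given below, under the assumption that a plain maximum-likelihood unigram model over the reference vocabulary is sufficient for illustration; smoothing, sentence boundaries, and back-off are omitted.

    from collections import Counter

    def train_unigram_lm(reference_texts):
        # count every word of every reference answer text and normalize to probabilities,
        # so that the second decoding operation can only output words from the reference text
        counts = Counter(w for text in reference_texts for w in text.lower().split())
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    second_lm = train_unigram_lm(["Zongzi", "People usually eat zongzi"])
    print(second_lm["zongzi"])    # 0.4 (2 of the 5 reference tokens)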
Step S60: and performing second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain a second decoded text and a corresponding relation between the audio data and the second decoded text.
For example, in some embodiments, step S60 may include the following steps S61 through S62.
Step S61: constructing a second weighted finite state transducer decoding graph based on the acoustic model, the context-dependent phone model, the second pronunciation dictionary model, and the second language model;
step S62: performing the second decoding operation on the audio data using a Viterbi algorithm based on the second weighted finite state transducer decoding graph to obtain the second decoded text and the corresponding relation between the audio data and the second decoded text.
For example, in some embodiments, the acoustic model in step S61 is generally the same model as the acoustic model in step S21; the context-dependent phonon model in step S61 is generally the same model as the context-dependent phonon model in step S21; the second pronunciation dictionary model in step S61 may be the same as or different from the first pronunciation dictionary model in step S21, for example, refer to the related description of step S40; the second language model in step S61 is generally different from the first language model in step S21, for example, reference may be made to the related description of step S50 described above.
For example, in some embodiments, the acoustic model, the context-dependent phone model, the second pronunciation dictionary model, and the second language model may each be compiled into a WFST, which may be referred to as H.fst, C.fst, L.fst (e.g., L.fst_2, to distinguish it from the WFST corresponding to the first pronunciation dictionary model), and G.fst (e.g., G.fst_2, to distinguish it from the WFST corresponding to the first language model), and these may be further integrated through composition into one WFST in HCLG format (i.e., including H.fst, C.fst, L.fst_2, and G.fst_2), referred to as the "second WFST" for short. The decoding graph represented by the second WFST (i.e., the second weighted finite state transducer decoding graph) may be made equivalent but more compact and efficient by various optimization operations that remove its redundant portions, so as to speed up the decoding process. For example, the specific technical details of compiling the acoustic model, the context-dependent phone model, the second pronunciation dictionary model, and the second language model into WFSTs and integrating them into one WFST in HCLG format can be found in the related art in the field of natural language processing, and are not described here again.
For example, based on the Viterbi algorithm, an optimal decoding path can be found in the second weighted finite state transducer decoding graph, so that each word in the second decoded text is determined, i.e., the second decoded text is obtained; likewise, the time boundaries of each word in the second decoded text can be obtained during the back-tracing process after the Viterbi algorithm ends, so that, according to the time boundaries (e.g., including the start time and the end time) of each word in the second decoded text, the audio sub-segment corresponding to each word in the second decoded text can be determined, i.e., the corresponding relation between the audio data and the second decoded text is obtained. For example, the specific technical details of the Viterbi algorithm can be found in the related art in the field of natural language processing, and are not described here again.
Step S70: and determining a second score according to the first text, the second decoded text and the corresponding relation between the audio data and the second decoded text.
For example, in some embodiments, step S70 may include the following steps S71 through S73.
Step S71: a second text in the second decoded text corresponding to the first text is determined.
For example, in some embodiments, each word in the second decoded text is from the reference text. That is, if a set is constructed with individual words in the reference text as elements, each word in the second decoded text is an element in the set.
For example, in some embodiments, the first text may include at least one text segment. For example, in the case where the first text includes one text fragment, the same portion of the second decoded text as the text fragment may be extracted as the second text; for another example, in a case where the first text includes a plurality of text segments, the same portions as the respective text segments in the second decoded text may be extracted as the second text, respectively. For example, in some embodiments, the second text generally includes at least a portion of the text segment in the first text. For example, in some embodiments, the second text may include all of the text segments in the first text, i.e., the second text is the same as the first text.
Step S72: and determining an audio segment corresponding to the second text in the audio data based on the corresponding relation between the audio data and the second decoded text.
For example, from the correspondence of the audio data obtained in step S60 with the second decoded text, an audio segment corresponding to each word in the second decoded text may be determined. For example, in some embodiments, the audio segment corresponding to the second text may be determined based on words included in the second text and the audio segment corresponding to each word in the second decoded text.
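Assuming the corresponding relation from step S60 is available as word-level time boundaries (a representation chosen here purely for illustration), the audio segment for the second text can be sliced out of the waveform as sketched below.

    def extract_segment(samples, word_times, target_words, sample_rate=16000):
        # word_times: list of (word, start_sec, end_sec) from the Viterbi back-trace
        spans = [(s, e) for w, s, e in word_times if w in target_words]
        if not spans:
            return []
        start, end = min(s for s, _ in spans), max(e for _, e in spans)
        return samples[int(start * sample_rate):int(end * sample_rate)]

    word_times = [("people", 0.00, 0.35), ("usually", 0.35, 0.80),
                  ("eat", 0.80, 1.05), ("zongzi", 1.05, 1.70)]
    samples = [0.0] * 32000                               # 2 s of dummy audio at 16 kHz
    segment = extract_segment(samples, word_times, target_words={"zongzi"})
    print(len(segment) / 16000)                           # 0.65 (seconds of audio for the keyword)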
Step S73: a second score is determined based on the first text and the audio segment corresponding to the second text.
For example, in some embodiments, the first text may include at least one text segment, in which case step S73 may include: determining an audio sub-segment corresponding to each word in each of the at least one text segment based on the first text and the audio segment corresponding to the second text; determining a word score for each word from the audio sub-segment corresponding to each word in each text segment based on a pronunciation accuracy (Goodness of Pronunciation, GOP) algorithm, and taking the average of the word scores of all words in each text segment as the segment score of that text segment; and determining the second score according to the segment score of the at least one text segment.
For example, in some examples, the pronunciation accuracy algorithm may include: extracting feature information of the audio sub-segment corresponding to each word, where the feature information includes, but is not limited to, Mel-frequency cepstral coefficients (MFCCs) and the like; then inputting the feature information into a pre-trained phoneme evaluation model for phoneme evaluation to obtain the GOP value of each phoneme in the audio sub-segment; and finally determining the phoneme score of each phoneme in the audio sub-segment based on its GOP value, and thereby determining the score of each word. On this basis, the segment score of each text segment can be determined, and the second score can then be determined. For example, the specific technical details of the pronunciation accuracy algorithm can be found in the related art in the field of natural language processing, and are not described here again.
For example, determining the second score from the segment score of the at least one text segment may include: in the case where the first text includes one text segment, taking the segment score of that text segment as the second score; and in the case where the first text includes a plurality of text segments, taking an average (e.g., arithmetic average) or a weighted average of the segment scores of the plurality of text segments as the second score.
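A small sketch of step S73, under the assumption that a word-level score in [0, 100] is already available for each word from the GOP step (the GOP back-end itself is not reproduced here):

    def segment_score(word_scores):
        # average of the word scores of all words in one text segment
        return sum(word_scores) / len(word_scores)

    def second_score(per_segment_word_scores):
        # arithmetic average of the segment scores over all text segments of the first text
        seg_scores = [segment_score(scores) for scores in per_segment_word_scores]
        return sum(seg_scores) / len(seg_scores)

    # one text segment ("zongzi") whose single word received a GOP-based word score of 82
    print(second_score([[82.0]]))    # 82.0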
Step S80: a final score for the audio data is determined based on the first score and the second score.
For example, in some embodiments, step S80 may include: acquiring a first weight corresponding to the first score and a second weight corresponding to the second score; and determining a final score according to the first score, the first weight, the second score and the second weight. For example, in some examples, the final score may be expressed as:
Score_Final = W1 * Score_1 + W2 * Score_2,
where Score_Final represents the final score, Score_1 represents the first score, Score_2 represents the second score, W1 represents the first weight, W2 represents the second weight, and W1 + W2 = 1.
For example, in some examples, the first score Score_1 obtained in step S30 and the second score Score_2 obtained in step S70 have the same value range; for example, the value ranges of the first score Score_1 and the second score Score_2 may both be [0, 100], and correspondingly, the value range of the final score Score_Final is also [0, 100]. It should be noted that the embodiments of the present disclosure include but are not limited to this.
For example, in some examples, the value of the first weight W1 may range over [0.3, 0.5], and embodiments of the present disclosure include but are not limited to this. For example, the value of the first weight W1 may be set to 0.3, 0.35, 0.4, 0.45, 0.5, and so on. For example, the value of the first weight W1 may also be set according to actual needs, and embodiments of the present disclosure are not limited in this respect.
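A trivial sketch of the combination in step S80, with the first weight taken from the suggested range (the concrete scores are made-up examples):

    def final_score(score_1, score_2, w1=0.4):
        # W1 in [0.3, 0.5] per the text above; W2 = 1 - W1 so that W1 + W2 = 1
        assert 0.3 <= w1 <= 0.5
        return w1 * score_1 + (1 - w1) * score_2

    print(final_score(score_1=55.0, score_2=82.0))    # 71.2 with W1 = 0.4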
It should be understood that, when the audio evaluation method shown in fig. 1 is applied to spoken language evaluation of the semi-open question type, if an unknown word and/or an uncommon word exists in the reference text (e.g., the reference answer text), and especially if that word is part of the first text (e.g., a keyword of the reference answer text), the first score (obtained in steps S20 to S30) should not be used directly as the final score, because the first text may fail to be recognized while the first score is determined, which would make the final score low. To address this problem, the audio evaluation method shown in fig. 1 also scores the first text in the reference text (i.e., the keywords in the reference answer text; see steps S40 to S70) to obtain the second score, and determines the final score of the audio data by combining the first score and the second score (see step S80), so that a more objective, more reasonable, and more accurate evaluation result can be provided for the evaluation of the audio data; the method therefore has high practicability.
It should be noted that, although the embodiments of the present disclosure are described by taking English text as an example, this should not be construed as limiting the audio evaluation method provided by the embodiments of the present disclosure. The audio evaluation method provided by the embodiments of the present disclosure is applicable to the evaluation of semi-open question types in various languages such as English, French, German, Spanish, Chinese, Japanese and Korean.
It should be noted that, in the embodiments of the present disclosure, the flow of the audio evaluation method may include more or fewer operations, and these operations may be performed sequentially or in parallel. Although the flow of the audio evaluation method described above includes a plurality of operations occurring in a particular order, it should be clearly understood that the order of these operations is not limited. The audio evaluation method described above may be performed once, or may be performed a plurality of times according to a predetermined condition.
At least one embodiment of the present disclosure also provides an audio evaluation device. Fig. 2 is a schematic block diagram of an audio evaluation device according to at least one embodiment of the present disclosure. For example, as shown in fig. 2, the audio evaluation device 100 includes a memory 110 and a processor 120.
For example, the memory 110 is used for non-transitory storage of computer-readable instructions, and the processor 120 is used for executing the computer-readable instructions; when executed by the processor 120, the computer-readable instructions perform the audio evaluation method provided by any embodiment of the present disclosure.
For example, the memory 110 and the processor 120 may be in direct or indirect communication with each other. For example, in some examples, as shown in fig. 2, the audio evaluation device 100 may further include a system bus 130, and the memory 110 and the processor 120 may communicate with each other through the system bus 130; for example, the processor 120 may access the memory 110 through the system bus 130. For example, in other examples, components such as the memory 110 and the processor 120 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things based on the Internet and/or a telecommunications network, and/or any combination thereof, and the like. The wired network may communicate by using, for example, twisted pair, coaxial cable, or optical fiber transmission, and the wireless network may communicate by using, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi. The type and function of the network are not limited in the present disclosure.
For example, the processor 120 may control other components in the audio evaluation device to perform desired functions. The processor 120 may be a device having data processing capability and/or program execution capability, such as a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU). The Central Processing Unit (CPU) may be of an X86 or ARM architecture, or the like. The GPU may be integrated separately and directly onto the motherboard, or built into the north bridge chip of the motherboard; the GPU may also be built into the Central Processing Unit (CPU).
For example, the memory 110 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory, and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable Compact Disc Read Only Memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer instructions may be stored on the memory 110 and executed by the processor 120 to implement various functions. Various applications and various data, such as audio data, reference text, first decoded text, second decoded text, first score, second score, final score, and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, some of the computer instructions stored in the memory 110, when executed by the processor 120, may perform one or more steps of the audio evaluation method described above.
For example, as shown in fig. 2, the audio evaluation device 100 may further comprise an input interface 140 allowing external devices to communicate with the audio evaluation device 100. For example, the input interface 140 may be used to receive instructions from an external computer device, from a user, and the like. The audio evaluation device 100 may further comprise an output interface 150 for interconnecting the audio evaluation device 100 and one or more external devices. For example, the audio evaluation device 100 may output the audio evaluation result and the like through the output interface 150. External devices that communicate with the audio evaluation device 100 through the input interface 140 and the output interface 150 may be included in an environment that provides any type of user interface with which a user may interact. Examples of user interface types include graphical user interfaces, natural user interfaces, and the like. For example, a graphical user interface may accept input from a user using an input device such as a keyboard, a mouse, or a remote control, and provide output on an output device such as a display. Furthermore, a natural user interface may enable a user to interact with the audio evaluation device 100 in a manner free from the constraints imposed by input devices such as a keyboard, a mouse, or a remote control. Instead, a natural user interface may rely on speech recognition, touch and stylus recognition, gesture recognition on and near the screen, air gestures, head and eye tracking, speech and semantics, vision, touch, gestures, machine intelligence, and the like.
For example, in some embodiments, the audio evaluation device 100 further comprises an audio acquisition apparatus as described in embodiments of the audio evaluation method.
In addition, although the audio evaluation device 100 is shown as a single system in fig. 2, it should be understood that the audio evaluation device 100 may also be a distributed system, and may be arranged as a cloud facility (including a public cloud or a private cloud). Thus, for example, several devices may communicate via a network connection and may collectively perform the tasks described as being performed by the audio evaluation device 100. For example, in some embodiments, a test question of the semi-open question type may be obtained through a client, audio data of a user answering the test question is then collected, and the audio data is uploaded to a server; after executing the audio evaluation process based on the received audio data and the reference text pre-stored in the server, the server returns an evaluation result (e.g., the final score) to the client, so as to provide the evaluation result to the user.
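As a purely illustrative sketch of the client side of such a deployment, the endpoint URL, request field names and response format below are assumptions for illustration only and are not defined by the present disclosure.

```python
import requests

def evaluate_answer(audio_path, question_id,
                    server="https://example.com/api/evaluate"):  # hypothetical endpoint
    """Upload the recorded answer; the server runs the audio evaluation
    against its stored reference text and returns the evaluation result."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            server,
            files={"audio": f},                 # assumed field name
            data={"question_id": question_id},  # assumed field name
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()  # e.g. {"final_score": 81.5} (assumed response format)
```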
For example, for detailed description of the implementation process of the audio evaluation method, reference may be made to the related description in the above embodiment of the audio evaluation method, and repeated details are not repeated here.
For example, in some examples, the audio evaluation device may include, but is not limited to, a smartphone, a tablet computer, a personal computer, a personal digital assistant (PDA), a wearable device, a head-mounted display device, a server, and the like.
It should be noted that the audio evaluation device provided in the embodiments of the present disclosure is illustrative and not restrictive, and the audio evaluation device may further include other conventional components or structures according to practical application needs, for example, in order to implement the necessary functions of the audio evaluation device, a person skilled in the art may set other conventional components or structures according to a specific application scenario, and the embodiments of the present disclosure are not limited thereto.
For technical effects of the audio evaluation device provided by the embodiment of the present disclosure, reference may be made to corresponding descriptions about the audio evaluation method in the foregoing embodiments, and details are not repeated here.
At least one embodiment of the present disclosure also provides a non-transitory storage medium. Fig. 3 is a schematic diagram of a non-transitory storage medium according to an embodiment of the present disclosure. For example, as shown in fig. 3, the non-transitory storage medium 200 stores computer-readable instructions 201 in a non-transitory manner; when the non-transitory computer-readable instructions 201 are executed by a computer (including a processor), the audio evaluation method provided by any embodiment of the present disclosure can be performed.
For example, one or more computer instructions may be stored on the non-transitory storage medium 200. Some of the computer instructions stored on the non-transitory storage medium 200 may be, for example, instructions for implementing one or more steps of the audio evaluation method described above.
For example, the non-transitory storage medium may include a storage component of a tablet computer, a hard disk of a personal computer, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a compact disc read only memory (CD-ROM), a flash memory, or any combination of the above storage media, as well as other suitable storage media.
For technical effects of the non-transitory storage medium provided by the embodiments of the present disclosure, reference may be made to the corresponding description of the audio evaluation method in the foregoing embodiments, and details are not repeated here.
The following points need to be noted with respect to the present disclosure:
(1) In the drawings of the embodiments of the present disclosure, only the structures related to the embodiments of the present disclosure are shown; other structures may refer to general designs.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (16)

1. An audio evaluation method, comprising:
acquiring audio data and a reference text, wherein the reference text comprises a first text;
performing a first decoding operation on the audio data based on a first pronunciation dictionary model and a first language model to obtain a first decoded text;
determining a first score according to the first decoded text and the reference text;
determining a second pronunciation dictionary model according to the reference text and the first pronunciation dictionary model;
determining a second language model according to the reference text, wherein the second language model is obtained by training based on the reference text;
performing a second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain a second decoded text and a corresponding relation between the audio data and the second decoded text;
determining a second score according to the first text, the second decoded text and the corresponding relation between the audio data and the second decoded text; and
determining a final score for the audio data based on the first score and the second score.
2. The audio evaluation method of claim 1, wherein determining the second pronunciation dictionary model based on the reference text and the first pronunciation dictionary model comprises:
in response to any word in the reference text not appearing in the first pronunciation dictionary model, generating a pronunciation of the any word based on the any word, and adding the any word and the pronunciation of the any word to the first pronunciation dictionary model to obtain the second pronunciation dictionary model; and
taking the first pronunciation dictionary model as the second pronunciation dictionary model in response to all words in the reference text appearing in the first pronunciation dictionary model.
3. The audio evaluation method of claim 2, wherein generating a pronunciation for the any word based on the any word comprises:
processing the any word using a grapheme-to-phoneme conversion model to generate a pronunciation of the any word.
4. The audio evaluation method according to any one of claims 1 to 3, wherein performing the first decoding operation on the audio data based on the first pronunciation dictionary model and the first language model to obtain the first decoded text comprises:
constructing a first weighted finite state transducer decoding graph based on an acoustic model, a context-dependent phonon model, the first pronunciation dictionary model and the first language model; and
performing the first decoding operation on the audio data using a Viterbi algorithm based on the first weighted finite state transducer decoding graph to obtain the first decoded text.
5. The audio evaluation method according to claim 4, wherein performing the second decoding operation on the audio data based on the second pronunciation dictionary model and the second language model to obtain the second decoded text and the correspondence between the audio data and the second decoded text comprises:
constructing a second weighted finite state transducer decoding graph based on the acoustic model, the context dependent phonon model, the second pronunciation dictionary model, and the second language model; and
performing the second decoding operation on the audio data using a Viterbi algorithm based on the second weighted finite state transducer decoding graph to obtain the second decoded text and the correspondence between the audio data and the second decoded text.
6. The audio evaluation method of claim 4, wherein the acoustic model comprises a hidden Markov model, a Gaussian mixture model, or a chain model based on a time-delay neural network.
7. The audio evaluation method according to any one of claims 1 to 3, wherein the second language model comprises a unigram language model.
8. The audio evaluation method according to any one of claims 1 to 3, wherein determining the first score from the first decoded text and the reference text comprises:
determining a degree of overlap and a longest common subsequence between the first decoded text and the reference text; and
obtaining the first score based on the degree of overlap and the longest common subsequence.
9. The audio evaluation method according to any one of claims 1 to 3, wherein determining the second score according to the first text, the second decoded text, and the correspondence between the audio data and the second decoded text comprises:
determining a second text corresponding to the first text in the second decoded text;
determining an audio segment corresponding to the second text in the audio data based on the corresponding relation between the audio data and the second decoded text; and
determining the second score based on the first text and the audio segment corresponding to the second text.
10. The audio evaluation method of claim 9, wherein the first text comprises at least one text segment,
determining the second score based on the first text and the audio segment corresponding to the second text, including:
determining an audio sub-segment corresponding to each word in each of the at least one text segment based on the first text and the audio segment corresponding to the second text;
determining a word score of each word according to an audio sub-segment corresponding to each word in each text segment based on a pronunciation accuracy algorithm, and taking an average value of the word scores of all the words in each text segment as a segment score of each text segment; and
determining the second score according to an average of the segment scores of the at least one text segment.
11. The audio evaluation method according to any one of claims 1 to 3, wherein determining the final score of the audio data based on the first score and the second score comprises:
acquiring a first weight corresponding to the first score and a second weight corresponding to the second score; and
determining the final score based on the first score, the first weight, the second score, and the second weight,
wherein the final score is expressed as:
Score_Final = W1*Score_1 + W2*Score_2,
wherein Score_Final represents the final score, Score_1 represents the first score, Score_2 represents the second score, W1 represents the first weight, W2 represents the second weight, and W1 + W2 = 1.
12. The audio evaluation method according to claim 11, wherein a value of the first weight W1 is in the range of [0.3, 0.5].
13. The audio evaluation method according to any of claims 1-3, wherein the first text comprises at least one of a number, a symbol unit, and a foreign word.
14. The audio evaluation method according to any one of claims 1 to 3, wherein the audio data comprises speech data answering a test question, the reference texts comprise at least one reference answer text corresponding to the test question, and each of the reference answer texts comprises the first text.
15. An audio evaluation device comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform the audio evaluation method according to any of claims 1-14.
16. A non-transitory storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the instructions of the audio evaluation method of any of claims 1-14.
CN202011083263.5A 2020-10-12 2020-10-12 Audio evaluation method and device and non-transient storage medium Pending CN114420159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083263.5A CN114420159A (en) 2020-10-12 2020-10-12 Audio evaluation method and device and non-transient storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083263.5A CN114420159A (en) 2020-10-12 2020-10-12 Audio evaluation method and device and non-transient storage medium

Publications (1)

Publication Number Publication Date
CN114420159A true CN114420159A (en) 2022-04-29

Family

ID=81260457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083263.5A Pending CN114420159A (en) 2020-10-12 2020-10-12 Audio evaluation method and device and non-transient storage medium

Country Status (1)

Country Link
CN (1) CN114420159A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230112921A1 (en) * 2021-10-01 2023-04-13 Google Llc Transparent and Controllable Human-Ai Interaction Via Chaining of Machine-Learned Language Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination