WO2015062465A1 - Real-time spoken language evaluation system and method on a mobile device - Google Patents

Real-time spoken language evaluation system and method on a mobile device Download PDF

Info

Publication number
WO2015062465A1
WO2015062465A1 (PCT/CN2014/089644, CN2014089644W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
evaluated
text data
speech
pronunciation score
Prior art date
Application number
PCT/CN2014/089644
Other languages
English (en)
French (fr)
Inventor
王翌 (Wang Yi)
林晖 (Lin Hui)
胡哲人 (Hu Zheren)
Original Assignee
上海流利说信息技术有限公司 (Shanghai Liulishuo Information Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海流利说信息技术有限公司 (Shanghai Liulishuo Information Technology Co., Ltd.)
Priority to JP2016550920A (granted as JP6541673B2, ja)
Priority to US15/033,210 (published as US20160253923A1, en)
Priority to EP14859160.5A (published as EP3065119A4, en)
Publication of WO2015062465A1 (zh)

Links

Images

Classifications

    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 19/00: Teaching not covered by other main groups of this subclass
    • G09B 19/06: Foreign languages
    • G09B 5/00: Electrically-operated educational appliances
    • G09B 5/04: Electrically-operated educational appliances with audible presentation of the material to be studied
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225: Feedback of the input speech
    • G10L 15/26: Speech-to-text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/60: Speech or voice analysis techniques specially adapted for particular use for measuring the quality of voice signals

Definitions

  • The present invention relates to the field of computer technology, and in particular to a real-time spoken language evaluation system and method on a mobile device.
  • Most existing spoken language evaluation systems use a computer as the client.
  • The user records through a microphone connected to the computer.
  • The audio data is transmitted to a server over the network and evaluated by an algorithm running on the server.
  • These evaluation algorithms all run on server-side computers, where computing resources (CPU, memory, and storage) are relatively plentiful.
  • Ports of the evaluation system's client to mobile devices have mostly adopted the following solution: the mobile client collects the voice data, the voice data is transmitted over the network to a server, the spoken language evaluation algorithm runs on the server, and the evaluation result is passed back over the network to the mobile client.
  • The present invention has been made to provide a real-time spoken language evaluation system and method on a mobile device that overcome the above problems, or at least partially solve them. By completing the spoken language evaluation on the mobile device, the system's reliance on the network is reduced, the traffic consumed by message transmission between the mobile device and the server is cut, and the user receives instant spoken-language evaluation feedback, so that the evaluation system can be used to practice speaking anytime and anywhere, improving the user experience.
  • According to one aspect of the invention, a real-time spoken language evaluation system on a mobile device comprises: an acquisition module, configured to collect voice data of the speech to be evaluated, the speech containing the voice of at least one character or of a character string; a recognition module, configured to recognize the voice data collected by the acquisition module as text data; a matching module, configured to match the recognized text data against the text data of voice samples in a voice sample library to obtain a matching result; and an evaluation module, configured to obtain and output, according to a predefined evaluation strategy and the matching result produced by the matching module, a pronunciation score for at least one character or character string in the speech to be evaluated, and/or a pronunciation score for the speech to be evaluated.
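As a rough sketch only (the patent does not prescribe any particular implementation, and every class and method name below is hypothetical), the four-module pipeline could be wired together like this in Python:

```python
# Minimal sketch of the collect -> recognize -> match -> evaluate pipeline.
# All names are hypothetical, not from the patent.

class SpokenLanguageEvaluator:
    def __init__(self, recognizer, sample_library):
        self.recognizer = recognizer          # speech-to-text engine (e.g., an HMM decoder)
        self.sample_library = sample_library  # {sample_id: reference text}

    def evaluate(self, audio, sample_id):
        # Recognition: audio -> words plus per-word posterior probabilities.
        words, posteriors = self.recognizer.recognize(audio)
        # Matching: compare recognized text with the stored sample text
        # (a placeholder for the edit-distance matching described below).
        reference = self.sample_library[sample_id]
        if " ".join(words).upper() != reference.upper():
            return None  # no match, no score
        # Evaluation: per-word score = posterior * 100; sentence score = average.
        word_scores = {w: p * 100 for w, p in zip(words, posteriors)}
        sentence_score = sum(word_scores.values()) / len(word_scores)
        return word_scores, sentence_score
```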
  • Optionally, the system further includes a display module, configured to display the text data of the voice samples in the voice sample library.
  • Correspondingly, the acquisition module is further configured to collect, as the speech to be evaluated, the voice data the user inputs by reading the text data of a voice sample displayed by the display module.
  • Optionally, the system further includes: a score comparison module, configured to compare the pronunciation score of the speech to be evaluated output by the evaluation module, and/or the pronunciation scores of at least one character or string in it, with a predefined pronunciation score threshold; and a marking module, configured to mark, in the text data shown by the display module, the text whose pronunciation score is below the predefined threshold when the score of the speech to be evaluated falls below that threshold, and/or to mark the characters or strings whose pronunciation scores are below the predefined threshold when individual scores in the speech fall below it.
  • Optionally, the matching module is further configured to perform the matching calculation between the text data recognized by the recognition module and the text data of the voice samples in the voice sample library according to the Levenshtein (edit) distance algorithm, to obtain the matching result.
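For illustration, a standard dynamic-programming implementation of the Levenshtein distance (one of several possible realizations of this step; the acceptance threshold below is an assumption, not something the patent specifies) might look like:

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute ca -> cb
        previous = current
    return previous[-1]

# Illustrative acceptance rule: treat the texts as matched when the
# distance is small relative to the reference length.
recognized = "WELCOME TO LIU LI SHUO"
reference = "WELCOME TO LIU LI SHUO"
matched = levenshtein_distance(recognized, reference) <= 0.2 * len(reference)
```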
  • Optionally, the predefined evaluation strategy is: when the recognized text data matches the text data of a voice sample in the voice sample library, the posterior probability of each character or character string, obtained from the voice data during recognition, is taken as the pronunciation score of that character or string in the speech to be evaluated; the average of the pronunciation scores of all characters or strings in the speech is taken as the pronunciation score of the speech to be evaluated.
  • Optionally, the system further includes a storage module, configured to store the voice sample library, which contains at least one voice sample.
  • According to another aspect of the invention, a real-time spoken language evaluation method on a terminal device includes: collecting voice data of the speech to be evaluated, the speech containing the voice of at least one character or of a character string; recognizing the collected voice data as text data; matching the recognized text data against the text data of the voice samples in a voice sample library to obtain a matching result; and, according to a predefined evaluation strategy and the matching result, obtaining and outputting a pronunciation score for at least one character or character string in the speech to be evaluated, and/or a pronunciation score for the speech to be evaluated.
  • Optionally, before the voice data is collected, the method further includes displaying the text data of a voice sample in the voice sample library.
  • Correspondingly, the step of collecting the voice data of the speech to be evaluated becomes: collecting, as the speech to be evaluated, the voice data the user inputs by reading the displayed text data of the voice sample.
  • Optionally, the method further includes: comparing the output pronunciation score of the speech to be evaluated, and/or the pronunciation scores of at least one character or string in it, with a predefined pronunciation score threshold; when the pronunciation score of the speech falls below the threshold, marking in the displayed text data the text whose score is below the threshold; and/or, when the pronunciation score of at least one character or string in the speech falls below the threshold, marking in the displayed text data the characters or strings whose scores are below the threshold.
  • Optionally, the step of matching the recognized text data against the sample text data to obtain a matching result is: performing the matching calculation between the recognized text data and the text data of the voice samples in the voice sample library according to the Levenshtein (edit) distance algorithm.
  • In embodiments of the invention, the real-time spoken language evaluation system on the mobile device collects the voice data of the speech to be evaluated; the collected voice data is recognized as text data; the recognized text data is matched against the text data of the voice samples in the voice sample library to obtain a matching result; and, according to the predefined evaluation strategy and the matching result, the pronunciation score of the speech to be evaluated, and/or of at least one character or string in it, is obtained and output.
  • Completing the evaluation on the device reduces the dependence on the network, cuts the traffic consumed by messaging between the mobile device and the server, and gives the user instant evaluation feedback, achieving the effect that the spoken language evaluation system can be used to practice speaking anytime and anywhere.
  • FIG. 1 schematically shows a block diagram of the structure of a real-time spoken language evaluation system 100 on a mobile device according to an embodiment of the present invention; and
  • FIG. 2 schematically shows a flowchart of a real-time spoken language evaluation method 200 on a mobile device according to an embodiment of the present invention.
  • In the claims, a module for carrying out a specified function is intended to cover any way of performing that function, including, for example, (a) a combination of circuit elements that performs the function, or (b) software in any form (hence including firmware, microcode, and the like) combined with appropriate circuitry for executing that software.
  • The functions provided by the various modules are combined in the manner claimed; any module, component, or element that can provide these functions should therefore be regarded as equivalent to the modules defined in the claims.
  • As shown in FIG. 1, the real-time spoken language evaluation system 100 on the mobile device may mainly include an acquisition module 110, a recognition module 130, a matching module 150, and an evaluation module 170. It should be understood that the connection relationships between the modules shown in FIG. 1 are only an example; those skilled in the art may well adopt other connection relationships, as long as the modules can still perform the functions of the invention under them.
  • The functions of the modules may be realized using dedicated hardware, or hardware capable of executing processing in combination with appropriate software. Such hardware may include application-specific integrated circuits (ASICs), various other circuits, various processors, and so on.
  • When implemented by a processor, the functionality may be provided by a single dedicated processor, a single shared processor, or multiple independent processors, some of which may be shared.
  • Moreover, "processor" should not be understood to refer exclusively to hardware capable of executing software; it may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM), and non-volatile storage.
  • According to an embodiment of the invention, the acquisition module 110 is configured to collect the voice data of the speech to be evaluated, where the speech to be evaluated contains the voice of at least one character or of a character string.
  • Optionally, the speech to be evaluated may include any one or combination of Chinese words, English words, and Arabic numerals; it should of course be understood that embodiments of the invention do not limit the language of the speech to be evaluated.
  • In embodiments of the invention, the acquisition module 110 records the speech to be evaluated and saves its voice data.
  • Optionally, the acquisition module 110 can be an ordinary microphone, through which the user inputs the speech to be evaluated to the system 100.
  • For example, the content of the speech to be evaluated may be the following English sentence: "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo."
  • Optionally, the system 100 uses the acquisition module 110 to convert the voice data of the speech to be evaluated into an audio file in .wav format and save it, where the WAV format is a sound waveform file format. It should be understood that embodiments of the invention do not limit the specific structure of the acquisition module 110.
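As an aside, writing raw PCM samples out as a .wav file can be done with Python's standard `wave` module; the sample rate and sample width below are illustrative assumptions (16 kHz, 16-bit mono is a common speech setup), not values taken from the patent:

```python
import wave

def save_wav(path, pcm_bytes, sample_rate=16000, sample_width=2, channels=1):
    """Write raw PCM audio bytes to a .wav (sound waveform) file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(channels)      # 1 = mono
        f.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        f.setframerate(sample_rate)   # 16 kHz is common for speech
        f.writeframes(pcm_bytes)
```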
  • According to an embodiment of the invention, the recognition module 130 is configured to recognize the voice data collected by the acquisition module 110 as text data.
  • That is, the recognition module 130 can recognize the voice data of the example speech above as the following text data: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.
  • Optionally, the recognition module 130 uses a speech recognition model that is a Hidden Markov Model (HMM) with a Gaussian mixture as its output probability distribution.
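To make the acoustic model concrete: in a GMM-HMM, each state's output (emission) probability for an acoustic feature vector is a weighted mixture of Gaussians. A minimal sketch of that computation, assuming diagonal covariances (a common simplification, not mandated by the patent):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a diagonal-covariance Gaussian mixture.

    x: (d,) feature vector; weights: (k,); means, variances: (k, d).
    """
    d = x.shape[0]
    # Per-component log N(x; mu_k, diag(var_k))
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_quad
    # Log-sum-exp over components for numerical stability.
    m = np.max(log_components)
    return m + np.log(np.sum(np.exp(log_components - m)))
```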
  • The recognition module 130 can use fixed-point arithmetic to recognize the voice data collected by the acquisition module 110 as text data.
  • For example, fixed-point arithmetic can be carried out in the following ways, although it is of course not limited to these:
  • Method 1: Existing speech recognition algorithms contain many floating-point operations. A fixed-point DSP can be used to realize them: a fixed-point DSP performs integer or fractional arithmetic, its numeric format contains no exponent field, and it typically has a 16-bit or 24-bit data width. Floating-point numbers are then converted to fixed-point numbers by a number-scaling method. Scaling a number means deciding the position of the decimal point within the fixed-point word. The Q notation is a commonly used scaling method; letting the fixed-point number be x and the floating-point number be y, the Q-notation conversion from a floating-point number to a fixed-point number is x = (int)(y × 2^Q).
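A small sketch of Q-format conversion and multiplication (Q15 is assumed here because it is a common choice on 16-bit DSPs; the patent does not fix a particular Q):

```python
Q = 15                 # Q15: 1 sign bit + 15 fractional bits in a 16-bit word
SCALE = 1 << Q

def float_to_fixed(y):
    """x = (int)(y * 2^Q), saturated to the signed 16-bit range."""
    x = int(y * SCALE)  # int() truncates toward zero, like a C cast
    return max(-32768, min(32767, x))

def fixed_to_float(x):
    """Inverse conversion: y = x * 2^(-Q)."""
    return x / SCALE

def fixed_mul(a, b):
    """The product of two Q15 numbers is Q30; shift right by Q to get Q15."""
    return (a * b) >> Q

# Example: 0.5 * 0.25 computed entirely in fixed point.
a, b = float_to_fixed(0.5), float_to_fixed(0.25)
assert abs(fixed_to_float(fixed_mul(a, b)) - 0.125) < 1e-4
```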
  • Method 2: (1) define and simplify the algorithm structure; (2) identify the key variables in the functions that need to be quantized; (3) collect statistics on the key variables; (4) determine an exact representation for the key variables; and (5) determine the fixed-point format of the remaining variables.
  • Thus, in embodiments of the invention, fixed-point arithmetic can replace ordinary floating-point arithmetic, and integers can replace ordinary floating-point numbers to represent the output probability of the recognition result. Because fixed-point arithmetic does not need to define as many parameters as floating-point arithmetic, the recognition module 130 can complete the recognition process while occupying fewer system resources (CPU, memory, and storage). It should of course be understood that embodiments of the invention do not limit the specific type of recognition model the recognition module 130 uses for character recognition.
  • According to an embodiment of the invention, the matching module 150 is configured to match the text data recognized by the recognition module 130 against the text data of the voice samples in the voice sample library, obtaining a matching result.
  • Optionally, the text data of a voice sample may be text pre-stored in the voice sample library; for example, the following text data is stored in advance: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.
  • Optionally, the matching module 150 is further configured to perform the matching calculation between the recognized text data and the sample text data according to the Levenshtein (edit) distance algorithm, to obtain a matching result.
  • The matching result may be either that the text data recognized by the recognition module 130 matches the text data of a voice sample in the voice sample library, or that it does not. It should of course be understood that embodiments of the invention do not limit the matching algorithm used by the matching module 150.
  • According to an embodiment of the invention, the evaluation module 170 is configured to obtain and output, according to the predefined evaluation strategy and the matching result produced by the matching module 150, a pronunciation score for at least one character or string in the speech to be evaluated, and/or a pronunciation score for the speech to be evaluated.
  • Optionally, the predefined evaluation strategy is: when the recognized text data matches the sample text data, the posterior probability of each character or string, obtained during recognition, is taken as the pronunciation score of that character or string in the speech to be evaluated, and the average of the pronunciation scores of all characters or strings is taken as the pronunciation score of the speech to be evaluated.
  • Optionally, if the posterior probability of a character or string obtained from the voice data is p (between 0 and 1), the pronunciation score of that character or string is p × 100.
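Under this strategy the scoring step reduces to a few lines; the posterior values below are illustrative numbers, not data from the patent:

```python
def pronunciation_scores(posteriors):
    """Map per-word posterior probabilities (0..1) to scores (0..100)
    and average them to get the sentence score."""
    word_scores = {word: p * 100 for word, p in posteriors.items()}
    sentence_score = sum(word_scores.values()) / len(word_scores)
    return word_scores, sentence_score

scores, total = pronunciation_scores({"WELCOME": 0.52, "TO": 0.95, "PETER": 0.88})
# scores == {'WELCOME': 52.0, 'TO': 95.0, 'PETER': 88.0}; total ~78.3
```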
  • Taking the example sentence above, the evaluation module 170 can obtain a pronunciation score for the whole English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo.", and/or a pronunciation score for each word in it. That is, embodiments of the invention can use a unigram language model composed of the words of the sentence.
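On one simple reading, a unigram language model over the prompt's words is just a bag of words with relative frequencies; a sketch (the add-alpha smoothing constant is an assumption):

```python
from collections import Counter

def unigram_model(sentence_words, alpha=0.01):
    """Relative-frequency unigram probabilities with add-alpha smoothing."""
    counts = Counter(sentence_words)
    total = sum(counts.values())
    vocab = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab) for w, c in counts.items()}

probs = unigram_model("WELCOME TO LIU LI SHUO MY NAME IS PETER".split())
```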
  • According to embodiments of the invention, the real-time spoken language evaluation system 100 on the mobile device may also include one or more optional modules to implement additional functionality; however, these optional modules are not indispensable for the purposes of the invention, and the system 100 can fully achieve the object of the invention without them. Although these optional modules are not shown in FIG. 1, their connections to the modules described above can easily be derived by those skilled in the art from the teaching below.
  • Optionally, the system 100 further includes a display module, configured to display the text data of the voice samples in the voice sample library, for example the English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo".
  • Correspondingly, the acquisition module 110 is further configured to collect, as the speech to be evaluated, the voice data the user inputs by reading the text displayed by the display module.
  • That is, the acquisition module 110 collects the voice data of the user reading the displayed English sentence aloud.
  • Optionally, the system 100 further includes a score comparison module and a marking module.
  • The score comparison module is configured to compare the pronunciation score of the speech to be evaluated output by the evaluation module 170, and/or the pronunciation scores of at least one character or string in it, with a predefined pronunciation score threshold. Optionally, the threshold may be set to 60 points; it should of course be understood that embodiments of the invention do not limit its specific value.
  • The marking module is configured to mark, in the text data displayed by the display module, the text whose pronunciation score is below the predefined threshold when the score of the speech to be evaluated falls below it; and/or, when the pronunciation score of at least one character or string in the speech falls below the threshold, to mark in the displayed text data the characters or strings whose scores are below it.
  • Taking the example sentence above: if the score comparison module finds that the pronunciation score of "Welcome" is below the predefined threshold, "Welcome" may be marked within the whole English sentence, optionally by setting the color of "Welcome" to red.
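For instance, on a client that renders HTML, words below the threshold could be wrapped in a red span; the markup, the threshold, and the scores here are all illustrative assumptions:

```python
def mark_low_scores(words, word_scores, threshold=60):
    """Wrap words whose score is below the threshold in a red HTML span."""
    marked = []
    for w in words:
        if word_scores.get(w.upper(), 100) < threshold:
            marked.append('<span style="color:red">%s</span>' % w)
        else:
            marked.append(w)
    return " ".join(marked)

html = mark_low_scores("Welcome to Liu Li shuo".split(), {"WELCOME": 52.0})
# -> '<span style="color:red">Welcome</span> to Liu Li shuo'
```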
  • Optionally, the system 100 further includes a storage module, configured to store the voice sample library, which contains at least one voice sample, for example a sample whose content is: "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo."
  • By completing the spoken language evaluation on the client of the mobile device, embodiments of the invention not only reduce the mobile device's dependence on the network and cut the traffic consumed by message transmission between the mobile device and the server, but also give the user instant spoken-language evaluation feedback, so that the evaluation system can be used to practice spoken English anytime and anywhere.
  • Corresponding to the real-time spoken language evaluation system 100 on a mobile device described above, the present invention also provides a real-time spoken language evaluation method 200 on a mobile device.
  • FIG. 2 schematically shows a flowchart of the method 200 according to an embodiment of the present invention.
  • The method 200 includes steps S210, S230, S250, and S270.
  • The method 200 begins with step S210, in which the voice data of the speech to be evaluated is collected.
  • The speech to be evaluated contains the voice of at least one character or of a character string.
  • Optionally, the speech to be evaluated may include any one or combination of Chinese words, English words, and Arabic numerals.
  • It should of course be understood that embodiments of the invention do not limit the language of the speech to be evaluated.
  • Optionally, the user can input the speech to be evaluated to the system 100 through a microphone.
  • For example, the content of the speech to be evaluated may be the following English sentence: "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo."
  • Optionally, the system 100 uses the acquisition module 110 to convert the voice data of the speech into an audio file in .wav format and save it, where the WAV format is a sound waveform file format.
  • Then, in step S230, the collected voice data is recognized as text data. That is, the voice data of the example speech above can be recognized through step S230 as the following text data: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.
  • Optionally, the speech recognition model used is a Hidden Markov Model (HMM) with a Gaussian mixture as its output probability distribution; that is, in embodiments of the invention fixed-point arithmetic replaces ordinary floating-point arithmetic, and integers replace ordinary floating-point numbers to represent the output probability of the recognition result. It should of course be understood that embodiments of the invention do not limit the specific type of recognition model used for character recognition.
  • Then, in step S250, the recognized text data is matched against the text data of the voice samples in the voice sample library to obtain a matching result.
  • Optionally, the text data of a voice sample may be text pre-stored in the voice sample library; for example, the following text data is stored in advance: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.
  • Optionally, the matching calculation between the recognized text data and the sample text data is performed according to the Levenshtein (edit) distance algorithm; the matching result is either that the recognized text data matches the sample text data, or that it does not. The matching algorithm used is not limited in embodiments of the invention.
  • Then, in step S270, according to the predefined evaluation strategy and the matching result, a pronunciation score for at least one character or string in the speech to be evaluated, and/or a pronunciation score for the speech to be evaluated, is obtained and output.
  • Optionally, the predefined evaluation strategy is: when the recognized text data matches the sample text data, the posterior probability of each character or string, obtained during recognition, is taken as the pronunciation score of that character or string in the speech to be evaluated, and the average of the pronunciation scores of all characters or strings is taken as the pronunciation score of the speech to be evaluated.
  • Optionally, if the posterior probability of a character or string obtained from the voice data is p (between 0 and 1), the pronunciation score of that character or string is p × 100.
  • Taking the example sentence, step S270 can yield a pronunciation score for the whole English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo.", and/or a pronunciation score for each word in it. That is, embodiments of the invention can use a unigram language model composed of the words of the sentence.
  • According to embodiments of the invention, the real-time spoken language evaluation method 200 on the mobile device may further include one or more optional steps to implement additional functionality; however, these optional steps are not indispensable for the purposes of the invention, and the method 200 can fully achieve the object of the invention without them.
  • These optional steps are not shown in FIG. 2, but where they fit among the steps above can readily be derived by those skilled in the art from the teaching below. It should be noted that, unless otherwise specified, the order in which these optional steps and the steps above are executed may be chosen according to actual needs.
  • Optionally, the method 200 further includes: displaying the text data of a voice sample in the voice sample library, for example the English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo.".
  • Correspondingly, the step of collecting the voice data of the speech to be evaluated becomes: collecting, as the speech to be evaluated, the voice data the user inputs by reading the displayed text data of the voice sample.
  • Optionally, the method 200 further includes comparing the output pronunciation score of the speech to be evaluated, and/or the pronunciation scores of at least one character or string in it, with a predefined pronunciation score threshold.
  • Optionally, the predefined threshold may be set to 60 points; it should of course be understood that embodiments of the invention do not limit its specific value.
  • When the pronunciation score of the speech to be evaluated falls below the predefined threshold, the text whose score is below the threshold is marked in the displayed text data; and/or, when the pronunciation score of at least one character or string in the speech falls below the threshold, the characters or strings whose scores are below it are marked in the displayed text data.
  • Those skilled in the art will appreciate that the modules in the devices of the various embodiments can be adaptively changed and placed in one or more devices different from those of the embodiment.
  • Several modules of an embodiment may be combined into one module, unit, or component, and they may likewise be divided into multiple sub-modules, sub-units, or sub-components. Except where features and/or processes are mutually exclusive, all steps of any method, or all modules of any device, disclosed in this specification may be combined in any combination.
  • Unless expressly stated otherwise, each feature disclosed in this specification can be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
  • The various device embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two.
  • A microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the modules according to embodiments of the invention.
  • The invention can also be implemented as device programs (for example, computer programs and computer program products) for performing the methods described here.
  • The word "comprising" does not exclude the presence of modules or steps not listed in a claim.
  • The word "a" or "an" preceding a module or step does not exclude the presence of a plurality of such modules or steps. The invention can be implemented by means of hardware comprising several distinct modules, or by means of a suitably programmed computer or processor; in a device claim enumerating several modules, several of them may be embodied by one and the same hardware module.
  • The use of the terms "first", "second", "third", and so on does not denote any order; these terms may be interpreted as names.
  • The terms "connected", "coupled", and the like, when used in this specification, are defined as operatively connected in any desired form, for example mechanically, electronically, digitally, in analog form, directly, indirectly, through software, or through hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Educational Technology (AREA)
  • Educational Administration (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A real-time spoken language evaluation system and method on a mobile device. The system includes: an acquisition module (110), configured to collect voice data of the speech to be evaluated; a recognition module (130), configured to recognize the voice data collected by the acquisition module (110) as text data; a matching module (150), configured to match the text data recognized by the recognition module (130) against the text data of voice samples in a voice sample library to obtain a matching result; and an evaluation module (170), configured to obtain and output, according to a predefined evaluation strategy and the matching result produced by the matching module (150), a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated. Completing the spoken language evaluation on the mobile device not only reduces the system's dependence on the network but also gives the user instant spoken-language evaluation feedback, improving the user experience.

Description

Real-time spoken language evaluation system and method on a mobile device

Technical Field

The present invention relates to the field of computer technology, and in particular to a real-time spoken language evaluation system and method on a mobile device.

Background Art

Most existing spoken language evaluation systems use a computer as the client: the user records through a microphone connected to the computer, the audio data is transmitted over the network to a server, and an algorithm running on the server performs the evaluation. These evaluation algorithms all run on server-side computers, where computing resources (CPU, memory, and storage) are relatively plentiful.

With the spread of mobile devices, users have begun migrating from computer clients to mobile clients. Ports of the evaluation system's client to mobile devices have mostly adopted the following solution: the mobile client collects the voice data and transmits it over the network to a server; the spoken language evaluation algorithm runs on the server; and the evaluation result is passed back to the mobile client over the network.

Because the existing solution depends on a network connection, transmitting voice data over the network consumes traffic on the one hand, and on the other hand a mobile device does not have a reliable network connection at all times. Both points readily give the spoken language evaluation system a negative user experience; moreover, building and maintaining the servers for the evaluation system adds extra cost.
Summary of the Invention

In view of the above problems, the present invention is proposed in order to provide a real-time spoken language evaluation system and method on a mobile device that overcome the above problems or at least partially solve them. By completing the spoken language evaluation on the mobile device, the system's reliance on the network is reduced (that is, the traffic consumed by message transmission between the mobile device and the server is reduced), and the user also receives instant spoken-language evaluation feedback, so that the evaluation system can be used to practice speaking anytime and anywhere, improving the user experience.

According to one aspect of the present invention, a real-time spoken language evaluation system on a mobile device is provided, comprising: an acquisition module, configured to collect voice data of the speech to be evaluated, the speech to be evaluated containing the voice of at least one character or of a character string; a recognition module, configured to recognize the voice data collected by the acquisition module as text data; a matching module, configured to match the text data recognized by the recognition module against the text data of voice samples in a voice sample library to obtain a matching result; and an evaluation module, configured to obtain and output, according to a predefined evaluation strategy and the matching result produced by the matching module, a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated.

Optionally, the system further includes: a display module, configured to display the text data of the voice samples in the voice sample library;

the acquisition module being further configured to collect, as the speech to be evaluated, the voice data the user inputs by reading the text data of a voice sample displayed by the display module.

Optionally, the system further includes: a score comparison module, configured to compare the pronunciation score of the speech to be evaluated output by the evaluation module, and/or the pronunciation score of at least one character or character string in the speech to be evaluated, with a predefined pronunciation score threshold; and a marking module, configured to mark, in the text data displayed by the display module, the text data whose pronunciation score is below the predefined threshold when the pronunciation score of the speech to be evaluated falls below that threshold, and/or to mark, in the text data displayed by the display module, the characters or character strings whose pronunciation scores are below the predefined threshold when the scores of characters or strings in the speech fall below it.

Optionally, the matching module is further configured to perform the matching calculation between the text data recognized by the recognition module and the text data of the voice samples in the voice sample library according to the Levenshtein edit distance algorithm, to obtain the matching result.

Optionally, the predefined evaluation strategy is: when the recognized text data matches the text data of a voice sample in the voice sample library, taking the posterior probability of each character or character string in the text data, obtained from the voice data during recognition, as the pronunciation score of that character or string in the speech to be evaluated; and taking the average of the pronunciation scores of all characters or strings in the speech to be evaluated as the pronunciation score of the speech to be evaluated.

Optionally, the system further includes: a storage module, configured to store the voice sample library, the voice sample library containing at least one voice sample.

According to another aspect of the present invention, a real-time spoken language evaluation method on a terminal device is also provided, comprising: collecting voice data of the speech to be evaluated, the speech to be evaluated containing the voice of at least one character or of a character string; recognizing the collected voice data as text data; matching the recognized text data against the text data of voice samples in a voice sample library to obtain a matching result; and, according to a predefined evaluation strategy and the matching result, obtaining and outputting a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated.

Optionally, before the step of collecting the voice data of the speech to be evaluated, the method further includes: displaying the text data of a voice sample in the voice sample library;

correspondingly, the step of collecting the voice data becomes: collecting, as the speech to be evaluated, the voice data the user inputs by reading the displayed text data of the voice sample.

Optionally, the method further includes: comparing the output pronunciation score of the speech to be evaluated, and/or the pronunciation score of at least one character or character string in the speech to be evaluated, with a predefined pronunciation score threshold; when the pronunciation score of the speech to be evaluated falls below the threshold, marking, in the displayed text data, the text data whose pronunciation score is below the threshold; and/or, when the pronunciation score of at least one character or string in the speech falls below the threshold, marking, in the displayed text data, the characters or strings whose pronunciation scores are below it.

Optionally, the step of matching the recognized text data against the text data of the voice samples in the voice sample library to obtain a matching result is: performing the matching calculation between the recognized text data and the text data of the voice samples in the voice sample library according to the Levenshtein edit distance algorithm, to obtain the matching result.

In embodiments of the present invention, the real-time spoken language evaluation system on the mobile device collects the voice data of the speech to be evaluated; the collected voice data is then recognized as text data; the recognized text data is matched against the text data of the voice samples in the voice sample library to obtain a matching result; and, according to the predefined evaluation strategy and the matching result, the pronunciation score of the speech to be evaluated, and/or of at least one character or string in it, is obtained and output. Completing the spoken language evaluation on the client of the mobile device not only reduces the mobile device's dependence on the network and cuts the traffic consumed by messaging between the mobile device and the server, but also gives the user instant spoken-language evaluation feedback, achieving the effect that the evaluation system can be used to practice speaking anytime and anywhere.
The above is merely an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of the invention become more readily apparent, specific embodiments of the invention are set out below.

Brief Description of the Drawings

By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

FIG. 1 schematically shows a block diagram of the structure of a real-time spoken language evaluation system 100 on a mobile device according to an embodiment of the present invention; and

FIG. 2 schematically shows a flowchart of a real-time spoken language evaluation method 200 on a mobile device according to an embodiment of the present invention.
Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be realized in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the disclosure may be understood more thoroughly and so that its scope may be conveyed completely to those skilled in the art.

It should be understood that those skilled in the art can conceive of various structures which, although not explicitly described or recorded in this specification, embody the invention and fall within its spirit, principles, and scope.

All examples and conditional language cited in this specification serve the purposes of illustration and teaching, to help the reader understand the principles and concepts by which the inventors have contributed beyond the prior art, and should be understood as not being limited to the specifically cited examples and conditions.

Moreover, all statements in this specification citing principles, aspects, and embodiments of the invention, as well as their specific examples, are intended to cover their structural and functional equivalents. Such equivalents include those currently known as well as those developed in the future, that is, any development that performs the same function, regardless of structure.

Those skilled in the art should understand that the block diagrams presented in the drawings represent schematic illustrations of structures or circuits that embody the invention. Similarly, it should be understood that any flowcharts and the like presented in the drawings represent processes that can actually be executed by various computers or processors, whether or not such a computer or processor is explicitly shown in the figures.

In the claims, a module for performing a specified function is intended to cover any way of performing that function, including, for example, (a) a combination of circuit elements that performs the function, or (b) software in any form (hence including firmware, microcode, and the like) combined with appropriate circuitry for executing the software that realizes the function. The functions provided by the various modules are combined in the manner claimed, so any module, component, or element that can provide these functions should be regarded as equivalent to the modules defined in the claims.

The term "embodiment" in this specification means that a specific feature, structure, or the like described in connection with the embodiment is included in at least one embodiment of the invention; therefore, occurrences of the phrase "in an embodiment" throughout the specification do not necessarily all refer to the same embodiment.
As shown in FIG. 1, a real-time spoken language evaluation system 100 on a mobile device according to an embodiment of the present invention may mainly include: an acquisition module 110, a recognition module 130, a matching module 150, and an evaluation module 170. It should be understood that the connection relationships between the modules shown in FIG. 1 are only an example; those skilled in the art may well adopt other connection relationships, as long as the modules can still perform the functions of the invention under them.

In this specification, the functions of the modules may be realized using dedicated hardware, or hardware capable of executing processing in combination with appropriate software. Such hardware or dedicated hardware may include application-specific integrated circuits (ASICs), various other circuits, various processors, and so on. When implemented by a processor, the functionality may be provided by a single dedicated processor, a single shared processor, or multiple independent processors (some of which may be shared). Moreover, "processor" should not be understood to refer exclusively to hardware capable of executing software; it may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM), and non-volatile storage.

According to an embodiment of the invention, the acquisition module 110 is configured to collect voice data of the speech to be evaluated, where the speech to be evaluated contains the voice of at least one character or of a character string. Optionally, the speech to be evaluated may include any one or combination of Chinese words, English words, and Arabic numerals; it should of course be understood that embodiments of the invention do not limit the language of the speech to be evaluated.

In embodiments of the invention, the acquisition module 110 records the speech to be evaluated and saves its voice data. Optionally, the acquisition module 110 can be an existing microphone, through which the user inputs the speech to be evaluated to the system 100. For example, the content of the speech may be the following English sentence: "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo." Optionally, the system 100 uses the acquisition module 110 to convert the voice data of the speech into an audio file in .wav format and save it, where the WAV format is a sound waveform file format. It should of course be understood that embodiments of the invention do not limit the specific structure of the acquisition module 110.
According to an embodiment of the invention, the recognition module 130 is configured to recognize the voice data collected by the acquisition module 110 as text data.

That is, the recognition module 130 can recognize the voice data of the example speech above as the following text data: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.

Optionally, in embodiments of the invention, the recognition module 130 uses a speech recognition model that is a Hidden Markov Model (HMM) with a Gaussian mixture as its output probability distribution.

The recognition module 130 can use fixed-point arithmetic to recognize the voice data collected by the acquisition module 110 as text data. For example, fixed-point arithmetic can be carried out in the following ways, though it is of course not limited to these:

Method 1: Existing speech recognition algorithms contain many floating-point operations. A fixed-point DSP (a fixed-point DSP performs integer or fractional arithmetic, its numeric format contains no exponent field, and it typically has a 16-bit or 24-bit data width) can be used to realize the floating-point operations, with a number-scaling method then converting floating-point numbers to fixed-point numbers. Scaling a number means deciding the position of the decimal point within the fixed-point number. Q notation is a commonly used scaling method; its mechanism is: let the fixed-point number be x and the floating-point number be y; then the conversion between a Q-notation fixed-point number and a floating-point number is:

converting the floating-point number y to the fixed-point number x: x = (int)(y × 2^Q).

Method 2: (1) define and simplify the algorithm structure; (2) identify the key variables in the functions that need to be quantized; (3) collect statistics on the key variables; (4) determine an exact representation for the key variables; (5) determine the fixed-point format of the remaining variables.

It can thus be seen that, in embodiments of the invention, fixed-point arithmetic can replace ordinary floating-point arithmetic, and integers can replace ordinary floating-point numbers to represent the output probability of the recognition result. Because fixed-point arithmetic can be used in embodiments of the invention and does not need to define as many parameters as floating-point arithmetic, the recognition module 130 can complete the recognition process while occupying fewer system resources (CPU, memory, and storage). It should of course be understood that embodiments of the invention do not limit the specific type of recognition model used by the recognition module 130 for character recognition.
According to an embodiment of the invention, the matching module 150 is configured to match the text data recognized by the recognition module 130 against the text data of the voice samples in the voice sample library to obtain a matching result.

Optionally, in embodiments of the invention, the text data of the voice samples in the voice sample library may be text data pre-stored in the library; for example, the following text data is stored in the voice sample library in advance: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.

Optionally, in embodiments of the invention, the matching module 150 is further configured to perform the matching calculation between the text data recognized by the recognition module 130 and the text data of the voice samples in the voice sample library according to the Levenshtein edit distance algorithm, to obtain a matching result. The matching result may include: the recognized text data matching the sample text data, or the recognized text data not matching the sample text data. It should of course be understood that embodiments of the invention do not limit the matching algorithm used by the matching module 150.

According to an embodiment of the invention, the evaluation module 170 is configured to obtain and output, according to the predefined evaluation strategy and the matching result produced by the matching module 150, a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated.

Optionally, in embodiments of the invention, the predefined evaluation strategy is: when the recognized text data matches the sample text data, taking the posterior probability of each character or character string in the recognized text as the pronunciation score of that character or string in the speech to be evaluated, and taking the average of the pronunciation scores of all characters or strings in the speech as the pronunciation score of the speech to be evaluated.

Optionally, in embodiments of the invention, if the posterior probability of a character or character string obtained from the voice data is p (between 0 and 1), the pronunciation score of that character or string is p × 100.

Taking the example English sentence above, the evaluation module 170 can obtain a pronunciation score for the whole sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo.", and/or a pronunciation score for each word in the sentence. That is, embodiments of the invention can use a unigram language model composed of the words of the sentence.
According to embodiments of the invention, the real-time spoken language evaluation system 100 on the mobile device may further include one or more optional modules to implement additional functionality; however, these optional modules are not indispensable for the purposes of the invention, and the system 100 according to embodiments of the invention can fully achieve the object of the invention without them. Although these optional modules are not shown in FIG. 1, their connections to the modules described above can easily be derived by those skilled in the art from the teaching below.

Optionally, in embodiments of the invention, the system 100 further includes: a display module, configured to display the text data of the voice samples in the voice sample library, for example the English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo";

correspondingly, the acquisition module 110 is further configured to collect, as the speech to be evaluated, the voice data the user inputs by reading the text data of a voice sample displayed by the display module.

That is, the acquisition module 110 collects the voice data of the user reading aloud the English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo".

Optionally, in embodiments of the invention, the system 100 further includes a score comparison module and a marking module, where

the score comparison module is configured to compare the pronunciation score of the speech to be evaluated output by the evaluation module 170, and/or the pronunciation score of at least one character or character string in the speech, with a predefined pronunciation score threshold; optionally, the predefined threshold may be set to 60 points, though it should of course be understood that embodiments of the invention do not limit its specific value.

The marking module is configured to mark, in the text data displayed by the display module, the text data whose pronunciation score is below the predefined threshold when the pronunciation score of the speech to be evaluated falls below it; and/or, when the pronunciation score of at least one character or string in the speech falls below the threshold, to mark, in the displayed text data, the characters or strings whose pronunciation scores are below it.

Taking the example English sentence above: if the score comparison module finds that the pronunciation score of "Welcome" is below the predefined threshold, "Welcome" may be marked within the whole sentence, optionally by setting the color of "Welcome" to red.

Optionally, in embodiments of the invention, the system 100 further includes: a storage module, configured to store the voice sample library, the voice sample library containing at least one voice sample, for example a sample whose content is: "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo."

Through embodiments of the invention, completing the spoken language evaluation on the client of the mobile device not only reduces the mobile device's dependence on the network and cuts the traffic consumed by messaging between the mobile device and the server, but also gives the user instant spoken-language evaluation feedback, achieving the effect that the evaluation system can be used to practice speaking anytime and anywhere.
According to a second aspect of the invention, corresponding to the real-time spoken language evaluation system 100 on a mobile device described above, the present invention also provides a real-time spoken language evaluation method 200 on a mobile device.

Referring to FIG. 2, which schematically shows a flowchart of a real-time spoken language evaluation method 200 on a mobile device according to an embodiment of the invention: as shown in FIG. 2, the method 200 includes steps S210, S230, S250, and S270, and begins with step S210, in which voice data of the speech to be evaluated is collected. The speech to be evaluated contains the voice of at least one character or of a character string; optionally, it may include any one or combination of Chinese words, English words, and Arabic numerals; it should of course be understood that embodiments of the invention do not limit the language of the speech to be evaluated.

Optionally, the user can input the speech to be evaluated to the system 100 through a microphone. For example, the content of the speech may be the following English sentence: "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo." Optionally, the system 100 uses the acquisition module 110 to convert the voice data of the speech into an audio file in .wav format and save it, where the WAV format is a sound waveform file format.

Then, in step S230, the collected voice data is recognized as text data. That is, the voice data of the example speech above can be recognized through step S230 as the following text data: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.

Optionally, in embodiments of the invention, the speech recognition model used is a Hidden Markov Model (HMM) with a Gaussian mixture as its output probability distribution. That is, in embodiments of the invention, fixed-point arithmetic replaces ordinary floating-point arithmetic, and integers replace ordinary floating-point numbers to represent the output probability of the recognition result. It should of course be understood that embodiments of the invention do not limit the specific type of recognition model used for character recognition.

Then, in step S250, the recognized text data is matched against the text data of the voice samples in the voice sample library to obtain a matching result.

Optionally, in embodiments of the invention, the text data of the voice samples in the voice sample library may be text data pre-stored in the library; for example, the following text data is stored in the voice sample library in advance: WELCOME TO LIU LI SHUO! MY NAME IS PETER. I'M AN ENGLISH TEACHER AT LIU LI SHUO.

Optionally, in embodiments of the invention, in step S250 the matching calculation between the recognized text data and the sample text data is performed according to the Levenshtein edit distance algorithm, to obtain a matching result. For example, the matching result includes: the recognized text data matching the sample text data, or the recognized text data not matching the sample text data. It should of course be understood that embodiments of the invention do not limit the matching algorithm used.

Then, in step S270, according to the predefined evaluation strategy and the matching result, a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated, is obtained and output.

Optionally, in embodiments of the invention, the predefined evaluation strategy is: when the recognized text data matches the sample text data, taking the posterior probability of each character or character string in the recognized text as the pronunciation score of that character or string in the speech to be evaluated, and taking the average of the pronunciation scores of all characters or strings in the speech as the pronunciation score of the speech to be evaluated.

Optionally, in embodiments of the invention, if the posterior probability of a character or character string obtained from the voice data is p (between 0 and 1), the pronunciation score of that character or string is p × 100.

Taking the example English sentence above, step S270 can yield a pronunciation score for the whole sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo.", and/or a pronunciation score for each word in the sentence. That is, embodiments of the invention can use a unigram language model composed of the words of the sentence.
According to embodiments of the invention, the real-time spoken language evaluation method 200 on the mobile device may further include one or more optional steps to implement additional functionality; however, these optional steps are not indispensable for the purposes of the invention, and the method 200 according to embodiments of the invention can fully achieve the object of the invention without them. These optional steps are not shown in FIG. 2, but where they fit among the steps above can easily be derived by those skilled in the art from the teaching below. It should be noted that, unless otherwise specified, the order in which these optional steps and the steps above are executed may be chosen according to actual needs.

Optionally, the method 200 further includes: displaying the text data of the voice samples in the voice sample library, for example the English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo.";

correspondingly, the step of collecting the voice data of the speech to be evaluated (S210) becomes: collecting, as the speech to be evaluated, the voice data the user inputs by reading the displayed text data of the voice sample.

That is, step S210 can collect the voice data of the user reading aloud the English sentence "Welcome to Liu Li shuo! My name is Peter. I'm an English teacher at Liu Li shuo".

Optionally, the method 200 further includes: comparing the output pronunciation score of the speech to be evaluated, and/or the pronunciation score of at least one character or character string in the speech, with a predefined pronunciation score threshold; optionally, the predefined threshold may be set to 60 points, though it should of course be understood that embodiments of the invention do not limit its specific value.

When the pronunciation score of the speech to be evaluated falls below the predefined threshold, the text data whose pronunciation score is below the threshold is marked in the displayed text data; and/or, when the pronunciation score of at least one character or string in the speech falls below the threshold, the characters or strings whose pronunciation scores are below it are marked in the displayed text data.

Taking the example English sentence above: if the comparison finds that the pronunciation score of "Welcome" is below the predefined threshold, "Welcome" may be marked within the whole sentence, optionally by setting the color of "Welcome" to red.
Since the method embodiments above correspond to the device embodiments described earlier, the method embodiments are not described in further detail.

Numerous specific details are set out in this specification. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some embodiments, well-known methods, structures, and techniques are not shown in detail, so as not to obscure the reader's understanding of the principles of this specification.

Those skilled in the art will appreciate that the modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from those of the embodiment. Several modules of an embodiment may be combined into one module, unit, or component, and they may likewise be divided into multiple sub-modules, sub-units, or sub-components. Except where features and/or processes are mutually exclusive, all steps of any method, or all modules of any device, disclosed in this specification may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

The device embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the modules in a device according to embodiments of the invention. The invention may also be implemented as device programs (for example, computer programs and computer program products) for performing the methods described here.

It should be noted that the embodiments above illustrate rather than limit the invention, and those skilled in the art can devise alternative embodiments without departing from the scope of the appended claims. In the claims, the ordering of features does not imply any particular order of the features; in particular, the order of steps in a method claim does not imply that the steps must be performed in that order; rather, the steps may be performed in any suitable order. Likewise, the order in which the modules in a device claim perform processing should not be limited by the order in which the modules are recited; the processing may be performed in any suitable order. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of modules or steps not listed in a claim. The word "a" or "an" preceding a module or step does not exclude the presence of a plurality of such modules or steps. The invention may be implemented by means of hardware comprising several distinct modules, or by means of a suitably programmed computer or processor. In a device claim enumerating several modules, several of these modules may be embodied by one and the same hardware module. The use of the terms "first", "second", "third", and so on does not denote any order; these terms may be interpreted as names. The terms "connected", "coupled", and the like, when used in this specification, are defined as operatively connected in any desired form, for example mechanically, electronically, digitally, in analog form, directly, indirectly, through software, or through hardware.

Claims (10)

  1. A real-time spoken language evaluation system (100) on a mobile device, comprising:
    an acquisition module (110), configured to collect voice data of speech to be evaluated, the speech to be evaluated containing the voice of at least one character or of a character string;
    a recognition module (130), configured to recognize the voice data collected by the acquisition module (110) as text data;
    a matching module (150), configured to match the text data recognized by the recognition module (130) against text data of voice samples in a voice sample library to obtain a matching result; and
    an evaluation module (170), configured to obtain and output, according to a predefined evaluation strategy and the matching result produced by the matching module (150), a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated.
  2. The system according to claim 1, further comprising: a display module, configured to display the text data of the voice samples in the voice sample library;
    the acquisition module (110) being further configured to collect, as the speech to be evaluated, the voice data input by the user reading the text data of a voice sample displayed by the display module.
  3. The system according to claim 2, further comprising:
    a score comparison module, configured to compare the pronunciation score of the speech to be evaluated output by the evaluation module (170), and/or the pronunciation score of at least one character or character string in the speech to be evaluated, with a predefined pronunciation score threshold; and
    a marking module, configured to mark, when the pronunciation score of the speech to be evaluated is below the predefined pronunciation score threshold, the text data whose pronunciation score is below the threshold in the text data displayed by the display module; and/or,
    when the pronunciation score of a character or character string in the speech to be evaluated is below the predefined pronunciation score threshold, to mark the characters or character strings whose pronunciation scores are below the threshold in the text data displayed by the display module.
  4. The system according to claim 1, wherein the matching module (150) is further configured to perform the matching calculation between the text data recognized by the recognition module (130) and the text data of the voice samples in the voice sample library according to the Levenshtein edit distance algorithm, to obtain the matching result.
  5. The system according to any one of claims 1 to 4, wherein the predefined evaluation strategy is: when the recognized text data matches the text data of a voice sample in the voice sample library, taking the posterior probability of each character or character string in the text data, obtained from the voice data during recognition, as the pronunciation score of that character or character string in the speech to be evaluated; and
    taking the average of the pronunciation scores of all characters or character strings in the speech to be evaluated as the pronunciation score of the speech to be evaluated.
  6. The system according to any one of claims 1 to 4, further comprising:
    a storage module, configured to store the voice sample library, the voice sample library containing at least one voice sample.
  7. A real-time spoken language evaluation method (200) on a terminal device, comprising:
    collecting voice data of speech to be evaluated, the speech to be evaluated containing the voice of at least one character or of a character string (S210);
    recognizing the collected voice data as text data (S230);
    matching the recognized text data against text data of voice samples in a voice sample library to obtain a matching result (S250); and
    obtaining and outputting, according to a predefined evaluation strategy and the matching result, a pronunciation score of at least one character or character string in the speech to be evaluated, and/or a pronunciation score of the speech to be evaluated (S270).
  8. The method according to claim 7, wherein before the step of collecting the voice data of the speech to be evaluated (S210), the method further comprises: displaying the text data of a voice sample in the voice sample library;
    the step of collecting the voice data of the speech to be evaluated (S210) being:
    collecting, as the speech to be evaluated, the voice data input by the user reading the displayed text data of the voice sample.
  9. The method according to claim 8, further comprising:
    comparing the output pronunciation score of the speech to be evaluated, and/or the pronunciation score of at least one character or character string in the speech to be evaluated, with a predefined pronunciation score threshold; and
    when the pronunciation score of the speech to be evaluated is below the predefined pronunciation score threshold, marking the text data whose pronunciation score is below the threshold in the displayed text data; and/or, when the pronunciation score of at least one character or character string in the speech to be evaluated is below the predefined pronunciation score threshold, marking the characters or character strings whose pronunciation scores are below the threshold in the displayed text data.
  10. The method according to any one of claims 7 to 9, wherein the step of matching the recognized text data against the text data of the voice samples in the voice sample library to obtain a matching result is:
    performing the matching calculation between the recognized text data and the text data of the voice samples in the voice sample library according to the Levenshtein edit distance algorithm, to obtain the matching result.
PCT/CN2014/089644 2013-10-30 2014-10-28 Real-time spoken language evaluation system and method on a mobile device WO2015062465A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2016550920A JP6541673B2 (ja) 2013-10-30 2014-10-28 モバイル機器におけるリアルタイム音声評価システム及び方法
US15/033,210 US20160253923A1 (en) 2013-10-30 2014-10-28 Real-time spoken language assessment system and method on mobile devices
EP14859160.5A EP3065119A4 (en) 2013-10-30 2014-10-28 Real-time oral english evaluation system and method on mobile device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310524873.8 2013-10-30
CN201310524873.8A CN104599680B (zh) 2013-10-30 2013-10-30 移动设备上的实时口语评价系统及方法

Publications (1)

Publication Number Publication Date
WO2015062465A1 (zh)

Family

ID=53003339

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/089644 WO2015062465A1 (zh) 2013-10-30 2014-10-28 移动设备上的实时口语评价系统及方法

Country Status (5)

Country Link
US (1) US20160253923A1 (zh)
EP (1) EP3065119A4 (zh)
JP (1) JP6541673B2 (zh)
CN (1) CN104599680B (zh)
WO (1) WO2015062465A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9911410B2 (en) * 2015-08-19 2018-03-06 International Business Machines Corporation Adaptation of speech recognition
CN105513612A (zh) * 2015-12-02 2016-04-20 广东小天才科技有限公司 Audio processing method and device for language vocabulary
JP7028179B2 (ja) * 2016-09-29 2022-03-02 日本電気株式会社 Information processing device, information processing method, and computer program
CN108154735A (zh) * 2016-12-06 2018-06-12 爱天教育科技(北京)有限公司 Spoken English assessment method and device
CN107578778A (zh) * 2017-08-16 2018-01-12 南京高讯信息科技有限公司 A method for spoken language scoring
CN108053839B (zh) * 2017-12-11 2021-12-21 广东小天才科技有限公司 Method for displaying language practice results, and microphone device
CN108831212B (zh) * 2018-06-28 2020-10-23 深圳语易教育科技有限公司 Spoken language teaching auxiliary device and method
CN109493852A (zh) * 2018-12-11 2019-03-19 北京搜狗科技发展有限公司 Evaluation method and device for speech recognition
US11640767B1 (en) * 2019-03-28 2023-05-02 Emily Anna Bridges System and method for vocal training
CN110349583A (zh) * 2019-07-15 2019-10-18 高磊 Game-based education method and system based on speech recognition
CN110634471B (zh) * 2019-09-21 2020-10-02 龙马智芯(珠海横琴)科技有限公司 Voice quality inspection method and device, electronic equipment, and storage medium
CN110797049B (zh) * 2019-10-17 2022-06-07 科大讯飞股份有限公司 Speech evaluation method and related device
CN110827794B (zh) * 2019-12-06 2022-06-07 科大讯飞股份有限公司 Quality evaluation method and device for intermediate speech recognition results
CN111415684B (zh) * 2020-03-18 2023-12-22 歌尔微电子股份有限公司 Voice module testing method and device, and computer-readable storage medium
WO2022003104A1 * 2020-07-01 2022-01-06 Iliescu Alexandru System and method for interactive and handsfree language learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002050803A2 (en) * 2000-12-18 2002-06-27 Digispeech Marketing Ltd. Method of providing language instruction and a language instruction system
US20090087822A1 (en) * 2007-10-02 2009-04-02 Neurolanguage Corporation Computer-based language training work plan creation with specialized english materials
CN101551947A (zh) * 2008-06-11 2009-10-07 俞凯 Computer system for assisting spoken language learning
CN101551952A (zh) * 2009-05-21 2009-10-07 无敌科技(西安)有限公司 Pronunciation evaluation device and method
CN101739869A (zh) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Pronunciation evaluation and diagnosis system based on prior knowledge
CN102800314A (zh) * 2012-07-17 2012-11-28 广东外语外贸大学 English sentence recognition and evaluation system with feedback guidance, and method thereof
CN103065626A (zh) * 2012-12-20 2013-04-24 中国科学院声学研究所 Automatic scoring method and device for read-aloud questions in a spoken English test system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002175095A (ja) * 2000-12-08 2002-06-21 Tdk Corp Pronunciation learning system
JP2006133521A (ja) * 2004-11-05 2006-05-25 Kotoba No Kabe Wo Koete:Kk Language learning machine
US8272874B2 (en) * 2004-11-22 2012-09-25 Bravobrava L.L.C. System and method for assisting language learning
JP2006208644A (ja) * 2005-01-27 2006-08-10 Toppan Printing Co Ltd Server system and method for measuring language conversation ability
JP4165898B2 (ja) * 2005-06-15 2008-10-15 学校法人早稲田大学 Sentence evaluation device and sentence evaluation program
JP2007148170A (ja) * 2005-11-29 2007-06-14 Cai Media Kyodo Kaihatsu:Kk Foreign language learning support system
GB2458461A (en) * 2008-03-17 2009-09-23 Kai Yu Spoken language learning system
CN101246685B (zh) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method in a computer-assisted language learning system
JP2010282058A (ja) * 2009-06-05 2010-12-16 Tokyobay Communication Co Ltd Foreign language learning assistance method and device
US9361908B2 (en) * 2011-07-28 2016-06-07 Educational Testing Service Computer-implemented systems and methods for scoring concatenated speech responses
CA2923003C (en) * 2012-09-06 2021-09-07 Rosetta Stone Ltd. A method and system for reading fluency training

Also Published As

Publication number Publication date
CN104599680B (zh) 2019-11-26
EP3065119A4 (en) 2017-04-19
JP2016536652A (ja) 2016-11-24
US20160253923A1 (en) 2016-09-01
JP6541673B2 (ja) 2019-07-10
CN104599680A (zh) 2015-05-06
EP3065119A1 (en) 2016-09-07

Similar Documents

Publication Publication Date Title
WO2015062465A1 (zh) Real-time spoken language evaluation system and method on a mobile device
CN107680582B (zh) Acoustic model training method, speech recognition method, apparatus, device, and medium
CN107195295B (zh) Speech recognition method and device based on a mixed Chinese-English dictionary
CN105632486B (zh) Voice wake-up method and device for smart hardware
WO2020024690A1 (zh) Speech annotation method, apparatus, and device
CN107016994B (zh) Speech recognition method and device
WO2020224119A1 (zh) Audio corpus screening method and device for speech recognition, and computer equipment
TWI532035B (zh) Method for building a language model, speech recognition method, and electronic device
CN111128223B (zh) Text-information-assisted speaker separation method and related device
TWI391915B (zh) Device and method for building a speech variation model, and speech recognition system and method using the device
US10515292B2 (en) Joint acoustic and visual processing
US8972260B2 (en) Speech recognition using multiple language models
WO2015090215A1 (zh) Voice data recognition method, device, and server for distinguishing regional accents
CN111341305B (zh) Audio data annotation method, device, and system
WO2018223796A1 (zh) Speech recognition method, storage medium, and speech recognition device
KR20160119274A (ko) Method and apparatus for determining hotword suitability
WO2021120602A1 (zh) Rhythm point detection method and device, and electronic equipment
CN105551485B (zh) Voice file retrieval method and system
CN109377981B (zh) Phoneme alignment method and device
TW200926140A (en) Method and system of generating and detecting confusion phones of pronunciation
CN109686383A (zh) Voice analysis method, device, and storage medium
CN109448704A (zh) Method and device for constructing a speech decoding graph, server, and storage medium
CN104347071B (zh) Method and system for generating reference answers for a spoken language examination
CN109102800A (zh) Method and device for determining lyric display data
CN106782517A (zh) Speech audio keyword filtering method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14859160; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2016550920; Country of ref document: JP; Kind code of ref document: A)
WWE Wipo information: entry into national phase (Ref document number: 15033210; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
REEP Request for entry into the european phase (Ref document number: 2014859160; Country of ref document: EP)
WWE Wipo information: entry into national phase (Ref document number: 2014859160; Country of ref document: EP)