WO2017054122A1 - Speech recognition system and method, client device and cloud server

Info

Publication number
WO2017054122A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
module
speech
feature
user
Prior art date
Application number
PCT/CN2015/091042
Other languages
French (fr)
Chinese (zh)
Inventor
李强生
Original Assignee
深圳市全圣时代科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市全圣时代科技有限公司
Priority to PCT/CN2015/091042 (WO2017054122A1)
Priority to CN201580031165.8A (CN106537493A)
Publication of WO2017054122A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/08 Speech classification or search
    • G10L15/12 Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Definitions

  • The present invention relates to the field of voice recognition, and in particular to a voice recognition system and method, and to a client device and a cloud server with a voice recognition function.
  • Large Vocabulary Continuous Speech Recognition (LVCSR, "speech recognition" for short) is the process by which a computer recognizes which text corresponds to a piece of speech, based on the linguistic information contained in a person's continuous sound signal.
  • At present, many recognizers use a database method to eliminate or weaken the influence of a dialect background on speech recognizer performance. That is, when a speech recognizer for standard Mandarin already exists and Mandarin with a certain dialect background needs to be recognized, the method is to collect a large first speech database related to that dialect, and then either retrain the acoustic model with an existing acoustic model training method or adapt the acoustic model with an existing speaker adaptation method.
  • The disadvantages of this method are: (1) Collecting databases with a dialect background is a very large workload; given how many dialects Chinese has, collecting them is an enormous undertaking. (2) The method cannot take into account what standard Mandarin and accented Mandarin have in common. It addresses the problem purely by data-driven means, which amounts to rebuilding a speech recognizer from scratch and makes resource sharing and compatibility between speech recognizers for different dialect backgrounds difficult.
  • To solve the above technical problem, the present invention provides a voice recognition system and method, and a client device and a cloud server with a voice recognition function.
  • An embodiment of the present invention provides a voice recognition system that includes at least the following modules:
  • a voice input module, configured to input a user's voice in real time when a real-time call or voice input function is enabled;
  • a feature extraction module, configured to extract a voice feature from the input user voice;
  • a model training module, configured to establish corresponding acoustic and language models according to the voice feature and a preset rule; and
  • an update module, configured to save and update the acoustic and language models into a model database.
  • Another embodiment of the present invention further provides a voice recognition method, including: inputting a user's voice in real time when a real-time call or voice input function is enabled; extracting a voice feature from the input user voice; establishing corresponding acoustic and language models according to the voice feature and a preset rule; and saving and updating the acoustic and language models into a model database in real time.
  • Yet another embodiment of the present invention provides a client device including the above-described voice recognition system.
  • Still another embodiment of the present invention provides a cloud server including a plurality of private cloud master modules corresponding to different users. Each private cloud master module includes: a feature extraction module, configured to extract a voice feature from user voice input by a client device on which a real-time call or voice input function is enabled; a model training module, configured to establish corresponding acoustic and language models according to the voice feature and a preset rule; and an update module, configured to save and update the acoustic and language models into a model database.
  • The voice recognition system and method of the present invention record or save real-time call and recording information in real time and use it as samples for speech model training, so that the model database can be continuously updated according to each user's pronunciation characteristics.
  • As a result, users' individual needs can be met, multiple languages and varieties, such as English or local dialects, can be supported, and the recognition rate is improved.
  • FIG. 1 is a system architecture diagram of a voice recognition system according to a first embodiment of the present invention.
  • FIG. 2 is a functional block diagram of the voice recognition system of FIG. 1.
  • FIG. 3 is a functional block diagram of a voice recognition system according to a second embodiment of the present invention.
  • FIG. 4 is a flowchart of a voice recognition method according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of a voice recognition method according to another embodiment of the present invention.
  • FIG. 6 is a detailed flowchart of step S409 in FIG. 5.
  • FIG. 7 is a flowchart of a voice recognition method according to still another embodiment of the present invention.
  • FIG. 1 is a system architecture diagram of a voice recognition system 100 according to a first embodiment of the present invention.
  • In this embodiment, the voice recognition system 100 is implemented jointly by the client device 200 and the cloud server 300, so that the cloud server 300 can carry out the whole process of the recognition front end, model training, and the recognition back end, and deliver the final voice recognition result to the client device 200.
  • This reduces the data processing load on the client device 200 and makes deployment very convenient; most of the work of subsequent upgrades is also done on the cloud server 300.
  • The voice recognition system 100 includes at least a voice input module 10, a feature extraction module 20, a model training module 30, and an update module 40.
  • The voice input module 10, for example a microphone and its processing circuit, is disposed on the client device 200.
  • The feature extraction module 20, the model training module 30, the update module 40, and the like are integrated in the cloud server 300.
  • The voice input module 10 is configured to input the user's voice in real time when the client device 200 enables the real-time call or voice input function.
  • The client device 200 can be a mobile phone, an in-vehicle device, a computer, a smart home device, a wearable device, and the like.
  • The user's voice can also be saved locally or in the cloud.
  • The feature extraction module 20 is configured to extract a voice feature from the input user voice.
  • The feature extraction module 20 saves the extracted voice features in a first voice database 21 in real time; the first voice database 21 may be a local database or a cloud database.
  • The voice feature refers to feature data of the user's voice.
  • The model training module 30 is configured to establish corresponding acoustic and language models according to the voice feature and a preset rule, so that in a subsequent recognition process the extracted voice features can be matched and compared against the acoustic and language models to obtain the best recognition result.
  • The preset rule is one or more of Dynamic Time Warping (DTW), Hidden Markov Model (HMM) theory, and Vector Quantization (VQ) technology.
  • The model training module 30 periodically extracts the voice features from the first voice database 21 for model training.
  • The model training module 30 can also extract specific voice features from the first voice database 21 in real time for real-time model training, or extract them in fixed quantities (e.g., 100 at a time); the present invention is not limited to these embodiments.
  • The update module 40 is configured to save and update the acoustic and language models into a model database 41 in real time, so that a larger acoustic and language model database 41 can be built up, which improves the recognition rate.
  • The cloud server 300 includes a plurality of private cloud master modules corresponding to different users, and each private cloud master module includes the feature extraction module 20, the model training module 30, the update module 40, and the like.
  • The specific voice features extracted by the feature extraction module 20 are saved under the corresponding private cloud master module.
  • The model training module 30 performs acoustic and language model training on the specific voice features and updates the models through the update module 40.
  • The user can enable the voice recognition function by means of account authentication.
  • The voice recognition system 100 can also be integrated in a client device 200, such as an in-vehicle device, a computer, a mobile phone, a smart home device, or a wearable device, so that the user can use offline voice recognition.
  • In this case, the first voice database 21 and the model database 41 are both local databases, so the above voice recognition function can be implemented without a network connection.
  • In the prior art, calls on a mobile phone or recordings made with a pad (or other device) are usually not recorded or saved in real time as samples for speech model training.
  • The present invention continuously records and stores real-time call and recording information as samples for speech model training, so that the model database 41 can be continuously updated according to each user's pronunciation characteristics. As a result, users' individual needs can be met, multiple languages and varieties, such as English or local dialects, can be supported, and the recognition rate is improved.
  • The present invention also provides a private cloud master module for each user, and the user enables the voice recognition function by means of account authentication, which improves the privacy of the user's voice information.
  • The voice recognition system 100a of the second embodiment is substantially the same as the voice recognition system 100 of the first embodiment, except that the voice recognition system 100a further includes a recognition module 50.
  • The recognition module 50 is configured to determine whether the voice feature can be recognized according to the acoustic and language models in the model database 41a. If it can, the module generates a recognition result carrying a control command; otherwise, it stores the unrecognizable voice features in the first voice database 21a. The first voice database 21a then only needs to hold the voice features that were not recognized, which saves space.
  • The model training module 30 further includes a manual labeling unit 31, configured to map, according to a user command, each unrecognizable voice feature whose matching degree is lower than the threshold to a preset standard voice, and to update the voice features, the standard voice data, and their mapping relationship in a second voice database 33 for use by the recognition module 50.
  • The recognition module 50 is further configured to recognize the voice data according to the currently input user voice data and the second voice database 33, and to output the recognition result.
  • The recognition module 50 includes a first decoding unit 51 and a second decoding unit 52. The first decoding unit 51 is configured to calculate the matching degree between the currently extracted voice feature and the acoustic and language models. If the matching degree is greater than or equal to the threshold, the voice feature is judged recognizable and the recognition result is output; otherwise, the voice feature is judged unrecognizable.
  • The second decoding unit 52 is configured to recognize the user's voice according to the currently input user voice and the second voice database 33, and to output the corresponding standard voice.
  • The manual labeling unit 31 includes a prompting subunit 311, a selection subunit 313, an input subunit 315, and a confirmation subunit 317.
  • The prompting subunit 311 is configured to periodically prompt the user to review the unrecognizable voice features stored in the first voice database 21.
  • The selection subunit 313 is configured to allow the user to select a standard voice corresponding to the unrecognizable voice feature, where the standard voice is pre-stored in the first voice database 21. For example, the user can listen to the unrecognized speech and then select, from the standard voices provided, the one that matches the voice feature.
  • The input subunit 315 is configured to allow the user to input a standard voice corresponding to the unrecognizable voice feature.
  • The confirmation subunit 317 is configured to allow the user to confirm the mapping relationship between the voice feature and the standard voice, and to store the mapping relationship in the second voice database 33 once it is confirmed.
  • The feature extraction module 20, the model training module 30, the update module 40, and the recognition module 50 are integrated in the cloud server 300a, and the recognition module 50 recognizes the voice data under each private cloud master module separately.
  • The voice recognition system 100a provided by the second embodiment retrains models only on unrecognizable voice data, which reduces data redundancy and improves recognition speed and efficiency.
  • The voice recognition system 100a may further include an execution module 60, configured to generate text in a specific format or play the corresponding standard voice according to the recognition result, and to control the corresponding client device according to the control command.
  • The voice recognition system 100a may further include a download module 70, which lets the user download the updated acoustic and language models in the corresponding private cloud master module to the local device, so that voice recognition can be performed locally.
  • While recognizing the voice features, the recognition module 50 may also store all the voice features in the first voice database 21, so that the model training module 30 can extract them from the first voice database 21 at timed intervals to perform model training.
  • An embodiment of the present invention provides a voice recognition method that includes the following steps:
  • Step S401: input the user's voice in real time when a real-time call or voice input function is enabled.
  • The real-time call or voice input function is provided by a mobile phone, an in-vehicle device, a computer, a smart home device, a wearable device, and the like.
  • The user's voice can also be saved in real time for later retrieval.
  • Step S403: extract a voice feature from the input user voice.
  • The extracted voice features are saved in a first voice database 21 in real time.
  • The first voice database 21 may be a local database or a cloud database.
  • The voice feature refers to feature data of the user's voice.
  • Step S405: establish corresponding acoustic and language models according to the voice features and a preset rule, so that in a subsequent recognition process the extracted voice features can be matched and compared against the acoustic and language models to obtain the best recognition result.
  • Step S407: save and update the acoustic and language models into a model database 41 in real time, so that a larger acoustic and language model database 41 can be built up and the recognition rate is improved.
  • Step S401 is performed on the client device, for example by using a microphone and its processing circuit for voice input.
  • Steps S403, S405, and S407 are performed in the cloud server 300.
  • The cloud server further includes multiple private cloud master accounts corresponding to different users, and steps S403 to S407 may be performed separately under each private cloud master account. The user can enable the voice recognition function by means of account authentication.
  • Alternatively, steps S401 to S407 can all be performed on the client device 200, in which case the first voice database 21 and the model database 41 are local databases.
  • The voice recognition method further includes:
  • Step S409: determine, according to the acoustic and language models in the model database 41, whether the voice feature can be recognized. If it can, step S411 is performed to generate a recognition result carrying a control command. Otherwise, step S413 is performed, and the unrecognizable voice features are stored in the first voice database 21.
  • Step S409 includes the following sub-steps:
  • Sub-step S409a: calculate the matching degree between the voice feature and the acoustic and language models. If the matching degree is greater than or equal to the threshold, perform sub-step S409b, in which the voice feature is judged recognizable and the recognition result is output; otherwise, perform sub-step S409c, in which the voice feature is judged unrecognizable.
  • Sub-step S409d: according to a user command, manually map each unrecognizable voice feature whose matching degree is lower than the threshold to a preset standard voice, and update the voice feature, the standard voice data, and their mapping relationship in a second voice database 33.
  • The first voice database 21 stores only the voice features that were not recognized, so the voice recognition system 100 only needs to perform model training on unrecognizable voice data, which reduces data redundancy and improves recognition speed and efficiency.
  • The method further includes:
  • Step S415: generate text in a specific format or play the corresponding standard voice according to the recognition result, and control the corresponding client device according to the control command.
  • Step S417: download the updated acoustic and language models in the corresponding private cloud master module to the local device, so that voice recognition can be performed locally.
  • While the voice features are being recognized, all of the voice features may also be stored in the first voice database 21, and the voice features may be extracted from the first voice database 21 in a timed, real-time, or quantitative manner to perform model training.
  • The voice recognition system and method of the present invention record or save real-time call and recording information in real time and use it as samples for speech model training, so that the model database 41 can be continuously updated according to each user's pronunciation characteristics.
  • As a result, users' individual needs can be met, multiple languages and varieties, such as English or local dialects, can be supported, and the recognition rate is improved.
  • The present invention also provides a private cloud master module (account) for each user, and the user enables the voice recognition function by means of account authentication, which improves the privacy of the user's voice information.
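
The threshold decision described for the first decoding unit 51 (sub-steps S409a to S409c) and the manual mapping of sub-step S409d can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that voice features are plain sequences of numbers, that the matching degree is derived from a Dynamic Time Warping distance (DTW is one of the techniques the text names, but the formula `1 / (1 + distance)` is an assumption made here), and that both databases are in-memory Python containers. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def matching_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed mapping from distance to a score in [0, 1]; 1.0 means a perfect match."""
    return 1.0 / (1.0 + dtw_distance(a, b))


def recognize(
    feature: Sequence[float],
    models: Dict[str, Sequence[float]],
    threshold: float,
    first_db: List[List[float]],
) -> Optional[str]:
    """Return the best label if its score reaches the threshold (S409b);
    otherwise store the feature as unrecognized and return None (S409c, S413)."""
    best_label, best_score = None, 0.0
    for label, template in models.items():
        score = matching_degree(feature, template)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label
    first_db.append(list(feature))
    return None


def label_manually(
    feature: Sequence[float],
    standard_voice: str,
    second_db: Dict[Tuple[float, ...], str],
) -> None:
    """Record a user-confirmed mapping from a feature to a standard voice (S409d)."""
    second_db[tuple(feature)] = standard_voice
```

With templates such as `{"open": [1, 2, 3], "close": [3, 2, 1]}`, the input `[1, 1, 2, 3]` warps onto "open" with zero distance and is recognized, while an input far from every template is stored for later manual labeling. A real system would compare multi-dimensional feature frames against trained acoustic and language models rather than single templates.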

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed is a speech recognition system, at least comprising: a speech input module configured to input speech of a user in real time upon activation of a real-time call or speech entry function; a feature extraction module configured to extract a speech feature from the inputted speech of the user; a model training module configured to establish, according to the speech feature and a preset rule, a corresponding acoustic and language model; and an updating module configured to save and update the acoustic and language model in a model database. Also provided are a speech recognition method, a client device and a cloud server.

Description

Speech recognition system and method, client device and cloud server
Technical field
The present invention relates to the field of voice recognition, and in particular to a voice recognition system and method, and to a client device and a cloud server with a voice recognition function.
Background
"Large Vocabulary Continuous Speech Recognition" (LVCSR, "speech recognition" for short) is the process by which a computer recognizes which text corresponds to a piece of speech, based on the linguistic information contained in a person's continuous sound signal.
Large-vocabulary continuous Chinese speech recognizers have made great progress; for standard Mandarin, recognizer accuracy can exceed 95%. However, dialects are the main problem facing Chinese speech recognition. Because most people in China speak Mandarin with some dialect background, the performance of most speech recognizers drops sharply in such cases, to the point of being unusable.
Devices and software such as Apple's Siri and China's iFLYTEK (科大讯飞) currently provide voice input, but speech recognition is affected by each user's individual pronunciation, which substantially lowers recognition accuracy and in turn limits the usefulness of the speech recognition function. In addition, the built-in voice control functions of many non-smart client devices, such as voice operation in cars and voice control of Bluetooth headsets, doorbells, and similar devices, are likewise limited by the recognition rate of voice input.
At present, many recognizers use a database method to eliminate or weaken the influence of a dialect background on speech recognizer performance. That is, when a speech recognizer for standard Mandarin already exists and Mandarin with a certain dialect background needs to be recognized, the method is to collect a large first speech database related to that dialect, and then either retrain the acoustic model with an existing acoustic model training method or adapt the acoustic model with an existing speaker adaptation method. The disadvantages of this method are: (1) Collecting databases with a dialect background is a very large workload; given how many dialects Chinese has, collecting them is an enormous undertaking. (2) The method cannot take into account what standard Mandarin and accented Mandarin have in common. It addresses the problem purely by data-driven means, which amounts to rebuilding a speech recognizer from scratch and makes resource sharing and compatibility between speech recognizers for different dialect backgrounds difficult.
Summary of the invention
To solve the above technical problem, the present invention provides a voice recognition system and method, and a client device and a cloud server with a voice recognition function.
An embodiment of the present invention provides a voice recognition system, including at least: a voice input module, configured to input a user's voice in real time when a real-time call or voice input function is enabled; a feature extraction module, configured to extract a voice feature from the input user voice; a model training module, configured to establish corresponding acoustic and language models according to the voice feature and a preset rule; and an update module, configured to save and update the acoustic and language models into a model database.
Another embodiment of the present invention further provides a voice recognition method, including: inputting a user's voice in real time when a real-time call or voice input function is enabled; extracting a voice feature from the input user voice; establishing corresponding acoustic and language models according to the voice feature and a preset rule; and saving and updating the acoustic and language models into a model database in real time.
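
The four stages named in these embodiments (voice input, feature extraction, model training, and saving or updating the model database) can be pictured as a simple pipeline. The following Python sketch is only an illustration under stated assumptions: the per-frame energy "feature", the averaged "model", the version counter, and every name in it (`extract_features`, `train_model`, `ModelDatabase`, `process_utterance`) are hypothetical stand-ins and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

Feature = List[float]


def extract_features(samples: Sequence[float], frame_size: int = 4) -> List[Feature]:
    """Toy feature extraction: one mean-energy value per fixed-size frame."""
    features = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(x * x for x in frame) / len(frame)
        features.append([energy])
    return features


def train_model(features: List[Feature]) -> Feature:
    """Toy 'model': the per-dimension mean of all feature vectors."""
    dims = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dims)]


@dataclass
class ModelDatabase:
    """In-memory stand-in for the model database."""
    models: Dict[str, Feature] = field(default_factory=dict)
    versions: Dict[str, int] = field(default_factory=dict)

    def save_or_update(self, key: str, model: Feature) -> int:
        """Store the model under `key` and return its new version number."""
        self.models[key] = model
        self.versions[key] = self.versions.get(key, 0) + 1
        return self.versions[key]


def process_utterance(samples: Sequence[float], db: ModelDatabase, key: str) -> int:
    """Run one utterance through extraction, training, and database update."""
    features = extract_features(samples)
    if not features:
        raise ValueError("no audio samples to process")
    return db.save_or_update(key, train_model(features))
```

Calling `process_utterance` once per captured utterance mirrors the idea of updating the model database continuously as the user speaks. A production system would replace the toy feature and model with real acoustic features and trained acoustic and language models.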
Yet another embodiment of the present invention provides a client device that includes the voice recognition system described above.
Still another embodiment of the present invention provides a cloud server that includes a plurality of private cloud master modules corresponding to different users. Each private cloud master module includes: a feature extraction module, configured to extract a voice feature from user voice input by a client device on which a real-time call or voice input function is enabled; a model training module, configured to establish corresponding acoustic and language models according to the voice feature and a preset rule; and an update module, configured to save and update the acoustic and language models into a model database.
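
One way to picture the per-user private cloud master modules is a server object that keeps a separate module, with its own features and model database, for each user. The Python sketch below is a hypothetical illustration only: the class names, the `module_for` lookup, and the averaged "acoustic" model are assumptions, and feature extraction is omitted so that the example can focus on isolation between users.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PrivateCloudModule:
    """Per-user state: the user's stored features and the user's own model database."""
    features: List[List[float]] = field(default_factory=list)
    model_db: Dict[str, List[float]] = field(default_factory=dict)

    def ingest(self, feature_vectors: List[List[float]]) -> None:
        """Store new features, then retrain and update this user's toy model."""
        self.features.extend(feature_vectors)
        if not self.features:
            return
        dims = len(self.features[0])
        count = len(self.features)
        self.model_db["acoustic"] = [
            sum(f[d] for f in self.features) / count for d in range(dims)
        ]


class CloudServer:
    """Holds one private module per user so that models never mix across users."""

    def __init__(self) -> None:
        self._modules: Dict[str, PrivateCloudModule] = {}

    def module_for(self, user_id: str) -> PrivateCloudModule:
        """Return the user's module, creating it on first use."""
        return self._modules.setdefault(user_id, PrivateCloudModule())
```

Because each user's features and models live in a separate object, training on one user's speech cannot change another user's models. An account authentication step would normally sit in front of `module_for`; it is left out of this sketch.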
The voice recognition system and method of the present invention record or save real-time call and recording information in real time and use it as samples for speech model training, so that the model database can be continuously updated according to each user's pronunciation characteristics. As a result, users' individual needs can be met, multiple languages and varieties, such as English or local dialects, can be supported, and the recognition rate is improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a system architecture diagram of a voice recognition system according to a first embodiment of the present invention.
FIG. 2 is a functional block diagram of the voice recognition system of FIG. 1.
FIG. 3 is a functional block diagram of a voice recognition system according to a second embodiment of the present invention.
FIG. 4 is a flowchart of a voice recognition method according to an embodiment of the present invention.
FIG. 5 is a flowchart of a voice recognition method according to another embodiment of the present invention.
FIG. 6 is a detailed flowchart of step S409 in FIG. 5.
FIG. 7 is a flowchart of a voice recognition method according to still another embodiment of the present invention.
Detailed description
The technical solutions of the present invention are described in further detail below with reference to the drawings and specific embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
First embodiment
FIG. 1 is a system architecture diagram of a speech recognition system 100 according to the first embodiment of the present invention. In this embodiment, the speech recognition system 100 is implemented jointly by a client device 200 and a cloud server 300, so that the cloud server 300 carries out the whole process of front-end recognition processing, model training, and back-end recognition, and delivers the final recognition result to the client device 200. This reduces the data processing load on the client device 200 and makes deployment very convenient; most of the work of subsequent upgrades is also done on the cloud server 300.
Specifically, referring to FIG. 2, the speech recognition system 100 includes at least a voice input module 10, a feature extraction module 20, a model training module 30, and an update module 40. In this embodiment, the voice input module 10 is provided on the client device 200 and is, for example, a microphone and its processing circuit. The feature extraction module 20, the model training module 30, the update module 40, and so on are integrated in the cloud server 300.
The voice input module 10 is configured to input the user's speech in real time when the client device 200 enables the real-time call or voice input function. The client device 200 may be a mobile phone, an in-vehicle device, a computer, a smart home device, a wearable device, or the like. The user's speech may also be saved locally or in the cloud.
The feature extraction module 20 is configured to extract speech features from the input user speech. In this embodiment, the feature extraction module 20 saves the extracted speech features in real time in a first speech database 21, which may be a local database or a cloud database. The speech features are feature data of the user's speech.
The model training module 30 is configured to build corresponding acoustic and language models according to the speech features and preset rules, so that in subsequent recognition the extracted speech features are matched and compared against the acoustic and language models to obtain the best recognition result. In this embodiment, the preset rules are at least one of dynamic time warping (DTW), hidden Markov model (HMM) theory, and vector quantization (VQ). In addition, in this embodiment, the model training module 30 extracts the speech features from the first speech database 21 at regular intervals for model training. In other embodiments, the model training module 30 may instead extract specific speech features from the first speech database 21 in real time for real-time model training, or extract them in fixed quantities (for example, 100 entries at a time); the present invention is not limited to these embodiments.
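Of the three preset rules named above, dynamic time warping is the simplest to illustrate. The following is a minimal sketch of the textbook DTW distance between two one-dimensional feature sequences; it is not taken from the patent, which does not say how DTW is applied. The function name `dtw_distance` and the use of absolute difference as the local cost are assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Return the dynamic-time-warping distance between two 1-D sequences.

    Hypothetical helper: the local cost is the absolute difference between
    two samples, and the step pattern allows insertion, deletion, and match.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Because the alignment may stretch either sequence, an utterance spoken more slowly than its template can still score a small distance, which is why DTW suits template matching of speech with variable speaking rate.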
The update module 40 is configured to save and update the acoustic and language models in a model database 41 in real time, so that a larger acoustic and language model database 41 is obtained and recognition accuracy is improved.
In addition, to keep the user's speech information confidential and to provide personalized model training for each user's speech characteristics, the cloud server 300 includes a plurality of private cloud master modules corresponding to different users, each of which includes the feature extraction module 20, the model training module 30, the update module 40, and so on. The specific speech features extracted by the feature extraction module 20 are saved under the corresponding private cloud module. The model training module 30 trains acoustic and language models on these specific speech features, and the update module 40 updates the models. When a user activates the speech recognition system 100, the speech recognition function may be enabled by account authentication.
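The per-user isolation described above can be pictured as one container per account, reachable only after authentication. The sketch below is an assumption-laden illustration, not the patent's design: the class names `PrivateCloudModule` and `CloudServer`, the password-based check, and the PBKDF2 parameters are all hypothetical, since the patent says only that the function is enabled "by account authentication".

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field


@dataclass
class PrivateCloudModule:
    """Hypothetical per-user container; nothing in it is shared across users."""
    password_hash: bytes
    salt: bytes
    speech_features: list = field(default_factory=list)  # stands in for database 21
    models: dict = field(default_factory=dict)           # stands in for database 41


class CloudServer:
    """Hypothetical registry mapping each account to its private module."""

    def __init__(self):
        self._modules = {}

    @staticmethod
    def _hash(password, salt):
        # Standard-library key derivation; the iteration count is illustrative.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def register(self, user, password):
        salt = os.urandom(16)
        self._modules[user] = PrivateCloudModule(self._hash(password, salt), salt)

    def authenticate(self, user, password):
        """Return the caller's own module, or raise PermissionError."""
        module = self._modules.get(user)
        if module is None:
            raise PermissionError("authentication failed")
        candidate = self._hash(password, module.salt)
        if not hmac.compare_digest(module.password_hash, candidate):
            raise PermissionError("authentication failed")
        return module
```

Feature extraction, training, and updating would then operate only on the module returned by `authenticate`, so one user's features and models never reach another user's recognition path.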
It can be understood that, in other embodiments, the speech recognition system 100 may be integrated entirely in a client device 200, such as an in-vehicle device, a computer, a mobile phone, a smart home device, or a wearable device, so that the user can use offline speech recognition. In that case, the first speech database 21 and the model database 41 are both local databases. In this way, the speech recognition function described above can be provided without a network connection.
In summary, conventional speech recognition technology usually does not record or save the speech from real-time mobile phone calls, or from recordings made on a pad (or another device), as samples for speech model training. The present invention, by contrast, records or saves real-time calls and recordings in real time and uses them as samples for speech model training, so that the model database 41 can be continuously updated according to each user's pronunciation characteristics. This meets users' individual needs, supports multiple kinds of speech, such as English or local dialects, and improves recognition accuracy. In addition, the present invention provides private cloud master modules for different users, through which a user enables the speech recognition function by account authentication, which improves the confidentiality of the user's speech information.
Second embodiment
Referring to FIG. 3, the speech recognition system 100a according to the second embodiment of the present invention is substantially the same as the speech recognition system 100 of the first embodiment, except that the speech recognition system 100a further includes a recognition module 50. The recognition module 50 is configured to determine, according to the acoustic and language models in the model database 41a, whether the speech features can be recognized. If they can, it generates a recognition result carrying a control command; otherwise, it stores the unrecognized speech features in the first speech database 21a. In this case, the first speech database 21a only needs to store the unrecognized speech features, which saves storage space. The model training module 30 further includes a manual labeling unit 31, configured to manually map, according to a user command, the unrecognized speech features whose matching degree is below a threshold to preset standard speech, and to update the speech features, the standard speech data, and their mapping in a second speech database 33 for use by the recognition module 50. Correspondingly, the recognition module 50 is further configured to recognize the currently input user speech data according to the second speech database 33 and to output a recognition result.
More specifically, the recognition module 50 includes a first decoding unit 51 and a second decoding unit 52. The first decoding unit 51 is configured to compute the matching degree between the currently extracted speech features and the acoustic and language models. If the matching degree is greater than or equal to a threshold, it determines that the corresponding speech features can be recognized and outputs the recognition result; otherwise, it determines that the speech features cannot be recognized. The second decoding unit 52 is configured to recognize the user's speech according to the currently input user speech and the second speech database 33, and to output the corresponding standard speech.
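The threshold test performed by the first decoding unit can be sketched as follows. The patent does not define how the matching degree is computed, so `matching_degree` below uses an inverse Euclidean distance purely as a stand-in; the names `first_decode` and `recognize`, the dictionary of templates, and the threshold value 0.8 are likewise hypothetical.

```python
from math import sqrt


def matching_degree(feature, template):
    """Toy similarity in (0, 1]; 1.0 means the two vectors are identical."""
    dist = sqrt(sum((x - y) ** 2 for x, y in zip(feature, template)))
    return 1.0 / (1.0 + dist)


def first_decode(feature, models, threshold=0.8):
    """Return (label, score) if the best match reaches the threshold,
    otherwise (None, score)."""
    best_label, best_score = None, 0.0
    for label, template in models.items():
        score = matching_degree(feature, template)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


def recognize(feature, models, unrecognized_db, threshold=0.8):
    """Recognize a feature, or store it for later labeling and retraining."""
    label, _ = first_decode(feature, models, threshold)
    if label is None:
        unrecognized_db.append(feature)
    return label
```

Only features that fall below the threshold are appended to `unrecognized_db`, which mirrors the second embodiment's point that the first speech database need hold only what could not be recognized.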
In this embodiment, the manual labeling unit 31 includes a prompting subunit 311, a selection subunit 313, an input subunit 315, and a confirmation subunit 317. The prompting subunit 311 is configured to periodically prompt the user to review the unrecognized speech features stored in the first speech database 21. The selection subunit 313 is configured to let the user select the standard speech corresponding to an unrecognized speech feature, where the standard speech is stored in advance in the first speech database 21. For example, the user may listen to the particular unrecognized speech and then select, from the standard speech offered, the one that matches the speech feature. The input subunit 315 is configured to let the user input the standard speech corresponding to the unrecognized speech feature. It can be understood that only one of the selection subunit 313 and the input subunit 315 may be provided. When the standard speech offers no corresponding option, the corresponding standard speech can be determined by voice input. The confirmation subunit 317 is configured to let the user confirm the mapping between the speech feature and the standard speech and, once confirmation is complete, to store the mapping in the second speech database 33.
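The four subunits form a small prompt, choose, confirm workflow. The sketch below models it with one method per subunit; the class name, the method names, and the choice to key the second speech database by the feature tuple are assumptions for illustration, and a real system would match features approximately rather than by exact lookup.

```python
class ManualLabelingUnit:
    """Hypothetical model of the prompting, selection, input, and
    confirmation subunits working on in-memory stand-ins for the databases."""

    def __init__(self, unrecognized, standard_speech, second_db):
        self.unrecognized = unrecognized        # unrecognized speech features
        self.standard_speech = standard_speech  # preset standard utterances
        self.second_db = second_db              # feature tuple -> standard speech

    def prompt(self):
        """Prompting subunit: list the features that still have no mapping."""
        return [f for f in self.unrecognized if tuple(f) not in self.second_db]

    def select(self, index):
        """Selection subunit: pick one of the preset standard utterances."""
        return self.standard_speech[index]

    def enter(self, text):
        """Input subunit: accept a standard utterance the presets do not offer."""
        return text.strip()

    def confirm(self, feature, standard):
        """Confirmation subunit: persist the confirmed mapping."""
        self.second_db[tuple(feature)] = standard
```

A second decoding pass could then consult `second_db` for features the first pass rejected, returning the mapped standard speech when a mapping exists.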
In the second embodiment, the feature extraction module 20, the model training module 30, the update module 40, the recognition module 50, and so on are integrated in the cloud server 300a, and the recognition module 50 recognizes the speech data under each cloud module separately.
The speech recognition system 100a of the second embodiment retrains models only on the unrecognized speech data, which reduces data redundancy and improves recognition speed and efficiency.
In addition, the speech recognition system 100a (or 100) may further include an execution module 60, configured to generate text in a specific format or play the corresponding standard speech according to the recognition result, and to control the corresponding client device according to the control command. To allow the speech recognition system 100a to run on different client devices 200, it may further include a download module 70, which lets the user download the updated acoustic and language models in the corresponding private cloud module to the local device so that speech recognition can be performed locally.
It can be understood that, in other embodiments, the recognition module 50 may, while recognizing the speech features, also store all of the speech features in the first speech database 21, so that the model training module 30 can extract them from the first speech database 21 at regular intervals for model training.
Referring to FIG. 4, an embodiment of the present invention provides a speech recognition method comprising the following steps:
Step S401: when the real-time call or voice input function is enabled, input the user's speech in real time. Specifically, the real-time call or voice input function is provided by a mobile phone, an in-vehicle device, a computer, a smart home device, a wearable device, or the like. The user's speech may also be saved in real time for later use.
Step S403: extract speech features from the input user speech. In this embodiment, the extracted speech features are saved in real time in a first speech database 21. The first speech database 21 may be a local database or a cloud database, and the speech features are feature data of the user's speech.
Step S405: build corresponding acoustic and language models according to the speech features and preset rules, so that in subsequent recognition the extracted speech features are matched and compared against the acoustic and language models to obtain the best recognition result.
Step S407: save and update the acoustic and language models in a model database 41 in real time, so that a larger acoustic and language model database 41 is obtained and recognition accuracy is improved.
In this embodiment, step S401 is performed on the client device, for example by voice input through a microphone and its processing circuit. Steps S403, S405, and S407 are performed in the cloud server 300. To keep the user's speech information confidential and to provide personalized model training for each user's speech characteristics, the cloud server further includes a plurality of private cloud accounts corresponding to different users, and steps S403 to S407 may be performed separately under each private cloud master account. When a user enables the speech recognition function, this may be done by account authentication.
It can be understood that, in other embodiments, steps S401 to S407 may all be performed on the client device 200, in which case the first speech database 21 and the model database 41 are local databases.
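Steps S403 to S407 can be strung together as a small pipeline that stores features as they arrive and retrains once a fixed number has accumulated, which is the quantity-based trigger the first embodiment mentions. Everything concrete in this sketch is a placeholder: the per-frame mean amplitude stands in for real speech features, the "model" is just a versioned average, and the names `extract_features` and `TrainingPipeline` are hypothetical.

```python
from statistics import mean


def extract_features(samples, frame_size=4):
    """S403 (toy version): mean absolute amplitude of each full frame."""
    return [
        mean(abs(s) for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


class TrainingPipeline:
    """Hypothetical stand-in for the chain S403 -> S405 -> S407."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.first_speech_db = []  # features waiting to be used for training
        self.model_db = {"version": 0, "mean_energy": None}

    def on_speech(self, samples):
        """Store new features; train and update when a full batch is ready.

        Returns True if at least one training pass ran.
        """
        self.first_speech_db.extend(extract_features(samples))
        trained = False
        while len(self.first_speech_db) >= self.batch_size:
            batch = self.first_speech_db[:self.batch_size]
            del self.first_speech_db[:self.batch_size]
            self._train_and_update(batch)
            trained = True
        return trained

    def _train_and_update(self, batch):
        # S405: "train" a placeholder model; S407: replace the stored model.
        self.model_db = {
            "version": self.model_db["version"] + 1,
            "mean_energy": mean(batch),
        }
```

The same class could run on the cloud server or, in the offline variant, on the client device; only the location of the two databases changes.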
Referring to FIG. 5, in yet another embodiment, in addition to steps S401 to S407 above, the speech recognition method further includes:
Step S409: determine, according to the acoustic and language models in the model database 41, whether the speech features can be recognized. If so, perform step S411 and generate a recognition result carrying a control command; otherwise, perform step S413 and store the unrecognized speech features in the first speech database 21.
Specifically, referring to FIG. 6, step S409 includes the following sub-steps:
Sub-step S409a: compute the matching degree between the speech features and the acoustic and language models. If the matching degree is greater than or equal to a threshold, perform sub-step S409b and determine that the corresponding speech features can be recognized and output the recognition result; otherwise, perform sub-step S409c and determine that the speech features cannot be recognized.
Sub-step S409d: according to a user command, manually map the unrecognized speech features whose matching degree is below the threshold to preset standard speech, and update the speech features, the standard speech data, and their mapping in a second speech database 33.
In this case, the first speech database 21 stores only the unrecognized speech features, so the speech recognition system 100 needs to retrain models only on the unrecognized speech data, which reduces data redundancy and improves recognition speed and efficiency.
Referring to FIG. 7, in yet another embodiment, in combination with steps S401 to S413, the method further includes:
Step S415: according to the recognition result, generate text in a specific format or play the corresponding standard speech, and control the corresponding client device according to the control command;
Step S417: download the updated acoustic and language models in the corresponding private cloud module to the local device so that speech recognition can be performed locally.
Furthermore, in other embodiments, while the speech features are being recognized, all of the speech features may also be stored in the first speech database 21, and the speech features may be extracted from the first speech database 21 at regular intervals, in real time, or in fixed quantities for model training.
The speech recognition system and method of the present invention record or save real-time calls and recordings in real time and use them as samples for speech model training, so that the model database 41 can be continuously updated according to each user's pronunciation characteristics. This meets users' individual needs, supports multiple kinds of speech, such as English or local dialects, and improves recognition accuracy. In addition, the present invention provides private cloud master modules (accounts) for different users, through which a user enables the speech recognition function by account authentication, which improves the confidentiality of the user's speech information.
It should be noted that, from the description of the above embodiments, a person skilled in the art can clearly understand that the present invention may be implemented by software together with the necessary hardware platform, or entirely in hardware. On this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art may be embodied as a software product. The computer software product may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in parts of those embodiments.
What is disclosed above is merely preferred embodiments of the present invention and certainly cannot be used to limit the scope of its claims; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (13)

  1. A speech recognition system, characterized in that the system comprises at least:
    a voice input module, configured to input a user's speech in real time when a real-time call or voice input function is enabled;
    a feature extraction module, configured to extract speech features from the input user speech;
    a model training module, configured to build corresponding acoustic and language models according to the speech features and preset rules; and
    an update module, configured to save and update the acoustic and language models in a model database.
  2. The speech recognition system according to claim 1, characterized in that the feature extraction module saves the extracted speech features in real time in a first speech database, and the model training module extracts the speech features from the first speech database at regular intervals or in fixed quantities for model training.
  3. The speech recognition system according to claim 2, characterized in that the feature extraction module, the model training module, and the update module are integrated in a cloud server, the cloud server comprises a plurality of private cloud modules corresponding to different users, the specific speech features extracted by the feature extraction module are saved under the corresponding private cloud module, models are built and updated by the model training module and the update module, and the recognition module recognizes the speech data under each cloud module separately.
  4. The speech recognition system according to claim 1, further comprising:
    a recognition module, configured to determine, according to the acoustic and language models in the model database, whether the speech features can be recognized; if so, to generate a recognition result carrying a control command; otherwise, to store the unrecognized speech features in a first speech database so that the model training module can retrain the models.
  5. The speech recognition system according to claim 4, characterized by comprising at least:
    a first decoding unit, configured to compute the matching degree between the speech features and the acoustic and language models, and, if the matching degree is greater than or equal to a threshold, to determine that the corresponding speech features can be recognized and output the recognition result, and otherwise to determine that the speech features cannot be recognized; and
    wherein the model training module further comprises a manual labeling unit, configured to manually map, according to a user command, the unrecognized speech features whose matching degree is below the threshold to preset standard speech, and to save the speech features, the standard speech data, and their mapping in a second speech database.
  6. The speech recognition system according to claim 5, characterized in that the manual labeling unit comprises:
    a prompting subunit, configured to periodically prompt the user to review the unrecognized speech features stored in the first speech database;
    a selection subunit, configured to let the user select the standard speech corresponding to the unrecognized speech feature, wherein the standard speech is stored in advance in the first speech database; and/or
    an input subunit, configured to let the user input the standard speech corresponding to the unrecognized speech feature; and
    a confirmation subunit, configured to let the user confirm the mapping between the unrecognized speech feature and the standard speech and to store the mapping in the second speech database.
  7. The speech recognition system according to claim 5, characterized in that the recognition module further comprises a second decoding unit, configured to recognize the user's speech according to the currently input user speech and the second speech database, and to output the corresponding standard speech.
  8. The speech recognition system according to claim 4, characterized in that the recognition module, while recognizing the speech features, stores the speech features in the first speech database so that the model training module can extract the speech features from the first speech database for model training.
  9. The speech recognition system according to claim 4, characterized in that the functions of the feature extraction module, the model training module, the update module, and the recognition module are implemented separately by the private cloud modules of a cloud server, wherein each private cloud module corresponds to one user, and the specific speech features extracted by the feature extraction module are saved under the corresponding private cloud module.
  10. The speech recognition system according to claim 1, further comprising:
    a download module, configured to let the user download the acoustic and language models in the corresponding private cloud module to the local device so that speech recognition can be performed locally.
  11. A speech recognition method, comprising:
    inputting a user's speech in real time when a real-time call or voice input function is enabled;
    extracting speech features from the input user speech;
    building corresponding acoustic and language models according to the speech features and preset rules; and
    saving and updating the acoustic and language models in a model database in real time.
  12. A client device comprising the speech recognition system according to any one of claims 1 to 9.
  13. A cloud server comprising a plurality of private cloud master modules corresponding to different users, each private cloud master module comprising:
    a feature extraction module, configured to extract speech features from user speech input by a client device on which a real-time call or voice input function is enabled;
    a model training module, configured to build corresponding acoustic and language models according to the speech features and preset rules; and
    an update module, configured to save and update the acoustic and language models in a model database.
PCT/CN2015/091042 2015-09-29 2015-09-29 Speech recognition system and method, client device and cloud server WO2017054122A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/091042 WO2017054122A1 (en) 2015-09-29 2015-09-29 Speech recognition system and method, client device and cloud server
CN201580031165.8A CN106537493A (en) 2015-09-29 2015-09-29 Speech recognition system and method, client device and cloud server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/091042 WO2017054122A1 (en) 2015-09-29 2015-09-29 Speech recognition system and method, client device and cloud server

Publications (1)

Publication Number Publication Date
WO2017054122A1 true WO2017054122A1 (en) 2017-04-06

Family

ID=58358136

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/091042 WO2017054122A1 (en) 2015-09-29 2015-09-29 Speech recognition system and method, client device and cloud server

Country Status (2)

Country Link
CN (1) CN106537493A (en)
WO (1) WO2017054122A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN108917283A (en) * 2018-07-12 2018-11-30 四川虹美智能科技有限公司 A kind of intelligent refrigerator control method, system, intelligent refrigerator and cloud server
US10388272B1 (en) 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US10573312B1 (en) 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US11017778B1 (en) 2018-12-04 2021-05-25 Sorenson Ip Holdings, Llc Switching between speech recognition systems
CN112908296A (en) * 2021-02-18 2021-06-04 上海工程技术大学 Dialect identification method
US11170761B2 (en) 2018-12-04 2021-11-09 Sorenson Ip Holdings, Llc Training of speech recognition systems
US20220028384A1 (en) * 2018-12-11 2022-01-27 Qingdao Haier Washing Machine Co., Ltd. Voice control method, cloud server and terminal device
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
US11488604B2 (en) 2020-08-19 2022-11-01 Sorenson Ip Holdings, Llc Transcription of audio

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108008843A (en) * 2017-03-25 2018-05-08 深圳雷柏科技股份有限公司 A kind of wireless speech mouse and voice operating system
CN108806691B (en) * 2017-05-04 2020-10-16 有爱科技(深圳)有限公司 Voice recognition method and system
CN106991961A (en) * 2017-06-08 2017-07-28 无锡职业技术学院 A kind of artificial intelligence LED dot matrix display screens control device and its control method
CN107146617A (en) * 2017-06-15 2017-09-08 成都启英泰伦科技有限公司 A kind of novel voice identification equipment and method
CN109102801A (en) * 2017-06-20 2018-12-28 京东方科技集团股份有限公司 Audio recognition method and speech recognition equipment
CN107180629B (en) * 2017-06-28 2020-04-28 长春煌道吉科技发展有限公司 Voice acquisition and recognition method and system
CN107342076B (en) * 2017-07-11 2020-09-22 华南理工大学 Intelligent home control system and method compatible with abnormal voice
CN107731231B (en) * 2017-09-15 2020-08-14 瑞芯微电子股份有限公司 Method for supporting multi-cloud-end voice service and storage device
CN108717851B (en) * 2018-03-28 2021-04-06 深圳市三诺数字科技有限公司 Voice recognition method and device
CN108520751A (en) * 2018-03-30 2018-09-11 四川斐讯信息技术有限公司 A kind of speech-sound intelligent identification equipment and speech-sound intelligent recognition methods
CN108597500A (en) * 2018-03-30 2018-09-28 四川斐讯信息技术有限公司 A kind of intelligent wearable device and the audio recognition method based on intelligent wearable device
CN108682416B (en) * 2018-04-11 2021-01-01 深圳市卓翼科技股份有限公司 Local adaptive speech training method and system
CN108766441B (en) * 2018-05-29 2020-11-10 广东声将军科技有限公司 Voice control method and device based on offline voiceprint recognition and voice recognition
CN110609880A (en) * 2018-06-15 2019-12-24 北京搜狗科技发展有限公司 Information query method and device and electronic equipment
CN109036387A (en) * 2018-07-16 2018-12-18 中央民族大学 Video speech recognition methods and system
CN108877410A (en) * 2018-08-07 2018-11-23 深圳市漫牛医疗有限公司 A kind of deaf-mute's sign language exchange method and deaf-mute's sign language interactive device
CN109065076B (en) * 2018-09-05 2020-11-27 深圳追一科技有限公司 Audio label setting method, device, equipment and storage medium
CN108986792B (en) * 2018-09-11 2021-02-12 苏州思必驰信息科技有限公司 Training and scheduling method and system for voice recognition model of voice conversation platform
CN109493650A (en) * 2018-12-05 2019-03-19 安徽智训机器人技术有限公司 A kind of language teaching system and method based on artificial intelligence
CN110033765A (en) * 2019-04-11 2019-07-19 中国联合网络通信集团有限公司 A kind of method and terminal of speech recognition
CN110047467B (en) * 2019-05-08 2021-09-03 广州小鹏汽车科技有限公司 Voice recognition method, device, storage medium and control terminal
CN110211609A (en) * 2019-06-03 2019-09-06 四川长虹电器股份有限公司 A method of promoting speech recognition accuracy
CN110415678A (en) * 2019-06-13 2019-11-05 百度时代网络技术(北京)有限公司 Customized voice broadcast client, server, system and method
CN110517664B (en) * 2019-09-10 2022-08-05 科大讯飞股份有限公司 Multi-party identification method, device, equipment and readable storage medium
CN113066482A (en) * 2019-12-13 2021-07-02 阿里巴巴集团控股有限公司 Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium
CN111292746A (en) * 2020-02-07 2020-06-16 普强时代(珠海横琴)信息技术有限公司 Voice input conversion system based on human-computer interaction
CN113938556B (en) * 2020-07-14 2023-03-10 华为技术有限公司 Incoming call prompting method and device and electronic equipment
CN112002326A (en) * 2020-10-28 2020-11-27 深圳市一恒科电子科技有限公司 Interaction method and robot equipment
CN112634867B (en) * 2020-12-11 2024-10-15 平安科技(深圳)有限公司 Model training method, dialect recognition method, device, server and storage medium
CN113593525B (en) * 2021-01-26 2024-08-06 腾讯科技(深圳)有限公司 Accent classification model training and accent classification method, apparatus and storage medium
CN116030790A (en) * 2021-10-22 2023-04-28 华为技术有限公司 Distributed voice control method and electronic equipment
CN113707135B (en) * 2021-10-27 2021-12-31 成都启英泰伦科技有限公司 Acoustic model training method for high-precision continuous speech recognition
CN116597827A (en) * 2023-05-23 2023-08-15 苏州科帕特信息科技有限公司 Target language model determining method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079885A (en) * 2007-06-26 2007-11-28 中兴通讯股份有限公司 A system and method for providing automatic voice identification integrated development platform
US20100312555A1 (en) * 2009-06-09 2010-12-09 Microsoft Corporation Local and remote aggregation of feedback data for speech recognition
CN102543073A (en) * 2010-12-10 2012-07-04 上海上大海润信息系统有限公司 Shanghai dialect phonetic recognition information processing method
CN104239456A (en) * 2014-09-02 2014-12-24 百度在线网络技术(北京)有限公司 User characteristic data extraction method and user characteristic data extraction device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5320064B2 (en) * 2005-08-09 2013-10-23 モバイル・ヴォイス・コントロール・エルエルシー Voice-controlled wireless communication device / system
CN101075433A (en) * 2007-04-18 2007-11-21 上海山思智能科技有限公司 Artificial intelligent controlling method for discriminating robot speech

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN108917283A (en) * 2018-07-12 2018-11-30 四川虹美智能科技有限公司 A kind of intelligent refrigerator control method, system, intelligent refrigerator and cloud server
US11145312B2 (en) 2018-12-04 2021-10-12 Sorenson Ip Holdings, Llc Switching between speech recognition systems
US11170761B2 (en) 2018-12-04 2021-11-09 Sorenson Ip Holdings, Llc Training of speech recognition systems
US10672383B1 (en) 2018-12-04 2020-06-02 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US10971153B2 (en) 2018-12-04 2021-04-06 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US11017778B1 (en) 2018-12-04 2021-05-25 Sorenson Ip Holdings, Llc Switching between speech recognition systems
US11935540B2 (en) 2018-12-04 2024-03-19 Sorenson Ip Holdings, Llc Switching between speech recognition systems
US10388272B1 (en) 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US10573312B1 (en) 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US11594221B2 (en) 2018-12-04 2023-02-28 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US20220028384A1 (en) * 2018-12-11 2022-01-27 Qingdao Haier Washing Machine Co., Ltd. Voice control method, cloud server and terminal device
US11967320B2 (en) * 2018-12-11 2024-04-23 Qingdao Haier Washing Machine Co., Ltd. Processing voice information with a terminal device and a cloud server to control an operation
US11488604B2 (en) 2020-08-19 2022-11-01 Sorenson Ip Holdings, Llc Transcription of audio
CN112908296A (en) * 2021-02-18 2021-06-04 上海工程技术大学 Dialect identification method
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device

Also Published As

Publication number Publication date
CN106537493A (en) 2017-03-22

Similar Documents

Publication Publication Date Title
WO2017054122A1 (en) Speech recognition system and method, client device and cloud server
KR102360924B1 (en) speech classifier
AU2016216737B2 (en) Voice Authentication and Speech Recognition System
US10074363B2 (en) Method and apparatus for keyword speech recognition
US9454958B2 (en) Exploiting heterogeneous data in deep neural network-based speech recognition systems
US10629186B1 (en) Domain and intent name feature identification and processing
US20160372116A1 (en) Voice authentication and speech recognition system and method
US9443527B1 (en) Speech recognition capability generation and control
US20190074003A1 (en) Methods and Systems for Voice-Based Programming of a Voice-Controlled Device
CN107657017A (en) Method and apparatus for providing voice service
WO2016165590A1 (en) Speech translation method and device
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
TW201907388A (en) Robust language identification method and system
CN109545197B (en) Voice instruction identification method and device and intelligent terminal
KR102443087B1 (en) Electronic device and voice recognition method thereof
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
JP7568851B2 (en) Filtering other speakers' voices from calls and audio messages
CN106653002A (en) Literal live broadcasting method and platform
JP7526846B2 (en) voice recognition
KR102312993B1 (en) Method and apparatus for implementing interactive message using artificial neural network
US10866948B2 (en) Address book management apparatus using speech recognition, vehicle, system and method thereof
JP2024507603A (en) Audio data processing methods, devices, electronic devices, media and program products
CN109887490A (en) The method and apparatus of voice for identification
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service
JP4440502B2 (en) Speaker authentication system and method

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 15905034; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 EP: PCT application non-entry in European phase
    Ref document number: 15905034; Country of ref document: EP; Kind code of ref document: A1