WO2016101577A1 - Voice recognition method, client and terminal device - Google Patents

Voice recognition method, client and terminal device

Info

Publication number
WO2016101577A1
WO2016101577A1 (PCT/CN2015/082972)
Authority
WO
WIPO (PCT)
Prior art keywords
scene
voice recognition
keyword information
corrected
polyphonic
Prior art date
Application number
PCT/CN2015/082972
Other languages
French (fr)
Chinese (zh)
Inventor
谢志华 (XIE Zhihua)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2016101577A1 publication Critical patent/WO2016101577A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition method, a client, and a terminal device. The method includes: acquiring an original voice recognition result of a voice input by a user, and parsing out the user's voice recognition scene on the basis of the original voice recognition result; acquiring, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of keyword information; generating one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; acquiring, according to the voice recognition scene, the actual information in the terminal device corresponding to the voice recognition scene, matching the junk words against the actual information, filtering out the correct polyphonic characters, and filling the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and generating, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene. The above technical solution improves users' voice interaction experience.

Description

Voice recognition method, client and terminal device
Technical field
This document relates to the field of communication technologies, and in particular to a voice recognition method, a client, and a terminal device.
Background
As voice has become one of the interaction methods most familiar to the public, voice recognition software of every kind keeps emerging on the market, and its quality varies widely; one of the criteria for measuring the quality of voice recognition software is the voice recognition rate. Although, with cloud-based recognition, every speech recognition engine provider now offers a natural language understanding capability, the capabilities each provider offers differ, and none of them can yet fully understand the semantics of different people in different scenarios. How to correctly identify the user's semantics in the current terminal scene, improve the accuracy of voice recognition, and ultimately achieve the best voice user experience is therefore both meaningful and valuable.
In the related art, the approach most engine providers adopt is generally to process the user's voice on a cloud server with language models and certain algorithms, finally derive the user's intention, and inform the user of it. In many cases, however, because certain utterances are ambiguous, the cloud server has no way to produce a unique result, so the result fed back to the user may differ from the actual result the user expected; the user then feels that recognition is inaccurate, and the user experience is poor.
Summary of the invention
Embodiments of the present invention provide a voice recognition method, a client, and a terminal device, to solve the technical problem of how to make voice recognition results better match the current user's expectation and improve the user's voice interaction experience.
According to one aspect of the embodiments of the present invention, a voice recognition method applied to the terminal device side is provided. The method includes: acquiring an original voice recognition result of a voice input by a user, and parsing out the user's voice recognition scene according to the original voice recognition result; acquiring, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information; generating one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; acquiring, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected, matching the junk words against the actual information, filtering out the correct polyphonic characters, and filling the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and generating, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, parsing out the user's voice recognition scene according to the original voice recognition result includes: matching the original voice recognition result against a preset correspondence table between voice recognition results and scenes to obtain the voice recognition scene of the user corresponding to the original voice recognition result.
Optionally, acquiring, from the original voice recognition result according to the voice recognition scene, the keyword information to be corrected and the polyphonic characters in each piece of keyword information includes: acquiring the keyword information to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table; and determining whether polyphonic characters exist in the keyword information to be corrected, and if so, acquiring the polyphonic characters in each piece of keyword information.
Optionally, generating one or more junk words containing polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information includes: converting the polyphonic characters into the corresponding pinyin, and then extracting, according to a homophone correspondence table, one or more Chinese characters corresponding to that pinyin; and filling those Chinese characters into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
Optionally, the correspondence table between voice recognition results and scenes and the scene key information extraction table are in XML (Extensible Markup Language) format.
According to another aspect of the embodiments of the present invention, a voice recognition client applied to the terminal device side is further provided. The client includes: a scene parsing module, configured to acquire an original voice recognition result of a voice input by a user and parse out the user's voice recognition scene according to the original voice recognition result; a polyphonic character extraction module, configured to acquire, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information; a junk word generation module, configured to generate one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; a polyphonic character correction module, configured to acquire, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected, match the junk words against the actual information, filter out the correct polyphonic characters, and fill the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and a result processing module, configured to generate, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, the scene parsing module is configured to match the original voice recognition result against a preset correspondence table between voice recognition results and scenes to obtain the voice recognition scene of the user corresponding to the original voice recognition result.
Optionally, the polyphonic character extraction module is configured to acquire the keyword information to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table, determine whether polyphonic characters exist in the keyword information to be corrected, and if so, acquire the polyphonic characters in each piece of keyword information.
Optionally, the polyphonic character correction module is configured to convert the polyphonic characters into the corresponding pinyin, then extract, according to a homophone correspondence table, one or more Chinese characters corresponding to that pinyin, and fill those Chinese characters into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
Optionally, the correspondence table between voice recognition results and scenes and the scene key information extraction table are in XML (Extensible Markup Language) format.
According to yet another aspect of the embodiments of the present invention, a terminal device is further provided, including the voice recognition client described above.
According to yet another aspect of the embodiments of the present invention, a computer storage medium is further provided, in which computer-executable instructions are stored, the computer-executable instructions being used to perform the method described above.
In the embodiments of the present invention, the recognition result fed back by the engine provider is further optimized in combination with the current scene, so that the recognition result better matches the current user's expectation and the user's voice interaction experience is improved.
Brief description of the drawings
FIG. 1 is a first flowchart of the voice recognition method on the terminal device side in an embodiment of the present invention;
FIG. 2 is a second flowchart of the voice recognition method on the terminal device side in an embodiment of the present invention; and
FIG. 3 is a structural block diagram of a voice recognition terminal device in an embodiment of the present invention.
Preferred embodiments of the invention
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be conveyed completely to those skilled in the art.
Embodiments of the present invention provide a voice recognition method, a client, and a terminal device applied to the terminal device side. An original voice recognition result of a voice input by a user is acquired, and the user's voice recognition scene is parsed out according to the original voice recognition result, where the original voice recognition result is obtained by a cloud server by recognizing the voice input by the user; keyword information to be corrected and the polyphonic characters in each piece of keyword information are acquired from the original voice recognition result according to the voice recognition scene; one or more junk words containing polyphonic characters are generated according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information; the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected is acquired according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the junk words are matched against the actual information, the correct polyphonic characters are filtered out, and the correct polyphonic characters are filled into the keyword information to be corrected to obtain correct keywords; and a final voice recognition result conforming to the current voice recognition scene is generated according to the correct keywords. Because the original voice recognition result can be further optimized in combination with the current voice recognition scene and the actual information in the terminal device, it is converted into a final voice recognition result that conforms to the current voice recognition scene and the terminal device, which improves the voice recognition rate.
As shown in FIG. 1, which is a first flowchart of the voice recognition method on the terminal device side in an embodiment of the present invention, the method includes the following steps.
Step S101: Acquire an original voice recognition result of the voice input by a user, and parse out the user's voice recognition scene according to the original voice recognition result, where the original voice recognition result is obtained by a cloud server by recognizing the voice input by the user.
Optionally, in step S101, the voice recognition scene of the user corresponding to the original voice recognition result can be obtained by matching against a preset correspondence table between voice recognition results and scenes. The voice recognition scene indicates in what scene the user is using voice recognition and may include a calling scene, a music scene, and so on; the correspondence table maps all the different expressions that represent the same scene to one unified scene. For the calling scene, for example, the original voice recognition result returned by each engine provider's recognition may differ: some return "打电话" (make a phone call), some may return "呼叫" (call), and some may return "Call". In this embodiment, the correspondence table between recognition results and scenes maps these different expressions of the same scene to one and the same unified scene, so that step S101 finally yields the unique voice recognition scene of the recognition result.
Optionally, the correspondence table between voice recognition results and scenes is in XML (Extensible Markup Language) format; example code is as follows:
<?xml version="1.0"encoding="utf-8"?><? Xml version="1.0"encoding="utf-8"? >
<SceneMapTable><SceneMapTable>
<Domain><Domain>
<Scene>打电话</Scene><Scene>Call</Scene>
<Value><Value>
<V>呼叫</V><V>Call</V>
<V>打电话</V><V>Call</V>
<V>Call</V><V>Call</V>
</Value></Value>
</Domain></Domain>
<Domain> <Domain>
<Scene>音乐</Scene><Scene>Music</Scene>
<Value><Value>
<V>播放音乐</V><V>Play music</V>
<V>听音乐</V><V>Listen to music</V>
<V>Music</V><V>Music</V>
</Value></Value>
</Domain></Domain>
</SceneMapTable></SceneMapTable>
The correspondence table between recognition results and scenes set up by the above code records the correspondence between "呼叫", "打电话", "Call" and the calling scene, and between "播放音乐", "听音乐", "Music" and the music scene. If the original voice recognition result includes "呼叫", step S101 determines that the user's voice recognition scene is the calling scene; if it includes "播放音乐", step S101 determines that the user's voice recognition scene is the music scene.
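For illustration only, the following minimal sketch shows how a client might load such a SceneMapTable and resolve the unique scene of step S101. It assumes the XML layout of the example above; the function and variable names are invented for this sketch and do not come from the disclosure.

import xml.etree.ElementTree as ET
from typing import Optional

def load_scene_map(xml_text: str) -> dict:
    """Map every expression (<V>) to its unified scene (<Scene>)."""
    table = {}
    for domain in ET.fromstring(xml_text).findall("Domain"):
        scene = domain.findtext("Scene")
        for v in domain.findall("Value/V"):
            table[v.text] = scene
    return table

def resolve_scene(result_text: str, table: dict) -> Optional[str]:
    """Return the unified scene whose expression occurs in the result."""
    for expression, scene in table.items():
        if expression in result_text:
            return scene
    return None  # unknown scene: handled by other logic

# e.g. resolve_scene("呼叫张乐", load_scene_map(xml_text)) -> "打电话"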
In the embodiments of the present invention, the cloud server can use the natural language understanding capability of the related art to recognize the voice input by the user and obtain the original voice recognition result.
Step S103: Acquire, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information.
Optionally, the keyword information to be corrected is acquired from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table; it is then determined whether polyphonic characters exist in the keyword information to be corrected, and if so, the polyphonic characters in each piece of keyword information are acquired.
That is, in step S103, the keyword information that may need correction can be extracted from the original voice recognition result according to the scene key information extraction table, after which it is determined whether polyphonic characters exist in that keyword information; if so, the keyword and the polyphonic characters are extracted. Whether polyphonic characters exist in the keyword information can be determined with a polyphonic character dictionary: each character of the keyword information is looked up in the dictionary to confirm whether it is polyphonic, and all characters confirmed to be polyphonic are saved separately. For example, if the scene is determined to be a phone call, the contact information needs to be extracted as the keyword information that may need correction; it is then determined whether the recognized contact contains polyphonic characters, and if so, the contact and the polyphonic characters are extracted.
Optionally, the scene key information extraction table is in XML format; example code is as follows:
<?xml version="1.0"encoding="utf-8"?><? Xml version="1.0"encoding="utf-8"? >
<KeywordMapTable><KeywordMapTable>
<Domain><Domain>
<Scene>打电话</Scene><Scene>Call</Scene>
<Keyword><Keyword>
<Key>联系人</Key><Key>Contact</Key>
</Keyword></Keyword>
</Domain></Domain>
<Domain><Domain>
<Scene>音乐</Scene><Scene>Music</Scene>
<Keyword><Keyword>
<Key>歌曲名</Key><Key>Song name</Key>
<Key>专辑名</Key><Key> album name</Key>
<Key>艺术家</Key><Key>Artist</Key>
</Keyword></Keyword>
</Domain></Domain>
</KeywordMapTable></KeywordMapTable>
The above code also shows that if the scene is determined to be music, the song name, album name, or artist needs to be extracted as the keyword information that may need correction; it is then determined whether the recognized song name, album name, or artist contains polyphonic characters, and if so, the song name, album name, or artist and the polyphonic characters are extracted.
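As a further illustration of step S103, the sketch below reads the keyword fields a scene requires from the KeywordMapTable above and flags polyphonic characters with a toy dictionary; a real implementation would use a full polyphonic character dictionary, and all names here are assumptions made for the example.

import xml.etree.ElementTree as ET

# Tiny stand-in for a real polyphonic character dictionary (多音字词典).
POLYPHONIC_DICT = {"乐", "行", "重", "单"}

def keys_for_scene(xml_text: str, scene: str) -> list:
    """Read the <Key> entries of the extraction table for one scene."""
    for domain in ET.fromstring(xml_text).findall("Domain"):
        if domain.findtext("Scene") == scene:
            return [k.text for k in domain.findall("Keyword/Key")]
    return []

def polyphonic_chars(keyword: str) -> list:
    """Keep every character of the keyword that the dictionary marks."""
    return [ch for ch in keyword if ch in POLYPHONIC_DICT]

# e.g. keys_for_scene(xml_text, "打电话") -> ["联系人"]
#      polyphonic_chars("张乐") -> ["乐"]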
Step S105: Generate one or more junk words containing polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information.
Optionally, the polyphonic characters extracted in step S103 are converted into the corresponding pinyin, all possible Chinese characters corresponding to that pinyin are then extracted according to the homophone correspondence table, and all of those characters are filled into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
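A minimal sketch of this junk-word generation step follows. It uses the third-party pypinyin package for the character-to-pinyin conversion; the homophone correspondence table is a toy subset, and the table contents and function names are illustrative assumptions, not part of the disclosure.

from pypinyin import pinyin, Style  # third-party: pip install pypinyin

# Toy homophone correspondence table: pinyin -> characters with that reading.
HOMOPHONE_TABLE = {
    "yue": "悦越月乐",
    "le": "乐",
}

def junk_words(keyword: str, polyphonic_char: str) -> list:
    """Substitute every homophone of every reading into the keyword."""
    # heteronym=True returns all readings of a polyphonic character
    readings = pinyin(polyphonic_char, style=Style.NORMAL, heteronym=True)[0]
    words = set()
    for py in readings:
        for ch in HOMOPHONE_TABLE.get(py, ""):
            words.add(keyword.replace(polyphonic_char, ch))
    return sorted(words)

# e.g. junk_words("张乐", "乐") -> ["张乐", "张悦", "张月", "张越"]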
Step S107: Acquire, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected; match the junk words against the actual information, filter out the correct polyphonic characters, and fill the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords.
For example, if the voice recognition scene is the calling scene, or the keyword information to be corrected belongs to the contact category, the actual information is the actual contact information list; it can of course be understood that the embodiments of the present invention do not limit how the actual information is represented.
Step S109: Generate, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, in step S107, the actual information corresponding to the terminal device is extracted according to the voice recognition scene information and the category to which the keyword belongs; for example, if the current scene is the calling scene and the current keyword belongs to the contact category, the real contact information of the current mobile phone is extracted. Then, still in step S107, the junk words above are compared one by one against the set of real contact information; if an exact match is found, that junk word is retained and, in step S109, serves as the final contact recognition result.
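Continuing the same illustrative assumptions, the sketch below shows the filtering of steps S107 and S109, with a hard-coded list standing in for the phone's real contacts:

def correct_keyword(candidates: list, contacts: list):
    """Keep the junk word that exactly matches an actual contact."""
    for word in candidates:
        if word in contacts:  # exact match against the actual information
            return word       # the correctly written contact name
    return None               # no match: report a recognition failure

contacts = ["张悦", "李明"]  # stand-in for the terminal's contact list
# e.g. correct_keyword(junk_words("张乐", "乐"), contacts) -> "张悦"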
In the embodiments of the present invention, adding junk words improves the recognition rate for polyphonic characters. The voice recognition approach in the embodiments of the present invention is applicable to all scenes involving recognition of key terminal information, such as contacts, music names, artists, album names, application names, and so on; the above embodiments can generate recognition results that are more accurate and better match the user's expectation, thereby improving the voice recognition rate and the user's voice interaction experience.
The following describes the voice recognition flow in an embodiment of the present invention with reference to FIG. 2, taking the calling scene as an example; the steps are as follows (an illustrative end-to-end sketch follows the list).
Step S201: Parse the scene of the original voice recognition result to obtain the unique voice recognition scene.
Optionally, the parsing is performed according to the scene keywords in the original voice recognition result to obtain the unique voice recognition scene corresponding to those scene keywords.
Step S203: Determine whether the scene is the calling scene; if so, proceed to step S205, in which the contact entry is extracted and the polyphonic characters of the name in the contact entry are acquired; if not, handle according to the other scenes.
Step S207: Convert the polyphonic characters acquired above into pinyin.
Step S209: Query all Chinese characters matching the above pinyin; if there are matching characters, perform step S211; if there is no match, handle the character as a non-polyphonic character.
Step S211: Substitute the acquired junk characters for the polyphonic characters in the original keyword to generate name junk words.
Step S213: Acquire the actual contact information list of the terminal device.
Step S215: Filter the junk words against the actual contact list; if the correct contact information is obtained, perform step S217; if the correct contact information is not obtained, give a voice recognition result such as a recognition failure.
Step S217: Recombine the correct contact information with the voice recognition scene information to generate the final voice recognition result.
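Purely to show how steps S201 to S217 compose, the following hypothetical glue function chains the sketches defined earlier (resolve_scene, polyphonic_chars, junk_words, correct_keyword); the crude contact extraction and all names are assumptions for the example, not code from the disclosure.

def recognize_call(original_result: str, scene_table: dict, contacts: list) -> str:
    scene = resolve_scene(original_result, scene_table)      # step S201
    if scene != "打电话":                                     # step S203
        return original_result    # other scenes handled elsewhere
    name = original_result.replace("呼叫", "")  # crude contact extraction (assumed)
    chars = polyphonic_chars(name)                           # step S205
    if not chars:
        return original_result    # nothing to correct
    for ch in chars:
        candidates = junk_words(name, ch)                    # steps S207-S211
        corrected = correct_keyword(candidates, contacts)    # steps S213-S215
        if corrected:
            return "呼叫" + corrected                         # step S217
    return "识别失败"  # recognition failure when no candidate matches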
As shown in FIG. 3, which is a schematic structural diagram of the voice recognition client applied to the terminal device side in an embodiment of the present invention, the client includes:
a scene parsing module 301, configured to acquire an original voice recognition result of a voice input by a user and parse out the user's voice recognition scene according to the original voice recognition result, where the original voice recognition result is obtained by a cloud server by recognizing the voice input by the user;
a polyphonic character extraction module 302, configured to acquire, from the original voice recognition result according to the voice recognition scene, keyword information to be corrected and the polyphonic characters in each piece of the keyword information;
a junk word generation module 303, configured to generate one or more junk words containing the polyphonic characters according to the keyword information to be corrected and the polyphonic characters in each piece of keyword information;
a polyphonic character correction module 304, configured to acquire, according to the voice recognition scene or the category to which the keyword information to be corrected belongs, the actual information in the terminal device corresponding to the voice recognition scene or the keyword information to be corrected, match the junk words against the actual information, filter out the correct polyphonic characters, and fill the correct polyphonic characters into the keyword information to be corrected to obtain correct keywords; and
a result processing module 305, configured to generate, according to the correct keywords, a final voice recognition result that conforms to the current voice recognition scene.
Optionally, in this embodiment of the present invention, the scene parsing module 301 is configured to match the original voice recognition result against a preset correspondence table between voice recognition results and scenes to obtain the voice recognition scene of the user corresponding to the original voice recognition result.
Optionally, in this embodiment of the present invention, the polyphonic character extraction module 302 is configured to acquire the keyword information to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table, determine whether polyphonic characters exist in the keyword information to be corrected, and if so, acquire the polyphonic characters in each piece of keyword information.
Optionally, in this embodiment of the present invention, the polyphonic character correction module 304 is configured to convert the polyphonic characters into the corresponding pinyin, then extract, according to a homophone correspondence table, one or more Chinese characters corresponding to that pinyin, and fill those Chinese characters into the keyword information to be corrected in place of the polyphonic characters, to form one or more junk words containing polyphonic characters.
Optionally, in this embodiment of the present invention, the correspondence table between voice recognition results and scenes and the scene key information extraction table are in XML (Extensible Markup Language) format.
According to yet another aspect of the embodiments of the present invention, a terminal device is further provided, including the voice recognition client described above.
The above are preferred embodiments of the present invention. It should be noted that several improvements and refinements can be made by those of ordinary skill in the art without departing from the principles of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments can be implemented as a computer program flow; the computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, apparatus, or device), and when executed, it performs one of, or a combination of, the steps of the method embodiments.
Optionally, all or part of the steps of the above embodiments can also be implemented with integrated circuits; these steps can be fabricated as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module.
The devices/functional modules/functional units in the above embodiments can be implemented with general-purpose computing devices; they can be centralized on a single computing device or distributed over a network formed by multiple computing devices.
When the devices/functional modules/functional units in the above embodiments are implemented in the form of software functional modules and sold or used as stand-alone products, they can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disk, or the like.
Industrial applicability
The above technical solution makes the recognition result better match the current user's expectation and improves the user's voice interaction experience.

Claims (12)

  1. A voice recognition method applied to a terminal device side, the method comprising:
    obtaining an original voice recognition result of a voice input by a user, and parsing out a voice recognition scene of the user according to the original voice recognition result;
    according to the voice recognition scene, obtaining, from the original voice recognition result, keyword information that needs to be corrected and a polyphonic word in each piece of the keyword information;
    generating one or more junk words containing the polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information;
    according to the voice recognition scene or the range to which the keyword information that needs to be corrected belongs, obtaining actual information in the terminal device corresponding to the voice recognition scene or the keyword information that needs to be corrected, matching the junk words against the actual information, filtering out the correct polyphonic word, and filling the correct polyphonic word into the keyword information that needs to be corrected to obtain a correct keyword;
    generating, according to the correct keyword, a final voice recognition result conforming to the current voice recognition scene.
  2. The method according to claim 1, wherein parsing out the voice recognition scene of the user according to the original voice recognition result comprises:
    obtaining, by matching against a preset speech-recognition-result-to-scene correspondence table, the voice recognition scene of the user corresponding to the original voice recognition result.
  3. The method according to claim 2, wherein obtaining, from the original voice recognition result, the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information according to the voice recognition scene comprises:
    obtaining the keyword information that needs to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table;
    determining whether a polyphonic word exists in the keyword information that needs to be corrected, and if so, obtaining the polyphonic word in each piece of keyword information.
  4. The method according to claim 1, wherein generating one or more junk words containing polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information comprises:
    converting the polyphonic word into its corresponding pinyin, and then extracting one or more Chinese characters corresponding to the pinyin according to a homophone correspondence table;
    filling the Chinese characters into the keyword information that needs to be corrected to replace the polyphonic word, so as to form one or more junk words containing polyphonic words.
  5. The method according to claim 3, wherein the speech-recognition-result-to-scene correspondence table and the scene key information extraction table are in XML (Extensible Markup Language) format.
  6. A voice recognition client applied to a terminal device side, the client comprising:
    a scene parsing module, configured to obtain an original voice recognition result of a voice input by a user and to parse out a voice recognition scene of the user according to the original voice recognition result;
    a polyphonic word extraction module, configured to obtain, from the original voice recognition result, keyword information that needs to be corrected and a polyphonic word in each piece of the keyword information according to the voice recognition scene;
    a junk word generation module, configured to generate one or more junk words containing the polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information;
    a polyphonic word correction module, configured to obtain, according to the voice recognition scene or the range to which the keyword information that needs to be corrected belongs, actual information in the terminal device corresponding to the voice recognition scene or the keyword information that needs to be corrected, to match the junk words against the actual information, to filter out the correct polyphonic word, and to fill the correct polyphonic word into the keyword information that needs to be corrected to obtain a correct keyword;
    a result processing module, configured to generate, according to the correct keyword, a final voice recognition result conforming to the current voice recognition scene.
  7. The client according to claim 6, wherein the scene parsing module is configured to parse out the voice recognition scene of the user according to the original voice recognition result in the following manner:
    obtaining, by matching against a preset speech-recognition-result-to-scene correspondence table, the voice recognition scene of the user corresponding to the original voice recognition result.
  8. The client according to claim 6, wherein the polyphonic word extraction module is configured to obtain, from the original voice recognition result, the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information according to the voice recognition scene in the following manner:
    obtaining the keyword information that needs to be corrected from the original voice recognition result according to the voice recognition scene and a preset scene key information extraction table; determining whether a polyphonic word exists in the keyword information that needs to be corrected, and if so, obtaining the polyphonic word in each piece of keyword information.
  9. The client according to claim 6, wherein the polyphonic word correction module is configured to generate one or more junk words containing polyphonic words according to the keyword information that needs to be corrected and the polyphonic word in each piece of keyword information in the following manner:
    converting the polyphonic word into its corresponding pinyin, and then extracting one or more Chinese characters corresponding to the pinyin according to a homophone correspondence table; filling the Chinese characters into the keyword information that needs to be corrected to replace the polyphonic word, so as to form one or more junk words containing polyphonic words.
  10. The client according to claim 6, wherein the speech-recognition-result-to-scene correspondence table and the scene key information extraction table are in XML (Extensible Markup Language) format.
  11. A terminal device, comprising the voice recognition client according to any one of claims 6 to 10.
  12. A computer storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the method according to any one of claims 1 to 5.
PCT/CN2015/082972 2014-12-24 2015-06-30 Voice recognition method, client and terminal device WO2016101577A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410817478.3A CN105786880A (en) 2014-12-24 2014-12-24 Voice recognition method, client and terminal device
CN201410817478.3 2014-12-24

Publications (1)

Publication Number Publication Date
WO2016101577A1 true WO2016101577A1 (en) 2016-06-30

Family

ID=56149133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/082972 WO2016101577A1 (en) 2014-12-24 2015-06-30 Voice recognition method, client and terminal device

Country Status (2)

Country Link
CN (1) CN105786880A (en)
WO (1) WO2016101577A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694940B (en) * 2017-04-10 2020-07-03 北京猎户星空科技有限公司 Voice recognition method and device and electronic equipment
CN106997761A (en) * 2017-04-20 2017-08-01 滁州职业技术学院 The method and mobile terminal of a kind of secret protection
CN106875949B (en) * 2017-04-28 2020-09-22 深圳市大乘科技股份有限公司 Correction method and device for voice recognition
CN107315742A (en) * 2017-07-03 2017-11-03 中国科学院自动化研究所 The Interpreter's method and system that personalize with good in interactive function
CN107424612B (en) * 2017-07-28 2021-07-06 北京搜狗科技发展有限公司 Processing method, apparatus and machine-readable medium
CN108334567B (en) * 2018-01-16 2021-09-10 北京奇艺世纪科技有限公司 Junk text distinguishing method and device and server
CN109616111B (en) * 2018-12-24 2023-03-14 北京恒泰实达科技股份有限公司 Scene interaction control method based on voice recognition
CN111696545B (en) * 2019-03-15 2023-11-03 北京汇钧科技有限公司 Speech recognition error correction method, device and storage medium
CN112291281B (en) * 2019-07-09 2023-11-03 钉钉控股(开曼)有限公司 Voice broadcasting and voice broadcasting content setting method and device
CN110837734A (en) * 2019-11-14 2020-02-25 维沃移动通信有限公司 Text information processing method and mobile terminal

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101217035A (en) * 2007-12-29 2008-07-09 无敌科技(西安)有限公司 A vocabulary database construction method and the corresponding hunting and comparison method for voice identification system
CN102074231A (en) * 2010-12-30 2011-05-25 万音达有限公司 Voice recognition method and system
CN103594085A (en) * 2012-08-16 2014-02-19 百度在线网络技术(北京)有限公司 Method and system providing speech recognition result
CN103674012A (en) * 2012-09-21 2014-03-26 高德软件有限公司 Voice customizing method and device and voice identification method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107785018A (en) * 2016-08-31 2018-03-09 科大讯飞股份有限公司 More wheel interaction semantics understanding methods and device
CN108305629A (en) * 2017-12-25 2018-07-20 广东小天才科技有限公司 A kind of scene learning Content acquisition methods, device, facility for study and storage medium
CN108305629B (en) * 2017-12-25 2021-07-20 广东小天才科技有限公司 Scene learning content acquisition method and device, learning equipment and storage medium
CN111402887A (en) * 2018-12-17 2020-07-10 北京未来媒体科技股份有限公司 Method and device for escaping characters by voice
CN113053362A (en) * 2021-03-30 2021-06-29 建信金融科技有限责任公司 Method, device, equipment and computer readable medium for speech recognition
CN114120972A (en) * 2022-01-28 2022-03-01 科大讯飞华南有限公司 Intelligent voice recognition method and system based on scene
CN114120972B (en) * 2022-01-28 2022-04-12 科大讯飞华南有限公司 Intelligent voice recognition method and system based on scene

Also Published As

Publication number Publication date
CN105786880A (en) 2016-07-20

Similar Documents

Publication Publication Date Title
WO2016101577A1 (en) Voice recognition method, client and terminal device
US11615791B2 (en) Voice application platform
US10235999B1 (en) Voice application platform
KR102058131B1 (en) Modulation of Packetized Audio Signals
TW201833793A (en) Semantic extraction method and device of natural language and computer storage medium
US20130198268A1 (en) Generation of a music playlist based on text content accessed by a user
WO2020253399A1 (en) Log classification rule generation method, device, apparatus, and readable storage medium
CN111666746A (en) Method and device for generating conference summary, electronic equipment and storage medium
JP5496863B2 (en) Emotion estimation apparatus, method, program, and recording medium
AU2017216520A1 (en) Common data repository for improving transactional efficiencies of user interactions with a computing device
JP2019003319A (en) Interactive business support system and interactive business support program
CN108121455B (en) Identification correction method and device
CN106713111B (en) Processing method for adding friends, terminal and server
CN106681523A (en) Library configuration method, library configuration device and call handling method of input method
WO2020226617A1 (en) Invoking functions of agents via digital assistant applications using address templates
CN111354350A (en) Voice processing method and device, voice processing equipment and electronic equipment
JP2014229275A (en) Query answering device and method
KR20130073709A (en) Method and apparatus of recognizing business card using image and voice information
WO2019236444A1 (en) Voice application platform
WO2022143349A1 (en) Method and device for determining user intent
CN113741864A (en) Automatic design method and system of semantic service interface based on natural language processing
CN111401034B (en) Semantic analysis method, semantic analysis device and terminal for text
US20060242578A1 (en) Method for managing content
Lindgren Machine recognition of human language Part II: Theoretical models of speech perception and language
JP2017134162A (en) Voice recognition device, voice recognition method, and voice recognition program

Legal Events

Code   Title and Description
121    Ep: the epo has been informed by wipo that ep was designated in this application
       (Ref document number: 15871652; Country of ref document: EP; Kind code of ref document: A1)
NENP   Non-entry into the national phase
       (Ref country code: DE)
122    Ep: pct application non-entry in european phase
       (Ref document number: 15871652; Country of ref document: EP; Kind code of ref document: A1)