WO2023226700A1 - Voice interaction method and apparatus, electronic device and storage medium - Google Patents

Voice interaction method and apparatus, electronic device and storage medium

Info

Publication number
WO2023226700A1
WO2023226700A1 (PCT/CN2023/091826)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
hot
command
words
voice interaction
Prior art date
Application number
PCT/CN2023/091826
Other languages
English (en)
French (fr)
Inventor
宿绍勋
王炳乾
夏友祥
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Publication of WO2023226700A1

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 — Adaptation
    • G10L 15/08 — Speech classification or search
    • G10L 15/18 — Speech classification or search using natural language modelling
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present application relates to the field of voice interaction technology, and in particular to a voice interaction method and its device, electronic equipment and storage medium.
  • the present application aims to solve, at least to a certain extent, one of the problems in the related art.
  • the purpose of this application is to provide a voice interaction method and its device, electronic device and storage medium.
  • the embodiment of the present application provides a voice interaction method.
  • the voice interaction method includes: in response to a user's preset voice request for updating voice hot words, obtaining the voice data input by the user according to a preset command template; performing voice recognition on the voice data according to the preset command template to obtain a target voice hot word; and updating the hot word library of the speech recognition model according to the target voice hot word.
  • the voice interaction method, before the step of obtaining the voice data input by the user according to the preset command template in response to the user's preset voice request for updating voice hot words, includes: when a voice interaction request input by the user fails to match a command, recording the number of consecutive interaction recognitions for which command matching failed, and adding the command word corresponding to the voice interaction request to a list of consecutive failed command words; and when the number of consecutive interaction recognitions is greater than the number threshold and the word meaning similarity between the command words in the list of consecutive failed command words meets a preset condition, prompting the user to update the voice hot words.
  • the voice interaction method includes: if the voice interaction request input by the user successfully matches a command, clearing the number of consecutive interaction recognition times where the matching command fails and the list of consecutive failed command words.
  • the voice interaction method includes: determining the word meaning similarity according to the edit distance and/or the longest substring between the command words in the list of consecutive failed command words.
  • the voice interaction method includes: obtaining a voice interaction request input by the user; processing the voice interaction request according to the hot word library and the speech recognition model to obtain a command word; and executing the control instruction corresponding to the command word.
  • processing the voice interaction request according to the hot word library and the speech recognition model to obtain the command word includes: obtaining the acoustic score and the hot word score of each voice hot word in the hot word library; determining the number of words whose edit distance to the voice hot word is a set value; calculating a comprehensive score corresponding to the voice hot word according to the acoustic score, the hot word score and the number of words; and determining the command word among the voice hot words in the hot word library according to the comprehensive score.
  • the step of calculating the comprehensive score corresponding to the voice hot word based on the acoustic score, the hot word score and the vocabulary number is implemented through the following conditional expression:

    y* = argmax_y [ log P(y|x) + λ log P_C(y) - μ N1(y) ]

    where log P(y|x) is the acoustic score, λ log P_C(y) is the hot word score, N1(y) is the vocabulary number, and λ and μ are the corresponding coefficients.
  • the step of calculating the comprehensive score corresponding to the voice hot word based on the acoustic score, the hot word score and the vocabulary number may also be implemented through the following conditional expression:

    y* = argmax_y [ log P(y|x) + (λ / N1(y)) log P_C(y) ]

    where log P(y|x) is the acoustic score, log P_C(y) is the hot word score, N1(y) is the vocabulary number, and λ is the corresponding coefficient.
  • the longest substring refers to the longest substring without repeated characters.
  • the step of obtaining the acoustic scores and hot word scores of the phonetic hot words in the hot word library includes:
  • the end-to-end speech recognition tool is used to maintain the states of a context graph during the decoding process, and the hot word score of each voice hot word in the hot word library is calculated through the states in the subgraphs of the context graph.
  • the voice interaction device includes: an acquisition module, a recognition module and a hot word library update module.
  • the acquisition module is used to obtain the voice data input by the user according to the preset command template in response to the user's preset voice request for updating voice hot words;
  • the recognition module is used to perform speech recognition on the voice data according to the preset command template to obtain the target voice hot words;
  • the hot word library update module is used to update the hot word library of the speech recognition model according to the target speech hot words.
  • the electronic device includes a processor and a memory.
  • the memory stores a computer program.
  • when the computer program is executed by the processor, the voice interaction method described in any one of the above embodiments is implemented.
  • the application also provides a non-volatile computer-readable storage medium containing a computer program.
  • when the computer program is executed by one or more processors, the processors are caused to execute the voice interaction method described in any one of the above embodiments.
  • the present application also provides a computer program, which includes computer readable code.
  • when the computer readable code is run on a computing processing device, it causes the computing processing device to execute the voice interaction method described in any one of the above embodiments.
  • Figure 1 is a schematic flow chart of a voice interaction method in some embodiments of the present application.
  • Figure 2 is a schematic structural diagram of a voice interaction device according to certain embodiments of the present application.
  • Figure 3 is a schematic flowchart of a voice interaction method in some embodiments of the present application.
  • Figure 4 is a schematic structural diagram of a voice interaction device according to certain embodiments of the present application.
  • Figure 5 is a schematic flowchart of a voice interaction method in some embodiments of the present application.
  • Figure 6 is a schematic structural diagram of a voice interaction device according to certain embodiments of the present application.
  • Figure 7 is a schematic flowchart of a voice interaction method in some embodiments of the present application.
  • Figure 8 is a schematic structural diagram of a voice interaction device according to certain embodiments of the present application.
  • Figure 9 is a schematic flowchart of a voice interaction method in some embodiments of the present application.
  • Figure 10 is a schematic structural diagram of a voice interaction device according to certain embodiments of the present application.
  • Figure 11 is a schematic flowchart of a voice interaction method in some embodiments of the present application.
  • Figure 12 is a schematic structural diagram of an electronic device according to certain embodiments of the present application.
  • Figure 13 is a schematic structural diagram of a computer-readable storage medium according to certain embodiments of the present application.
  • first and second are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, features defined as “first” and “second” may explicitly or implicitly include one or more of the described features. In the description of this application, “plurality” means two or more than two, unless otherwise expressly and specifically limited.
  • the term "connection" should be understood in a broad sense: it may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection, an electrical connection, or mutual communication; it may be a direct connection or an indirect connection through an intermediary; and it may be an internal connection between two elements or an interaction relationship between two elements.
  • the voice interaction method includes:
  • the voice interaction device 10 includes: an acquisition module 11 , a recognition module 12 and a hot word library update module 13 .
  • the acquisition module 11 is used to obtain the voice data input by the user according to the preset command template in response to the user's preset voice request to update the voice hot word;
  • the recognition module 12 is used to perform voice recognition on the voice data according to the preset command template to obtain the target voice hot word.
  • Hot word library update module 13 is used to update the hot word library of the speech recognition model according to the target speech hot words.
  • Hot word enhancement technology is a technology that improves the recognition probability of specific context phrases (such as personal names, music lists, proper nouns, etc.) in the ASR system to achieve better recognition performance.
  • Hot word technology can help the speech recognition model adapt to more scenarios, such as adding common vocabulary to improve the recognition probability of common nouns in various fields, and hot word technology can also help, to a certain extent, identify words outside the model vocabulary (out-of-vocabulary, OOV).
  • when a voice request contains unfamiliar words, the speech recognition model cannot recognize the correct command corresponding to the voice request. In this case, the user can issue a preset voice request to update the voice hot words, which updates the hot word library in the speech recognition model and adds the unfamiliar words in the voice request to the hot word library, so that the voice request can be recognized by the speech recognition model to obtain the corresponding command.
  • the voice data input by the user according to the preset command template is obtained.
  • the preset voice request is, for example, "update voice hot words”.
  • the voice data input by the user according to the preset command template can be understood through an example: the preset command template spells out each character of a hot word with a disambiguating word, such as "行李的李，潇洒的萧" ("Li as in luggage, Xiao as in dashing"), whose corresponding hot word is "Li Xiao" (李萧). Likewise, the user may input "关闭的关，灯光的灯" ("guan as in close, deng as in light") according to the preset command template.
  • the speech recognition model can recognize the target voice hot word input by the user according to the fixed sentence pattern of the preset command template. Following the fixed sentence pattern of the template, the target voice hot word in the user's voice data "关闭的关，灯光的灯" is identified as "关灯" ("turn off the light").
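The spelled-out template in the example above appears to be a rendering of the Chinese pattern "行李的李，潇洒的萧" ("Li as in luggage, Xiao as in dashing"), which registers the hot word "李萧". Assuming clauses of the form ⟨word⟩的⟨character⟩ separated by commas (an assumption; the patent only gives the spoken examples), a hypothetical parser for this template might look like:

```python
import re

def extract_hot_word(utterance: str) -> str:
    """Take the character spelled out after each 的 in every clause.

    Hypothetical sketch: the clause delimiter and the 的 marker are
    assumptions inferred from the examples, not from the patent text.
    """
    chars = []
    for clause in re.split(r"[，,、]", utterance):
        m = re.search(r"的(.)", clause)  # character being spelled out
        if m:
            chars.append(m.group(1))
    return "".join(chars)
```

For instance, `extract_hot_word("关闭的关，灯光的灯")` would yield the hot word "关灯".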
  • the hot word library of the speech recognition model is updated according to the target speech hot words.
  • the target speech hot word identified above is "turn off the lights" and can be added to the hot word library of the speech recognition model.
  • the voice interaction method of the present application can add unfamiliar words in the user's voice interaction request to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, improving the problem that, in voice interaction, the speech recognition model cannot accurately understand the user's voice commands.
  • the voice interaction method includes:
  • the voice interaction device 10 also includes a recording module 111 and a prompt module 113 .
  • the recording module 111 is used to record the number of consecutive interaction recognition failures in matching commands when the voice interaction request input by the user fails to match the command, and add the command words corresponding to the voice interaction request to the list of continuous failed command words;
  • the prompt module 113 is used to prompt the user to update the voice hot words when the number of consecutive interaction recognitions is greater than the number threshold and the word meaning similarity between the command words in the list of consecutive failed command words meets the preset conditions.
  • when the voice interaction request input by the user fails to match a command, the number of consecutive interaction recognitions that failed to match a command is recorded, and the command word corresponding to the voice interaction request is added to the list of consecutive failed command words. For example, if the voice interaction request input by the user is "Play the Internationale" and the recognition result of the speech model cannot be matched to a corresponding command, the voice interaction request fails to match a command. At this point, the number of consecutive interaction recognition failures can be recorded; it may be 2 or more.
  • the number threshold can be, for example, 2.
  • the continuous failed command word list is a table composed of command words that have failed to recognize continuously.
  • for example, the command word list generated by 4 consecutive user interaction recognition failures includes "lights out, lights out, black lights, lights off"; the commands pointed to by the command words entered by the user 4 times are all "turn off the lights". If the word meaning similarity between the 4 command words in the list of consecutive failed command words meets the preset condition, the user can be prompted to update the voice hot words.
  • the preset condition can be, for example, a word meaning similarity of 60%. If the word meaning similarity between the four command words is 80%, the similarity between them meets the preset condition, and the user can be prompted to update the voice hot words.
  • the voice interaction method of the present application can record the number of consecutive interaction recognitions that fail to match a command, and add the command word corresponding to the voice interaction request to the list of consecutive failed command words. When the number of consecutive interaction recognitions is greater than the number threshold and the word meaning similarity between the command words in the list of consecutive failed command words meets the preset condition, the user is prompted to update the voice hot words.
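The bookkeeping described above (count consecutive failures, collect the failed command words, prompt once the count exceeds the threshold and the words are similar enough) can be sketched as follows. The class name, the normalized edit-distance similarity measure, and the default thresholds are illustrative assumptions, not taken from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum substitutions, insertions and deletions to turn a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

class FailureTracker:
    """Hypothetical tracker for consecutive command-match failures."""

    def __init__(self, count_threshold=2, similarity_threshold=0.6):
        self.count_threshold = count_threshold
        self.similarity_threshold = similarity_threshold
        self.fail_count = 0
        self.failed_words = []

    def on_match_failed(self, command_word: str) -> bool:
        """Record a failure; return True if the user should be prompted."""
        self.fail_count += 1
        self.failed_words.append(command_word)
        if self.fail_count <= self.count_threshold:
            return False
        # word-meaning similarity proxy: 1 - normalized edit distance
        # between every pair of consecutively failed command words
        pairs = [(a, b) for i, a in enumerate(self.failed_words)
                 for b in self.failed_words[i + 1:]]
        sims = [1 - edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs]
        return min(sims) >= self.similarity_threshold

    def on_match_succeeded(self):
        """A successful match clears both the counter and the word list."""
        self.fail_count = 0
        self.failed_words.clear()
```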
  • the voice interaction method includes:
  • the voice interaction device 10 also includes a clearing module 115 .
  • the clearing module 115 is configured to clear the number of consecutive interaction recognitions that have failed to match the command and the list of consecutive failed command words when the voice interaction request input by the user successfully matches the command.
  • when the voice interaction request input by the user successfully matches a command, the number of consecutive interaction recognitions for which matching failed and the list of consecutive failed command words are cleared.
  • for example, suppose the number of consecutive interaction recognitions for failed matching commands is 2, and the list of consecutive failed command words contains "turn off the lights, turn off the lights". If the next voice interaction request input by the user is "turn off the lights" and it matches the corresponding lights-off command, the corresponding lights-off command can be returned to complete the voice interaction.
  • at this point, the number of consecutive interaction recognitions that failed to match a command is cleared or set to 0, and the list of consecutive failed command words is cleared.
  • when the voice interaction request matches a command successfully, the user can complete the voice interaction according to the currently input voice request without adding hot words. Therefore, the number of consecutive interaction recognitions that fail to match a command can be recounted, and the list of consecutive failed command words can re-record the command words that fail to match.
  • the voice interaction method includes:
  • 012: Determine the word meaning similarity based on the edit distance and/or the longest substring between command words in the list of consecutive failed command words.
  • the voice interaction device 10 also includes a similarity determination module 112 .
  • the similarity determination module 112 is configured to determine word meaning similarity based on the edit distance and/or the longest substring between command words in the list of consecutive failed command words.
  • Word meaning similarity is determined based on the edit distance and/or the longest substring between command words in the list of consecutive failed command words. That is to say, the voice interaction method of the present application can measure the word meaning similarity between command words by comparing the edit distance between command words in the list of consecutive failed command words or the agreement of the longest substring.
  • Edit distance refers to the minimum number of edit operations required to transform one string into another; it describes how close two strings are. Allowed editing operations include substitutions, insertions, and deletions. For example (the original examples count single Chinese characters): "turn the volume high" -> "turn the volume up" requires replacing only "high" with "up", an edit distance of 1; "please switch off the lights" -> "please turn off the lights" requires two replacements, an edit distance of 2; "turn up the volume" -> "please turn up the TV volume" requires inserting "please" in front and "TV" between "turn up" and "volume", an edit distance of 3.
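The edit distance defined above can be computed with the classic dynamic-programming recurrence. This is a standard Levenshtein implementation, not code from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance: minimum number of
    substitutions, insertions and deletions to turn a into b."""
    # dp[j] holds the distance between the processed prefix of a and b[:j]
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # delete ca
                dp[j - 1] + 1,      # insert cb
                prev + (ca != cb),  # substitute (free if characters match)
            )
    return dp[len(b)]

# The original examples are single-character edits in Chinese, e.g.
# 音量调高 -> 音量调大 replaces one character, so the distance is 1.
```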
  • the command words in the list of consecutive failed command words include “turn off the lights, turn off the lights, black the lights, turn off the lights”.
  • the edit distance between "lights out", "turn off the lights", "black light" and "lights off" is 1 in each case (counting Chinese characters), which means that the word meaning similarity of these four consecutive failed command words is high; that is, the meanings of "lights out", "turn off the lights", "black light" and "lights off" are relatively similar.
  • the longest substring refers to the longest substring without repeated characters.
  • the command words in the list of consecutive failed command words include “turn off the lights, turn off the lights, black the lights, turn off the lights”.
  • the longest substring between "lights out", "turn off the lights", "black light" and "lights off" is 1 (counting Chinese characters), which can also indicate that these four consecutive failed command words have a high word meaning similarity; that is, their meanings are relatively similar.
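The quantity defined above, the longest substring without repeated characters, can be computed with a sliding window. How this per-string value is then compared between two command words is not fully specified in the text, so this sketch only computes the value itself:

```python
def longest_unique_substring(s: str) -> int:
    """Length of the longest substring of s with no repeated characters."""
    last_seen = {}   # character -> most recent index where it appeared
    start = best = 0
    for i, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1  # window must begin after the repeat
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best
```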
  • the voice interaction method of the present application can measure the word meaning similarity between command words by comparing the edit distance and/or the longest substring between command words in the list of consecutive failed command words, obtaining the word meaning similarity between each pair of consecutively failed command words. This lays the foundation for judging whether the word meaning similarity between command words in the list of consecutive failed command words meets the preset condition.
  • the voice interaction method includes:
  • the voice interaction device also includes a voice processing module 15 and an instruction execution module 16 .
  • the acquisition module 11 is used to obtain the voice interaction request input by the user;
  • the voice processing module 15 is used to process the voice interaction request according to the hot vocabulary and the speech recognition model to obtain the command word;
  • the instruction execution module 16 is used to execute the control instructions corresponding to the command words.
  • the interaction method of the present application can first process the voice interaction request according to the hot vocabulary library and the speech recognition model to obtain the command word, thereby executing the control instruction corresponding to the command word.
  • the voice interaction request input by the user can be "Turn down the volume of the TV". If the hot word library contains the hot word "turn down the volume", the voice interaction request can be processed according to the hot word library and the speech recognition model to obtain the command word "turn down the volume", and the control instruction corresponding to the command word "turn down the volume" is executed.
  • step 05 includes:
  • the voice processing module 15 is used to: obtain the acoustic score and hot word score of each voice hot word in the hot word library; determine the number of words whose edit distance to the voice hot word is a set value; calculate the comprehensive score of the corresponding hot word according to the acoustic score, the hot word score and the number of words; and determine the command word among the hot words in the hot word library based on the comprehensive score.
  • the acoustic score and hot word score of the phonetic hot words in the hot word library can be obtained.
  • the user's voice interaction request can be input into the decoder, and the acoustic score of each voice hot word in the hot word library can be output.
  • specifically, the end-to-end speech recognition toolkit WeNet is used to maintain the states of a context graph during the decoding process. The hot word score of each voice hot word in the hot word library can then be calculated through the states in the subgraphs of the context graph; that is, a score proportional to the weight of the hot word is added to the original acoustic score.
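WeNet's real context graph is a weighted FST with failure arcs; as a simplified, hypothetical illustration of the same idea, a plain character trie can add a bonus proportional to the hot word weight whenever a hot word is completed inside a hypothesis (class name and weighting scheme are assumptions, not WeNet's actual implementation):

```python
class ContextTrie:
    """Toy context-biasing trie: rewards hypotheses that contain hot words."""

    def __init__(self, hot_words, weight=3.0):
        self.weight = weight
        self.root = {}
        for w in hot_words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["#"] = True  # end-of-hot-word marker

    def score(self, hypothesis: str) -> float:
        """Hot word bonus for a partial hypothesis: `weight` per matched
        character of any hot word that ends inside the hypothesis."""
        bonus = 0.0
        for start in range(len(hypothesis)):
            node, matched = self.root, 0
            for ch in hypothesis[start:]:
                if ch not in node:
                    break
                node, matched = node[ch], matched + 1
                if "#" in node:
                    bonus += self.weight * matched
        return bonus
```

This bonus plays the role of the "score proportional to the weight of the hot word" that is added to the original acoustic score during decoding.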
  • the number of words whose edit distance to the voice hot word is a set value is then determined; that is, for each voice hot word, the number of other voice hot words in the hot word library whose edit distance to it is the set value can be determined. For example (the original examples count single Chinese characters), if voice hot word 1 in the hot word library is "turn the volume up" and two other voice hot words are voice hot word 2 "turn the volume down" and voice hot word 3 "turn the volume high", then the edit distance between hot word 1 and hot word 2 is 1, and the edit distance between hot word 1 and hot word 3 is also 1. Therefore, for voice hot word 1 "turn the volume up", the number of words in the hot word library whose edit distance to it equals the set value of 1 is 2.
  • the comprehensive score of the corresponding hot word is calculated based on the acoustic score, the hot word score and the number of words; that is, the comprehensive score of each hot word in the hot word library can be obtained. Specifically, the acoustic score and the hot word score of a given voice hot word can first be added during beam search through shallow fusion, and then combined with the number of other voice hot words whose edit distance to it equals the set value, to obtain the comprehensive score of the corresponding voice hot word.
  • finally, the command word is determined among the voice hot words in the hot word library based on the comprehensive score; that is, the command word corresponding to the user's voice request can be determined based on the comprehensive score of each voice hot word in the hot word library, and the voice hot word with the highest comprehensive score is determined as the command word.
  • in this way, the voice interaction method of this application not only increases the weight of hot words, but also incorporates into the comprehensive score the number of words whose edit distance to each hot word equals the set value. This weakens, to a certain extent, the influence that the weights of similar hot words in the hot word library have on identifying the command word in the user's voice request.
  • specifically, the comprehensive score can be calculated through the following conditional expression:

    y* = argmax_y [ log P(y|x) + λ log P_C(y) - μ N1(y) ]

    where log P(y|x) is the acoustic score, λ log P_C(y) is the hot word score, λ and μ are the corresponding coefficients, and N1(y) is the vocabulary number: if the set value of the edit distance is 1, then N1(y) is the number of words with an edit distance of 1 from y.
  • that is, the voice interaction method of the present application first adds the acoustic score and the hot word score of a single voice hot word in the hot word library, and then subtracts a term proportional to the number of words whose edit distance to that hot word is the set value, obtaining the comprehensive score of the hot word; the comprehensive score of each hot word in the hot word library is calculated in this way.
  • the voice interaction method of this application not only increases the weight of hot words, but also adds the influence of the vocabulary number on the comprehensive score, which can weaken to a certain extent the impact of the weights of similar hot words in the hot word library on identifying command words in user voice requests.
  • the step of calculating the comprehensive score of the corresponding voice hot word based on the acoustic score, hot word score and vocabulary number can also be implemented through the following conditional expression:

    y* = argmax_y [ log P(y|x) + (λ / N1(y)) log P_C(y) ]

    where log P(y|x) is the acoustic score, log P_C(y) is the hot word score, λ is the corresponding coefficient, and N1(y) is the vocabulary number: if the set value of the edit distance is 1, then N1(y) is the number of words with an edit distance of 1 from y.
  • that is, the voice interaction method of the present application adds the acoustic score and the hot word score of a single voice hot word in the hot word library, where the coefficient λ of the hot word score is divided by the number of words whose edit distance to that voice hot word is the set value; the comprehensive score of the voice hot word is thus calculated, and in turn the comprehensive score of each voice hot word in the hot word library. In this way, the method not only increases the weight of hot words, but also adds the influence of the vocabulary number on the comprehensive score, which can weaken to a certain extent the impact of the weights of similar voice hot words in the hot word library on identifying command words in user voice interaction requests.
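Both conditional expressions above can be sketched numerically. The symbols follow the text (log P(y|x) is the acoustic score, log P_C(y) the hot word score, N1(y) the vocabulary number); the concrete coefficient values and the guard against N1(y) = 0 in the second variant are assumptions, not from the patent:

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def vocab_count(y, hot_words, set_value=1):
    """N1(y): hot words in the library whose edit distance from y equals the set value."""
    return sum(1 for w in hot_words if w != y and edit_distance(y, w) == set_value)

def score_v1(log_p, log_pc, n1, lam=2.0, mu=0.5):
    # variant 1: subtract a penalty proportional to N1(y)
    return log_p + lam * log_pc - mu * n1

def score_v2(log_p, log_pc, n1, lam=2.0):
    # variant 2: divide the hot word coefficient by N1(y)
    # (guarded at >= 1; the patent does not say how N1(y) == 0 is handled)
    return log_p + (lam / max(n1, 1)) * log_pc
```

The command word would then be the hot word maximizing the chosen comprehensive score, as the argmax in both expressions indicates.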
  • the electronic device 100 includes a processor 110 and a memory 120.
  • the memory 120 stores a computer program 121.
  • when the computer program 121 is executed by the processor 110, it implements the voice interaction method described in any of the above embodiments.
  • the electronic device 100 includes a mobile phone, a computer, an iPad and other smart devices with a display panel.
  • the electronic device 100 of the present application, by applying the above voice interaction method, can add unfamiliar words in the user's voice interaction request to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, improving the problem that the speech recognition model cannot accurately understand the user's voice commands during voice interaction.
  • this application also provides a non-volatile computer-readable storage medium 200 containing a computer program 210. When the computer program 210 is executed by one or more processors 220, the voice interaction method described in any of the above embodiments is implemented.
  • Computer program 210 includes computer program code.
  • Computer program code can be in the form of source code, object code, executable file or some intermediate form, etc.
  • Computer-readable storage media can include: any entity or device that can carry computer program code, recording media, USB flash drives, mobile hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), software distribution media, etc.
  • the computer-readable storage medium 200 of the present application can use the above voice interaction method to add unfamiliar words in the user's voice interaction request to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, improving the problem that the speech recognition model cannot accurately understand the user's voice commands in voice interaction.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice interaction method and a voice interaction device (10), an electronic device (100) and a storage medium (200). The voice interaction method includes: in response to a user's preset voice request to update voice hot words, acquiring voice data input by the user according to a preset command template (01); performing speech recognition on the voice data according to the preset command template to obtain a target voice hot word (02); and updating a hot word library of a speech recognition model according to the target voice hot word (03). Unfamiliar words in a user's voice interaction request can be added to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, alleviating the problem that a speech recognition model applied in voice interaction cannot accurately understand the user's voice commands.

Description

Voice interaction method and device, electronic device and storage medium
Cross-reference to related applications
The present disclosure claims priority to the Chinese patent application No. 202210592298.4, entitled "Voice interaction method and device, electronic device and storage medium", filed with the China National Intellectual Property Administration on May 27, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of voice interaction, and in particular to a voice interaction method and device, an electronic device and a storage medium.
Background
In practical applications of speech recognition, commonly used words are recognized fairly well. However, for some specific personal names, song titles, place names or domain-specific terms — for example, the personal name 宋星辰 (Song Xingchen), the song title 国际歌 (The Internationale), the place name 丽泽商务区 (Lize Business District), and the speech recognition term "decoder" — recognition accuracy can be low.
Summary
In view of this, the present application aims to solve, at least to a certain extent, one of the problems in the related art. To this end, an object of the present application is to provide a voice interaction method and device, an electronic device and a storage medium.
An embodiment of the present application provides a voice interaction method. The voice interaction method includes: in response to a user's preset voice request to update voice hot words, acquiring voice data input by the user according to a preset command template; performing speech recognition on the voice data according to the preset command template to obtain a target voice hot word; and updating a hot word library of a speech recognition model according to the target voice hot word.
In some embodiments, before the step of acquiring, in response to the user's preset voice request to update the voice hot words, the voice data input by the user according to the preset command template, the voice interaction method includes: when a voice interaction request input by the user fails to match a command, recording the number of consecutive interaction recognitions that failed to match a command, and adding the command word corresponding to the voice interaction request to a consecutive-failure command word list; and when the number of consecutive interaction recognitions is greater than a threshold and the semantic similarity between the command words in the consecutive-failure command word list satisfies a preset condition, prompting the user to update the voice hot words.
In some embodiments, the voice interaction method includes: when a voice interaction request input by the user successfully matches a command, clearing the number of consecutive interaction recognitions that failed to match a command and the consecutive-failure command word list.
In some embodiments, after the step of recording, when the voice interaction request input by the user fails to match a command, the number of consecutive interaction recognitions that failed to match a command and adding the command word corresponding to the voice interaction request to the consecutive-failure command word list, the voice interaction method includes: determining the semantic similarity according to the edit distance and/or the longest substring between the command words in the consecutive-failure command word list.
In some embodiments, the voice interaction method includes: acquiring a voice interaction request input by the user; processing the voice interaction request according to the hot word library and the speech recognition model to obtain a command word; and executing the control instruction corresponding to the command word.
In some embodiments, processing the voice interaction request according to the hot word library and the speech recognition model to obtain the command word includes: acquiring an acoustic score and a hot word score of each voice hot word in the hot word library; determining the number of words whose edit distance to the voice hot word equals a set value; calculating a comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words; and determining the command word from the voice hot words in the hot word library according to the comprehensive scores.
In some embodiments, the step of calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words is implemented by the following expression:

    $y^{*}=\arg\max_{y}\left[\log P(y\mid x)+\lambda\log P_{C}(y)-\mu N_{1}(y)\right]$

where $\log P(y\mid x)$ is the acoustic score, $\lambda\log P_{C}(y)$ is the hot word score, $N_{1}(y)$ is the number of words, and $\lambda$ and $\mu$ are the corresponding coefficients.
In some embodiments, the step of calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words is implemented by the following expression:

    $y^{*}=\arg\max_{y}\left[\log P(y\mid x)+\frac{\lambda}{N_{1}(y)}\log P_{C}(y)\right]$

where $\log P(y\mid x)$ is the acoustic score, $\log P_{C}(y)$ is the hot word score, $N_{1}(y)$ is the number of words, and $\lambda$ is the corresponding coefficient.
In some embodiments, the longest substring refers to the longest substring without repeated characters.
In some embodiments, the step of acquiring the acoustic score and the hot word score of each voice hot word in the hot word library includes:
inputting the user's voice interaction request into a decoder, and obtaining the acoustic score of each voice hot word in the hot word library from the decoder output; and
using an end-to-end speech recognition tool to maintain states in a context graph during decoding, and calculating the hot word score of each voice hot word in the hot word library from the states of the subgraphs in the context graph.
The present application further provides a voice interaction device. The voice interaction device includes an acquisition module, a recognition module and a hot word library update module. The acquisition module is configured to acquire, in response to a user's preset voice request to update voice hot words, voice data input by the user according to a preset command template; the recognition module is configured to perform speech recognition on the voice data according to the preset command template to obtain a target voice hot word; and the hot word library update module is configured to update a hot word library of a speech recognition model according to the target voice hot word.
The present application further provides an electronic device. The electronic device includes a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the voice interaction method described in any of the above embodiments.
The present application further provides a non-volatile computer-readable storage medium containing a computer program. When the computer program is executed by one or more processors, the processors are caused to perform the voice interaction method described in any of the above embodiments.
The present application further provides a computer program, including computer-readable code which, when run on a computing device, causes the computing device to perform the voice interaction method described in any of the above embodiments.
Additional aspects and advantages of the present application will be given in part in the following description, will become apparent in part from the following description, or will be learned through practice of the present application.
The above description is only an overview of the technical solution of the present disclosure. In order that the technical means of the present disclosure may be understood more clearly, it can be implemented according to the contents of the specification; and in order to make the above and other objects, features and advantages of the present disclosure more apparent and understandable, specific embodiments of the present disclosure are set forth below.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present disclosure or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the embodiments in conjunction with the drawings, in which:
Fig. 1 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
Fig. 2 is a schematic structural diagram of a voice interaction device according to some embodiments of the present application;
Fig. 3 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
Fig. 4 is a schematic structural diagram of a voice interaction device according to some embodiments of the present application;
Fig. 5 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
Fig. 6 is a schematic structural diagram of a voice interaction device according to some embodiments of the present application;
Fig. 7 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
Fig. 8 is a schematic structural diagram of a voice interaction device according to some embodiments of the present application;
Fig. 9 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
Fig. 10 is a schematic structural diagram of a voice interaction device according to some embodiments of the present application;
Fig. 11 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
Fig. 12 is a schematic structural diagram of an electronic device according to some embodiments of the present application;
Fig. 13 is a schematic structural diagram of a computer-readable storage medium according to some embodiments of the present application.
Detailed description of the embodiments
Embodiments of the present application are described in detail below, examples of which are shown in the drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are only used to explain the present application, and should not be construed as limiting the present application.
In the description of the present application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined with "first" or "second" may explicitly or implicitly include one or more of said features. In the description of the present application, "a plurality of" means two or more, unless specifically defined otherwise.
In the description of the present application, it should be noted that, unless otherwise explicitly specified and defined, the terms "mounted", "connected" and "coupled" should be understood broadly: for example, a connection may be fixed, detachable or integral; it may be mechanical or electrical, or the elements may communicate with each other; it may be direct, or indirect via an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements. For a person of ordinary skill in the art, the specific meanings of the above terms in the present application can be understood according to the specific circumstances.
The following disclosure provides many different embodiments or examples for implementing different structures of the present application. To simplify the disclosure of the present application, components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit the present application. In addition, the present application may repeat reference numerals and/or reference letters in different examples; such repetition is for the purpose of simplicity and clarity and does not in itself indicate a relationship between the various embodiments and/or arrangements discussed.
Referring to Fig. 1, the present application provides a voice interaction method. The voice interaction method includes:
01: in response to a user's preset voice request to update voice hot words, acquiring voice data input by the user according to a preset command template;
02: performing speech recognition on the voice data according to the preset command template to obtain a target voice hot word;
03: updating a hot word library of a speech recognition model according to the target voice hot word.
Referring also to Fig. 2, the present application further provides a voice interaction device 10. The voice interaction device 10 includes an acquisition module 11, a recognition module 12 and a hot word library update module 13. The acquisition module 11 is configured to acquire, in response to a user's preset voice request to update voice hot words, voice data input by the user according to a preset command template; the recognition module 12 is configured to perform speech recognition on the voice data according to the preset command template to obtain a target voice hot word; and the hot word library update module 13 is configured to update a hot word library of a speech recognition model according to the target voice hot word.
Hot word boosting is a technique in ASR systems that raises the recognition probability of specific contextual phrases (such as personal names, music playlists, proper nouns, etc.) to achieve better recognition performance. Hot word techniques help a speech recognition model adapt to more scenarios, for example by adding common vocabulary lists to raise the recognition probability of common nouns in various fields, and they can also, to a certain extent, help the model recognize out-of-vocabulary (OOV) words.
Understandably, when a user issues a voice request containing unfamiliar words, the speech recognition model cannot recognize the correct command corresponding to that request. In this case, the user can issue a preset voice request to update the voice hot words, thereby updating the hot word library of the speech recognition model; the unfamiliar words in the voice request are added to the hot word library, so that the voice request can be recognized by the speech recognition model to obtain the corresponding command.
First, acquiring, in response to the user's preset voice request to update voice hot words, the voice data input by the user according to the preset command template means that, after the user issues the preset voice request to update voice hot words, the voice data input by the user according to the preset command template can be acquired. The preset voice request is, for example, "update voice hot words". As for the voice data input by the user according to the preset command template: for example, if the preset command template is "行李的李,潇洒的潇" (spelling out characters: "the 李 in 行李 (luggage), the 潇 in 潇洒 (carefree)"), the hot word recognized from this template is "李潇" (Li Xiao), and the voice data input by the user according to this template may be "关闭的闭,灭灯的灯" ("the 闭 in 关闭 (close), the 灯 in 灭灯 (lights out)").
Then, performing speech recognition on the voice data according to the preset command template to obtain the target voice hot word means that the speech recognition model can recognize the target voice hot word input by the user from the fixed sentence pattern of the preset command template. Following the fixed pattern of the template "行李的李,潇洒的潇" above, the target voice hot word recognized from the user's input "关闭的闭,灭灯的灯" is "闭灯" ("turn off the light").
Finally, the hot word library of the speech recognition model is updated according to the target voice hot word. For example, the target voice hot word "闭灯" recognized above can be added to the hot word library of the speech recognition model.
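As a rough illustration of the template-based extraction described above (the function name and the assumption that each comma-separated segment has the form "&lt;word&gt;的&lt;character&gt;" are mine, not the patent's):

```python
def extract_hot_word(utterance: str) -> str:
    """Extract a hot word from input following the template
    "<word>的<char>,<word>的<char>", e.g. "关闭的闭,灭灯的灯" -> "闭灯"."""
    chars = []
    for segment in utterance.replace("，", ",").split(","):
        word, sep, char = segment.partition("的")
        if sep and char:
            chars.append(char[0])  # the character being spelled out
    return "".join(chars)

print(extract_hot_word("关闭的闭,灭灯的灯"))  # 闭灯
print(extract_hot_word("行李的李,潇洒的潇"))  # 李潇
```

A real system would run this over the recognizer's transcript of the user's speech rather than a clean string, and would need to tolerate recognition noise in the template itself.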
In this way, the voice interaction method of the present application can add unfamiliar words in the user's voice interaction request to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, alleviating the problem that a speech recognition model applied in voice interaction cannot accurately understand the user's voice commands.
Referring to Fig. 3, before step 01, the voice interaction method includes:
011: when a voice interaction request input by the user fails to match a command, recording the number of consecutive interaction recognitions that failed to match a command, and adding the command word corresponding to the voice interaction request to a consecutive-failure command word list;
013: when the number of consecutive interaction recognitions is greater than a threshold, and the semantic similarity between the command words in the consecutive-failure command word list satisfies a preset condition, prompting the user to update the voice hot words.
Referring to Fig. 4, the voice interaction device 10 further includes a recording module 111 and a prompt module 113. The recording module 111 is configured to record, when a voice interaction request input by the user fails to match a command, the number of consecutive interaction recognitions that failed to match a command, and to add the command word corresponding to the voice interaction request to the consecutive-failure command word list; the prompt module 113 is configured to prompt the user to update the voice hot words when the number of consecutive interaction recognitions is greater than the threshold and the semantic similarity between the command words in the consecutive-failure command word list satisfies the preset condition.
When a voice interaction request input by the user fails to match a command, the number of consecutive interaction recognitions that failed to match a command is recorded, and the command word corresponding to the voice interaction request is added to the consecutive-failure command word list. For example, suppose the voice interaction request input by the user is "播放国际歌" ("play The Internationale") and no corresponding command can be matched in the recognition result of the speech model; the voice interaction request then fails to match a command. In this case, the number of consecutive interaction recognitions that failed to match a command can be recorded, and it may be 2 or more.
When the number of consecutive interaction recognitions that failed to match a command is greater than the threshold and the semantic similarity between the command words in the consecutive-failure command word list satisfies the preset condition, the user is prompted to update the voice hot words. The threshold may be 2, and the consecutive-failure command word list is a table of the command words that consecutively failed recognition. For example, the command word list produced by 4 consecutive failed user interaction recognitions includes "熄灯,灭灯,黑灯,闭灯" (four near-synonymous ways of saying "lights off"); the command words input by the user in all 4 attempts point to the command "关灯" ("turn off the light"), and the semantic similarity between the 4 command words in the consecutive-failure command word list satisfies the preset condition, so the user can be prompted to update the voice hot words.
For example, the preset condition may be a semantic similarity of 60%; if the semantic similarity between the 4 command words is 80%, the semantic similarity between them satisfies the preset condition, and the user can be prompted to update the voice hot words.
In this way, the voice interaction method of the present application can record the number of consecutive interaction recognitions that failed to match a command, add the command words corresponding to the voice interaction requests to the consecutive-failure command word list, and, when the number of consecutive interaction recognitions is greater than the threshold and the semantic similarity between the command words in the consecutive-failure command word list satisfies the preset condition, prompt the user to update the voice hot words.
Referring to Fig. 5, the voice interaction method includes:
015: when a voice interaction request input by the user successfully matches a command, clearing the number of consecutive interaction recognitions that failed to match a command and the consecutive-failure command word list.
Referring to Fig. 6, the voice interaction device 10 further includes a clearing module 115. The clearing module 115 is configured to clear, when a voice interaction request input by the user successfully matches a command, the number of consecutive interaction recognitions that failed to match a command and the consecutive-failure command word list.
Specifically, when a voice interaction request input by the user successfully matches a command, the number of consecutive interaction recognitions that failed to match a command and the consecutive-failure command word list are cleared. For example, suppose the number of consecutive failed recognitions is 2 and the consecutive-failure command word list contains the failed command words "熄灯,灭灯"; if the user's next voice interaction request is "关灯" ("turn off the light") and the corresponding light-off command is matched, that command can be returned to complete the voice interaction, and correspondingly the number of consecutive failed recognitions is cleared (set to 0) and the list is emptied. In other words, when a voice interaction request successfully matches a command, the user can complete the voice interaction with the current voice request and no hot word needs to be added, so the failure count can be restarted and the consecutive-failure command word list can record newly failed command words from scratch.
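The counter-and-list bookkeeping above can be sketched as follows (the class name, default thresholds and the pairwise-similarity check are illustrative assumptions, not details fixed by the patent):

```python
class HotWordPromptTracker:
    """Track consecutive command-match failures and decide when to
    prompt the user to update the voice hot words."""

    def __init__(self, fail_threshold: int = 2, similarity_threshold: float = 0.6):
        self.fail_threshold = fail_threshold
        self.similarity_threshold = similarity_threshold
        self.consecutive_failures = 0
        self.failed_command_words: list[str] = []

    def on_match_success(self) -> None:
        # A successfully matched command resets both the counter and the list.
        self.consecutive_failures = 0
        self.failed_command_words.clear()

    def on_match_failure(self, command_word: str, similarity_fn) -> bool:
        """Record a failure; return True when the user should be
        prompted to update the voice hot words."""
        self.consecutive_failures += 1
        self.failed_command_words.append(command_word)
        if self.consecutive_failures <= self.fail_threshold:
            return False
        # Require every pair of failed command words to be similar enough.
        words = self.failed_command_words
        pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
        return all(similarity_fn(a, b) >= self.similarity_threshold
                   for a, b in pairs)
```

`similarity_fn` would be backed by the edit-distance and/or longest-substring rules described in the next step.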
Referring to Fig. 7, after step 011, the voice interaction method includes:
012: determining the semantic similarity according to the edit distance and/or the longest substring between the command words in the consecutive-failure command word list.
Referring also to Fig. 8, the voice interaction device 10 further includes a similarity determination module 112. The similarity determination module 112 is configured to determine the semantic similarity according to the edit distance and/or the longest substring between the command words in the consecutive-failure command word list.
The semantic similarity is determined according to the edit distance and/or the longest substring between the command words in the consecutive-failure command word list. That is, the voice interaction method of the present application can measure the semantic similarity between command words with a rule-based method that compares the edit distance or the longest substring between the command words in the consecutive-failure command word list.
The edit distance is the minimum number of edit operations required to transform one string into another; it describes how close two strings are. The allowed edit operations are substitutions, insertions and deletions of characters. For example, 调高音量 -> 调大音量 ("turn up the volume") only requires replacing "高" with "大", so the edit distance is 1; 请关闭灯光 -> 请熄灭灯光 ("please turn off the lights") requires replacing "关" with "熄" and "闭" with "灭", so the edit distance is 2; 调高音量 -> 请调高电视音量 requires inserting "请" at the front and "电视" between "调高" and "音量", so the edit distance is 3.
For example, suppose the command words in the consecutive-failure command word list include "熄灯,灭灯,黑灯,闭灯". The edit distances between "熄灯", "灭灯", "黑灯" and "闭灯" are all 1, indicating that the semantic similarity of these 4 consecutively failed command words is relatively high, i.e., their meanings are quite similar.
The longest substring refers to the longest substring without repeated characters. For example, suppose the command words in the consecutive-failure command word list include "熄灯,灭灯,黑灯,闭灯". The longest substring between "熄灯", "灭灯", "黑灯" and "闭灯" is 1, which can also indicate that the semantic similarity of these 4 consecutively failed command words is relatively high, i.e., their meanings are quite similar.
In this way, the voice interaction method of the present application can measure the semantic similarity between command words by rule-based comparison of the edit distance and/or the longest substring between the command words in the consecutive-failure command word list, obtaining the semantic similarity between the consecutively failed command words and laying the foundation for judging whether that similarity satisfies the preset condition.
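The edit-distance examples above can be checked with a standard dynamic-programming (Levenshtein) implementation; this sketch is mine, not part of the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions and deletions needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance("调高音量", "调大音量"))        # 1
print(edit_distance("请关闭灯光", "请熄灭灯光"))    # 2
print(edit_distance("调高音量", "请调高电视音量"))  # 3
```

A pairwise similarity in [0, 1] could then be derived as, for example, `1 - d / max(len(a), len(b))`, though the patent does not fix a particular normalization.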
In addition, it has been found that, for some words in a voice interaction request, the reason recognition is inaccurate during voice interaction is that hot words in the hot word library cause the ordinary words originally present in the request to be misrecognized. For example, when "调大音量" ("turn up the volume") is set as a hot word, the command word "调小音量" ("turn down the volume") in a user's voice interaction request is also often recognized as "调大音量".
In view of this, referring to Fig. 9, the voice interaction method includes:
04: acquiring a voice interaction request input by the user;
05: processing the voice interaction request according to the hot word library and the speech recognition model to obtain a command word;
06: executing the control instruction corresponding to the command word.
Referring to Fig. 10, the voice interaction device further includes a voice processing module 15 and an instruction execution module 16.
Referring also to Fig. 2, the acquisition module 11 is configured to acquire a voice interaction request input by the user; the voice processing module 15 is configured to process the voice interaction request according to the hot word library and the speech recognition model to obtain a command word; and the instruction execution module 16 is configured to execute the control instruction corresponding to the command word.
Specifically, for a voice interaction request initiated by the user, the interaction method of the present application can first process the voice interaction request according to the hot word library and the speech recognition model to obtain a command word, and then execute the control instruction corresponding to the command word.
For example, the voice interaction request input by the user may be "将电视调小音量" ("turn down the TV volume"). If the hot word library contains the hot word "调小音量", the voice interaction request can be processed according to the hot word library and the speech recognition model to obtain the command word "调小音量", and the control instruction corresponding to the command word "调小音量" is then executed.
In this way, the interaction method of the present application can first process the voice interaction request according to the hot word library and the speech recognition model to obtain a command word, and then execute the control instruction corresponding to the command word.
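A minimal sketch of steps 04-06 — recognize a command word, then dispatch its control instruction (the command table, device class and volume step are illustrative assumptions):

```python
class FakeTV:
    """Stand-in device used only for this illustration."""

    def __init__(self):
        self.volume = 50

    def set_volume(self, value: int) -> None:
        self.volume = max(0, min(100, value))

# Command word -> control instruction (illustrative mapping).
COMMANDS = {
    "调小音量": lambda tv: tv.set_volume(tv.volume - 10),  # turn volume down
    "调大音量": lambda tv: tv.set_volume(tv.volume + 10),  # turn volume up
}

def execute(command_word: str, device: FakeTV) -> bool:
    """Execute the control instruction for a recognized command word.
    Returns False on a match failure, which would feed the
    consecutive-failure tracking described earlier."""
    action = COMMANDS.get(command_word)
    if action is None:
        return False
    action(device)
    return True

tv = FakeTV()
execute("调小音量", tv)
print(tv.volume)  # 40
```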
Referring to Fig. 11, step 05 includes:
051: acquiring the acoustic score and the hot word score of each voice hot word in the hot word library;
052: determining the number of words whose edit distance to the voice hot word equals a set value;
053: calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words;
054: determining the command word from the voice hot words in the hot word library according to the comprehensive scores.
Referring also to Fig. 10, the voice processing module 15 is configured to acquire the acoustic score and the hot word score of each voice hot word in the hot word library; determine the number of words whose edit distance to the voice hot word equals a set value; calculate the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words; and determine the command word from the voice hot words in the hot word library according to the comprehensive scores.
First, the acoustic score and the hot word score of each voice hot word in the hot word library are acquired. Specifically, the user's voice interaction request can be input into a decoder, whose output yields the acoustic score of each voice hot word in the hot word library. In addition, since the principle of hot word updating is to inject some prior knowledge into the speech recognition system, the end-to-end speech recognition tool WeNet maintains states in a context graph during decoding. The hot word score of each voice hot word in the hot word library can therefore be calculated from the states in the subgraphs of the context graph, i.e., a score proportional to the hot word weight is added on top of the original acoustic score.
Then, the number of words whose edit distance to the voice hot word equals a set value is determined; that is, for each voice hot word in the hot word library, the number of other voice hot words at an edit distance equal to the set value can be determined. For example, suppose voice hot word 1 in the library is "调大音量" and the two other voice hot words are voice hot word 2 "调小音量" and voice hot word 3 "调高音量". The edit distance between voice hot word 1 "调大音量" and voice hot word 2 "调小音量" is 1, and the edit distance between voice hot word 1 "调大音量" and voice hot word 3 "调高音量" is also 1. If the set value of the edit distance is 1, then for voice hot word 1 "调大音量" the number of voice hot words in the library at an edit distance equal to the set value is 2.
Next, the comprehensive score of the corresponding voice hot word is calculated from the acoustic score, the hot word score and the number of words; that is, a comprehensive score can be obtained for each voice hot word in the hot word library. Specifically, in the form of shallow fusion, the acoustic score and the hot word score of a voice hot word are first added together during beam search, and the comprehensive score of that voice hot word is then obtained by factoring in the number of other voice hot words at an edit distance equal to the set value.
Finally, the command word is determined from the voice hot words in the hot word library according to the comprehensive scores. That is, the command word corresponding to the user's voice request can be determined according to the comprehensive score of each voice hot word in the hot word library, with the voice hot word having a high comprehensive score determined as the command word.
In this way, the voice interaction method of the present application not only increases the hot word weights; because the number of other voice hot words at an edit distance equal to the set value now influences the comprehensive score, the influence of the hot word weights of similar voice hot words in the library on recognizing the command word in the user's voice request can be weakened to a certain extent.
The step of calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words is implemented by the following expression:

    $y^{*}=\arg\max_{y}\left[\log P(y\mid x)+\lambda\log P_{C}(y)-\mu N_{1}(y)\right]$

where $\log P(y\mid x)$ is the acoustic score, $\lambda\log P_{C}(y)$ is the hot word score, $N_{1}(y)$ is the number of words, and $\lambda$ and $\mu$ are the corresponding coefficients.
Specifically, if the set value of the edit distance is 1, then $N_{1}(y)$ is the number of words at an edit distance of 1 from $y$.
For a single voice hot word in the hot word library, the voice interaction method of the present application first adds up the acoustic score and the hot word score of that voice hot word, and then subtracts the number of other voice hot words at an edit distance equal to the set value, thereby calculating the comprehensive score of that voice hot word and, in turn, the comprehensive score of every voice hot word in the hot word library.
In this way, the voice interaction method of the present application not only increases the hot word weights; because the number of other voice hot words at an edit distance equal to the set value now influences the comprehensive score, the influence of the hot word weights of similar voice hot words in the library on recognizing the command word in the user's voice request can be weakened to a certain extent.
In addition, the step of calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words can also be implemented by the following expression:

    $y^{*}=\arg\max_{y}\left[\log P(y\mid x)+\frac{\lambda}{N_{1}(y)}\log P_{C}(y)\right]$

where $\log P(y\mid x)$ is the acoustic score, $\log P_{C}(y)$ is the hot word score, $N_{1}(y)$ is the number of words, and $\lambda$ is the corresponding coefficient.
Specifically, if the set value of the edit distance is 1, then $N_{1}(y)$ is the number of words at an edit distance of 1 from $y$.
For a single voice hot word in the hot word library, the voice interaction method of the present application adds up the acoustic score and the hot word score of that voice hot word, with the coefficient $\lambda$ of the hot word score divided by the number of other voice hot words at an edit distance equal to the set value, thereby calculating the comprehensive score of that voice hot word and, in turn, the comprehensive score of every voice hot word in the hot word library.
In this way, the voice interaction method of the present application not only increases the hot word weights; because the number of other voice hot words at an edit distance equal to the set value now influences the comprehensive score, the influence of the hot word weights of similar voice hot words in the library on recognizing the command word in the user's voice interaction request can be weakened to a certain extent.
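The two scoring variants can be sketched together as follows; the numeric scores, coefficient values and function names are illustrative assumptions (in practice the acoustic score comes from the decoder output and the hot word score from the WeNet context graph):

```python
def edit_distance(a: str, b: str) -> int:
    """One-row Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def count_close_words(hot_word: str, library: list[str],
                      set_value: int = 1) -> int:
    """N1(y): number of other hot words at an edit distance equal to
    the set value."""
    return sum(1 for w in library
               if w != hot_word and edit_distance(hot_word, w) == set_value)

def score_subtractive(acoustic: float, hot: float, n_close: int,
                      lam: float = 0.5, mu: float = 0.1) -> float:
    # Variant 1: log P(y|x) + λ·log P_C(y) − μ·N1(y)
    return acoustic + lam * hot - mu * n_close

def score_divisive(acoustic: float, hot: float, n_close: int,
                   lam: float = 0.5) -> float:
    # Variant 2: log P(y|x) + (λ / N1(y))·log P_C(y); guard N1(y) = 0
    return acoustic + (lam / max(n_close, 1)) * hot

library = ["调大音量", "调小音量", "调高音量"]
n = count_close_words("调大音量", library)
print(n)  # 2: both other hot words are at edit distance 1
```

Under both variants, a hot word surrounded by many near-identical hot words receives less of a boost, which is exactly the weakening effect described above.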
Referring to Fig. 12, the present application further provides an electronic device 100. The electronic device 100 includes a processor 110 and a memory 120. The memory 120 stores a computer program 121 which, when executed by the processor 110, implements the voice interaction method described in any of the above embodiments. The electronic device 100 includes mobile phones, computers, iPads and other smart devices with a display panel.
By applying the above voice interaction method, the electronic device 100 of the present application can add unfamiliar words in the user's voice interaction request to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, alleviating the problem that a speech recognition model applied in voice interaction cannot accurately understand the user's voice commands.
Referring to Fig. 13, the present application further provides a non-volatile computer-readable storage medium 200 containing a computer program. When the computer program 210 is executed by one or more processors 220, the voice interaction method described in any of the above embodiments is implemented.
For example, when the computer program 210 is executed by the processor 220, the steps of the following voice interaction method are implemented:
01: in response to a user's preset voice request to update voice hot words, acquiring voice data input by the user according to a preset command template;
02: performing speech recognition on the voice data according to the preset command template to obtain a target voice hot word;
03: updating a hot word library of a speech recognition model according to the target voice hot word.
Understandably, the computer program 210 includes computer program code. The computer program code may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a software distribution medium, etc.
By applying the above voice interaction method, the computer-readable storage medium 200 of the present application can add unfamiliar words in the user's voice interaction request to the hot word library, so that the user's voice interaction request can be recognized by the speech recognition model to obtain the corresponding command, alleviating the problem that a speech recognition model applied in voice interaction cannot accurately understand the user's voice commands.
The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent of this application. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

Claims (14)

  1. A voice interaction method, comprising:
    in response to a user's preset voice request to update voice hot words, acquiring voice data input by the user according to a preset command template;
    performing speech recognition on the voice data according to the preset command template to obtain a target voice hot word; and
    updating a hot word library of a speech recognition model according to the target voice hot word.
  2. The voice interaction method according to claim 1, wherein, before the step of acquiring, in response to the user's preset voice request to update the voice hot words, the voice data input by the user according to the preset command template, the voice interaction method further comprises:
    when a voice interaction request input by the user fails to match a command, recording the number of consecutive interaction recognitions that failed to match a command, and adding the command word corresponding to the voice interaction request to a consecutive-failure command word list; and
    when the number of consecutive interaction recognitions is greater than a threshold and the semantic similarity between the command words in the consecutive-failure command word list satisfies a preset condition, prompting the user to update the voice hot words.
  3. The voice interaction method according to claim 2, wherein the voice interaction method further comprises:
    when the voice interaction request input by the user successfully matches a command, clearing the number of consecutive interaction recognitions that failed to match a command and the consecutive-failure command word list.
  4. The voice interaction method according to claim 2, wherein, after the step of recording, when the voice interaction request input by the user fails to match a command, the number of consecutive interaction recognitions that failed to match a command and adding the command word corresponding to the voice interaction request to the consecutive-failure command word list, the voice interaction method further comprises:
    determining the semantic similarity according to the edit distance and/or the longest substring between the command words in the consecutive-failure command word list.
  5. The voice interaction method according to claim 1, wherein the voice interaction method further comprises:
    acquiring a voice interaction request input by the user;
    processing the voice interaction request according to the hot word library and the speech recognition model to obtain a command word; and
    executing the control instruction corresponding to the command word.
  6. The voice interaction method according to claim 5, wherein the step of processing the voice interaction request according to the hot word library and the speech recognition model to obtain the command word comprises:
    acquiring an acoustic score and a hot word score of each voice hot word in the hot word library;
    determining the number of words whose edit distance to the voice hot word equals a set value;
    calculating a comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words; and
    determining the command word from the voice hot words in the hot word library according to the comprehensive scores.
  7. The voice interaction method according to claim 6, wherein the step of calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words is implemented by the following expression:
    $y^{*}=\arg\max_{y}\left[\log P(y\mid x)+\lambda\log P_{C}(y)-\mu N_{1}(y)\right]$
    wherein $\log P(y\mid x)$ is the acoustic score, $\log P_{C}(y)$ is the hot word score, $N_{1}(y)$ is the number of words, and $\lambda$ and $\mu$ are the corresponding coefficients.
  8. The voice interaction method according to claim 6, wherein the step of calculating the comprehensive score of the corresponding voice hot word according to the acoustic score, the hot word score and the number of words is implemented by the following expression:
    $y^{*}=\arg\max_{y}\left[\log P(y\mid x)+\frac{\lambda}{N_{1}(y)}\log P_{C}(y)\right]$
    wherein $\log P(y\mid x)$ is the acoustic score, $\log P_{C}(y)$ is the hot word score, $N_{1}(y)$ is the number of words, and $\lambda$ is the corresponding coefficient.
  9. The voice interaction method according to claim 4, wherein the longest substring refers to the longest substring without repeated characters.
  10. The voice interaction method according to any one of claims 6-8, wherein the step of acquiring the acoustic score and the hot word score of each voice hot word in the hot word library comprises:
    inputting the user's voice interaction request into a decoder, and obtaining the acoustic score of each voice hot word in the hot word library from the decoder output; and
    using an end-to-end speech recognition tool to maintain states in a context graph during decoding, and calculating the hot word score of each voice hot word in the hot word library from the states of the subgraphs in the context graph.
  11. A voice interaction device, comprising:
    an acquisition module configured to acquire, in response to a user's preset voice request to update voice hot words, voice data input by the user according to a preset command template;
    a recognition module configured to perform speech recognition on the voice data according to the preset command template to obtain a target voice hot word; and
    a hot word library update module configured to update a hot word library of a speech recognition model according to the target voice hot word.
  12. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, implements the voice interaction method according to any one of claims 1-10.
  13. A non-volatile computer-readable storage medium containing a computer program, wherein, when the computer program is executed by one or more processors, the processors are caused to perform the voice interaction method according to any one of claims 1-10.
  14. A computer program, comprising computer-readable code which, when run on a computing device, causes the computing device to perform the voice interaction method according to any one of claims 1-10.
PCT/CN2023/091826 2022-05-27 2023-04-28 Voice interaction method and device, electronic device and storage medium WO2023226700A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210592298.4A CN117174077A (zh) 2022-05-27 2022-05-27 Voice interaction method and device, electronic device and storage medium
CN202210592298.4 2022-05-27

Publications (1)

Publication Number Publication Date
WO2023226700A1 true WO2023226700A1 (zh) 2023-11-30

Family

ID=88918359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/091826 WO2023226700A1 (zh) 2022-05-27 2023-04-28 Voice interaction method and device, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN117174077A (zh)
WO (1) WO2023226700A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030120493A1 (en) * 2001-12-21 2003-06-26 Gupta Sunil K. Method and system for updating and customizing recognition vocabulary
CN111028830A (zh) * 2019-12-26 2020-04-17 Dazhong Wenwen (Beijing) Information Technology Co., Ltd. Local hot word library update method, device and equipment
CN112420034A (zh) * 2020-09-14 2021-02-26 Dangqu Network Technology (Hangzhou) Co., Ltd. Speech recognition method, system, electronic device and storage medium
CN113241070A (zh) * 2021-04-28 2021-08-10 Beijing Zitiao Network Technology Co., Ltd. Hot word recall and update method, device, storage medium and hot word system
CN113436614A (zh) * 2021-07-02 2021-09-24 iFlytek Co., Ltd. Speech recognition method, device, equipment, system and storage medium
CN114333791A (zh) * 2021-12-10 2022-04-12 Guangzhou Xiaopeng Motors Technology Co., Ltd. Speech recognition method, server, speech recognition system, readable storage medium


Also Published As

Publication number Publication date
CN117174077A (zh) 2023-12-05

Similar Documents

Publication Publication Date Title
CN107305768B Error-prone character calibration method in voice interaction
TWI664540B Search word error correction method and device, and weighted edit distance calculation method and device
CN108847241B Method for recognizing conference speech as text, electronic device and storage medium
US11328133B2 Translation processing method, translation processing device, and device
JP4945086B2 Statistical language model for logical forms
US7299187B2 Voice command processing system and computer therefor, and voice command processing method
JP5533042B2 Voice search device, voice search method, program and recording medium
JP7059213B2 Display control system, program, and storage medium
WO2018045646A1 Artificial-intelligence-based human-computer interaction method and device
KR102075505B1 Method and system for extracting key keywords
CN112331206A Speech recognition method and device
WO2017166631A1 Speech signal processing method, device and electronic device
JP5932869B2 Unsupervised learning method, learning device, and learning program for N-gram language models
US10997223B1 Subject-specific data set for named entity resolution
WO2017020454A1 Retrieval method and device
CN107861948B Label extraction method, device, equipment and medium
CN112861521B Speech recognition result error correction method, electronic device and storage medium
KR20210060897A Speech processing method and apparatus
WO2020220824A1 Method and device for recognizing speech
WO2024045475A1 Speech recognition method, device, equipment and medium
CN114154487A Automatic text error correction method, device, electronic device and storage medium
US20200286318A1 Method, electronic device, and computer readable storage medium for creating a vote
JP2023002690A Semantics recognition method, device, electronic device and storage medium
WO2020052060A1 Method and device for generating corrected sentences
WO2023226700A1 Voice interaction method and device, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810773

Country of ref document: EP

Kind code of ref document: A1