WO2022217621A1 - Method and apparatus for voice interaction - Google Patents

Method and apparatus for voice interaction

Info

Publication number
WO2022217621A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic recognition
recognition result
priority
parameter
preset condition
Prior art date
Application number
PCT/CN2021/087958
Other languages
English (en)
French (fr)
Inventor
苏琪
聂为然
潘倞燊
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180005755.9A (CN115500085A)
Priority to EP21936495.7A (EP4318464A4)
Priority to PCT/CN2021/087958 (WO2022217621A1)
Publication of WO2022217621A1
Priority to US18/488,647 (US20240046931A1)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Definitions

  • the present application relates to the field of electronic devices, and more particularly, to methods and apparatuses for voice interaction.
  • the user and the electronic device can interact by voice.
  • the user can speak voice commands to the electronic device.
  • the electronic device can acquire the voice instruction and perform the operation indicated by the voice instruction.
  • the electronic device itself can recognize voice commands.
  • the ability of electronic devices to recognize voice commands may be relatively limited. If the voice command is only recognized by the electronic device, the recognition result may be inaccurate, and the electronic device may not be able to give an appropriate response to the user.
  • the electronic device can also upload the voice information related to the voice command to the cloud; the cloud can recognize the voice information and feed back the recognized result to the electronic device.
  • the cloud's speech recognition ability and natural language understanding ability can be relatively strong. However, the interaction of the electronic device with the cloud may depend on the current network status of the electronic device; as a result, the interaction between the electronic device and the cloud may incur a relatively long delay. If the voice command is recognized through the cloud, the electronic device may not be able to obtain the recognition result from the cloud in a timely manner, and thus cannot respond quickly to the user.
  • the present application provides a voice interaction method and device, which aims to take into account the accuracy and efficiency of voice recognition, thereby facilitating appropriate and quick responses to users.
  • in a first aspect, a voice interaction method applied to a first device is provided. The method includes: acquiring first voice information; determining a first semantic recognition result according to the first voice information; and, according to the first semantic recognition result and a first preset condition, determining to perform a first operation determined by the first device according to the first semantic recognition result, or determining to perform a second operation indicated by a second device.
  • the second device may indicate the second operation to the first device by sending one or more of the semantic recognition result and the operation information to the first device.
  • the first device may determine the operation corresponding to the user's voice instruction without relying on the information provided by the second device. In this way, it is beneficial to reduce the response time delay for the first device to execute the user's voice command, and improve the response efficiency.
  • the first device may determine the operation corresponding to the user's voice command according to the information provided by the second device. Therefore, it is beneficial to improve the accuracy with which the first device responds to the user's voice command.
  • the voice command processing mode can be flexibly selected according to the voice interaction scene that the first device is good at, so as to balance the response delay and the response accuracy.
  • the method further includes: the first device sending the first voice information to the second device.
  • the first device sends the first voice information to the second device, and the second device may send feedback for the first voice information to the first device.
  • the first device can adjust its own semantic recognition model, voice control model, and the like according to the feedback from the second device, which helps improve the accuracy of the semantic recognition results output by the first device and optimize the applicability of the operations performed in response to the user's voice commands; alternatively, the feedback of the second device may be ignored. For example, when the feedback of the second device arrives later than the first device's own voice recognition result, the feedback of the second device is ignored.
  • the first device may also acquire feedback for the first voice information from the second device relatively faster. Therefore, it is beneficial to shorten the time for the first device to respond to the user's instruction.
  • the determining, according to the first semantic recognition result and the first preset condition, to perform the first operation determined by the first device according to the first semantic recognition result, or to perform the second operation indicated by the second device, includes:
  • in a case where the first semantic recognition result satisfies the first preset condition, determining to perform the first operation; in a case where the first semantic recognition result does not satisfy the first preset condition, determining to perform the second operation indicated by the second device.
  • the first preset condition helps the first device judge whether it can recognize the current voice information relatively accurately, and further facilitates balancing the accuracy and efficiency of voice recognition; the dispatch between the two cases is sketched below.
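  • For illustration only, the following minimal Python sketch shows one way such a dispatch might look; all names (PRESET_FUNCTIONS, satisfies_first_preset_condition, local_nlu, cloud_client) are assumptions invented for this sketch, not terms defined by the patent.

```python
# Hypothetical sketch of the end-side/cloud-side dispatch described above.
PRESET_FUNCTIONS = {"vehicle_control", "navigation", "audio", "video"}

def satisfies_first_preset_condition(result):
    # One concrete instance of the first preset condition: the function
    # indicated by the semantic recognition result belongs to the presets.
    return result.get("function") in PRESET_FUNCTIONS

def handle_first_voice_information(voice_info, local_nlu, cloud_client):
    # The first device always produces its own (first) semantic recognition result.
    first_result = local_nlu.recognize(voice_info)
    if satisfies_first_preset_condition(first_result):
        # End-side decision: the first operation, determined locally (low latency).
        return ("local", first_result)
    # Otherwise perform the second operation indicated by the second device.
    return ("cloud", cloud_client.request_operation(voice_info))
```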
  • the first device is preset with multiple functions, and the first semantic recognition result satisfies a first preset condition, including:
  • the first semantic recognition result indicates a first function, and the first function belongs to the plurality of functions.
  • the first device is preset with multiple functions, and that the first semantic recognition result does not meet the first preset condition includes: the first semantic recognition result indicates a first function, and the first function does not belong to the multiple functions.
  • the multiple functions include one or more of the following functions: a vehicle control function, a navigation function, an audio function, and a video function.
  • the multiple functions preset by the first device may include, for example, multiple functions supported by the first device.
  • the first device may have a relatively higher semantic recognition capability for multiple functions preset by the first device.
  • the first device may have a relatively lower semantic recognition capability for other functions not preset by the first device.
  • the first device can determine whether the first device can relatively accurately recognize the current voice information according to the multiple functions and the first function preset by the first device, which is beneficial to take into account the accuracy and efficiency of voice recognition.
  • the first device is preset with multiple intentions, and the first semantic recognition result satisfies a first preset condition, including:
  • the first semantic recognition result indicates a first intent belonging to the plurality of intents.
  • the first device is preset with multiple intents, and that the first semantic recognition result does not satisfy the first preset condition includes: the first semantic recognition result indicates a first intent, and the first intent does not belong to the plurality of intents.
  • the multiple intents include one or more of the following intents: hardware enabling intent, path planning intent, playing audio intent, and playing video intent.
  • the multiple intents preset by the first device may include, for example, multiple intents supported by the first device.
  • the first device may have a relatively higher semantic recognition capability for the multiple intents preset by the first device.
  • the first device may have a relatively lower semantic recognition capability for other intentions not preset by the first device.
  • the first device can determine whether the first device can relatively accurately recognize the current voice information according to the multiple intents and the first intent preset by the first device, thereby facilitating both the accuracy and efficiency of the voice recognition.
  • the first device is preset with multiple parameters, and the first semantic recognition result satisfies a first preset condition, including:
  • the first semantic recognition result includes a first parameter, and the first parameter belongs to the plurality of parameters.
  • the first device is preset with multiple parameters, and that the first semantic recognition result does not meet the first preset condition includes: the first semantic recognition result indicates a first parameter, and the first parameter does not belong to the multiple parameters.
  • the multiple parameters preset by the first device may include, for example, multiple parameters supported by the first device.
  • the first device may have a relatively higher semantic recognition capability for the plurality of parameters preset by the first device.
  • the first device may have a relatively lower semantic recognition capability for other parameters not preset by the first device.
  • the first device can determine, according to the multiple parameters preset by the first device and the first parameter, whether it can relatively accurately recognize the current voice information, thereby facilitating both the accuracy and efficiency of voice recognition; a hypothetical membership check is sketched below.
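  • Hypothetically, the membership checks described in the preceding paragraphs could be combined as below; the preset sets and result fields are invented for the example.

```python
# Hypothetical membership checks for the first preset condition.
PRESET_INTENTS = {"enable_hardware", "plan_route", "play_audio", "play_video"}
PRESET_PARAMETERS = {"air_conditioner", "28°C", "1 hour", "song A"}

def result_within_presets(result):
    # The condition fails as soon as any indicated item is outside the presets.
    checks = (
        ("function", {"vehicle_control", "navigation", "audio", "video"}),
        ("intent", PRESET_INTENTS),
        ("parameter", PRESET_PARAMETERS),
    )
    return all(result[key] in preset for key, preset in checks if key in result)

print(result_within_presets({"intent": "plan_route", "parameter": "1 hour"}))  # True
print(result_within_presets({"intent": "order_food"}))                         # False
```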
  • the first device is preset with multiple intents corresponding to the first function, and that the first semantic recognition result satisfies the first preset condition further includes: the first semantic recognition result further indicates a first intent, and the first intent belongs to the plurality of intents.
  • the first device is preset with multiple intents corresponding to the first function, and that the first semantic recognition result does not meet the first preset condition further includes: the first semantic recognition result further indicates a first intent, and the first intent does not belong to the plurality of intents.
  • the intents corresponding to the first function are generally not unlimited. Establishing a correspondence between multiple functions and multiple intents helps the first device judge more accurately whether it can reliably recognize the current voice information, which is conducive to balancing the accuracy and efficiency of voice recognition.
  • the first device is preset with multiple parameters corresponding to the first function, and that the first semantic recognition result satisfies the first preset condition further includes: the first semantic recognition result further indicates a first parameter, and the first parameter belongs to the plurality of parameters.
  • the first device is preset with multiple parameters corresponding to the first function, and that the first semantic recognition result does not meet the first preset condition further includes: the first semantic recognition result further indicates a first parameter, and the first parameter does not belong to the plurality of parameters.
  • the multiple parameter types include one or more of the following parameter types: hardware identifier, time, temperature, location, singer, song, playlist, audio playback mode, movie, TV series, actor, and video playback mode.
  • the parameters corresponding to the first function are usually not unlimited. Establishing a correspondence between multiple functions and multiple parameters helps the first device judge more accurately whether it can reliably recognize the current voice information, which is conducive to balancing the accuracy and efficiency of voice recognition.
  • the first device is preset with multiple parameters corresponding to the first intent, and that the first semantic recognition result satisfies the first preset condition further includes: the first semantic recognition result further indicates a first parameter, and the first parameter belongs to the plurality of parameters.
  • the first device is preset with multiple parameters corresponding to the first intent, and that the first semantic recognition result does not meet the first preset condition further includes: the first semantic recognition result further indicates a first parameter, and the first parameter does not belong to the plurality of parameters.
  • the parameters corresponding to the first intent are usually not unlimited. Establishing a correspondence between multiple intents and multiple parameters helps the first device judge more accurately whether it can reliably recognize the current voice information, which is conducive to balancing the accuracy and efficiency of voice recognition.
  • the first semantic recognition result indicates a first function and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes: the first function indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result.
  • the first semantic recognition result indicates a first function and indicates a first parameter, and that the first semantic recognition result does not meet the first preset condition further includes: the first function indicated by the first semantic recognition result corresponds to a different parameter type from the first parameter indicated by the first semantic recognition result.
  • the parameter type corresponding to the vehicle control function may be time, temperature, hardware identification, etc.
  • the parameter type corresponding to the temperature control function may be temperature, etc.
  • the parameter type corresponding to the navigation function may be location, time, and the like.
  • the parameter type corresponding to the audio function may be singer, song, playlist, time, audio playback mode, and the like.
  • the parameter type corresponding to the video function may be a movie, a TV series, an actor, a time, a video playback mode, and the like.
  • the parameter types corresponding to air conditioners, cameras, seats, windows, etc. may be hardware identifiers.
  • the parameter type corresponding to 5°C, 28°C, etc. may be temperature.
  • the parameter type corresponding to 1 hour, 1 minute, etc. may be time.
  • the parameter types corresponding to position A, position B, etc. may be positions.
  • the parameter types corresponding to singer A, singer B, etc. may be singers.
  • the parameter types corresponding to song A, song B, etc. may be songs.
  • the parameter types corresponding to playlist A, playlist B, etc. may be playlists.
  • the parameter type corresponding to standard playback, high-quality playback, lossless playback, etc. may be an audio playback mode.
  • the parameter types corresponding to movie A, movie B, etc. may be movies.
  • the parameter types corresponding to TV series A and TV series B may be TV series.
  • the parameter types corresponding to actor A, actor B, etc. may be actors.
  • the parameter type corresponding to standard-definition playback, high-definition playback, ultra-high-definition playback, Blu-ray playback, etc. may be a video playback mode.
  • if the first device has obtained in advance that the first parameter corresponds to a first parameter type and the first function also corresponds to the first parameter type, the accuracy of the first device's recognition of the first voice information may be relatively high. If the first device has obtained in advance that the first parameter corresponds to the first parameter type but the first function does not correspond to the first parameter type, the accuracy of the first device's analysis of the first voice information may be relatively low. Establishing the relationship between multiple parameters and multiple functions through parameter types helps the first device judge more accurately whether it can reliably recognize the current voice information, which is conducive to balancing the accuracy and efficiency of voice recognition. A hypothetical table-driven check of this kind is sketched below.
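  • A table-driven version of this parameter-type check could look as follows, under the assumption that the tables simply transcribe the examples listed above (they are not exhaustive, and all names are invented for this sketch).

```python
# Hypothetical parameter-type tables transcribing the examples above.
FUNCTION_PARAM_TYPES = {
    "vehicle_control": {"time", "temperature", "hardware_id"},
    "navigation": {"location", "time"},
    "audio": {"singer", "song", "playlist", "time", "audio_playback_mode"},
    "video": {"movie", "tv_series", "actor", "time", "video_playback_mode"},
}
PARAM_TYPE_OF = {
    "air_conditioner": "hardware_id",
    "28°C": "temperature",
    "1 hour": "time",
    "position A": "location",
    "song A": "song",
}

def function_matches_parameter(function, parameter):
    # True when the indicated function and parameter correspond to the same
    # parameter type, i.e. the extra check the first preset condition adds here.
    return PARAM_TYPE_OF.get(parameter) in FUNCTION_PARAM_TYPES.get(function, set())

print(function_matches_parameter("navigation", "position A"))  # True
print(function_matches_parameter("navigation", "song A"))      # False
```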
  • the determining of the first semantic recognition result according to the first voice information includes: determining a second semantic recognition result according to the first voice information, the second semantic recognition result indicating a second function and indicating the first parameter; and, in a case where the multiple functions preset by the first device do not include the second function and the multiple parameters preset by the first device include the first parameter, correcting the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, where the first function and the second function are two different functions of the same type.
  • the first function may be a local translation function
  • the second function may be a cloud translation function. Both the first function and the second function may belong to translation type functions.
  • the first function may be a local navigation function
  • the second function may be a cloud navigation function. Both the first function and the second function may be functions of the navigation type.
  • the first function may be a local audio function
  • the second function may be a cloud audio function. Both the first function and the second function may belong to audio playback type functions.
  • the first function may be a local video function
  • the second function may be a cloud video function. Both the first function and the second function may belong to video playback type functions.
  • the second function does not belong to the plurality of functions preset by the first device, which means that the first device may have relatively weak semantic recognition ability for the second function.
  • the first device may learn the voice instructions related to the second function many times, thereby gradually improving its semantic recognition ability for the second function. That is to say, by correcting the second function to the first function, the first device can apply learned skills in a relatively unfamiliar field, which helps increase the applicable scenarios for end-side decision-making and thereby improves the efficiency of speech recognition. A hypothetical correction mapping is sketched below.
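  • The following sketch shows one hypothetical way to implement this correction; the mapping between same-type cloud and local functions is invented for the example.

```python
# Hypothetical mapping from non-preset functions to same-type preset functions.
SAME_TYPE_LOCAL_FUNCTION = {
    "cloud_translation": "local_translation",
    "cloud_navigation": "local_navigation",
    "cloud_audio": "local_audio",
    "cloud_video": "local_video",
}

def correct_second_result(second_result, preset_functions, preset_parameters):
    # Correct the second function to the first function only when the function
    # is outside the presets but the indicated parameter is inside them.
    func = second_result["function"]
    param = second_result["parameter"]
    if func not in preset_functions and param in preset_parameters:
        corrected = SAME_TYPE_LOCAL_FUNCTION.get(func)
        if corrected is not None:
            return {**second_result, "function": corrected}
    return second_result
```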
  • the first semantic recognition result indicates a first intent and indicates a first parameter
  • the first semantic recognition result satisfies a first preset condition, further comprising:
  • the first intent indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result.
  • the first semantic recognition result indicates a first intent and indicates a first parameter
  • the first semantic recognition result does not meet the first preset condition, and further includes:
  • the first intent corresponds to a different parameter type from the first parameter indicated by the first semantic recognition result.
  • the parameter type corresponding to the hardware startup intent may be time, temperature, hardware identification, and the like.
  • the parameter type corresponding to the route planning intent may be location, time, and the like.
  • the parameter type corresponding to the intent to play audio may be singer, song, playlist, time, audio playback mode, and the like.
  • the parameter type corresponding to the video playback intent may be a movie, a TV series, an actor, a time, a video playback mode, and the like.
  • the parameter types corresponding to air conditioners, cameras, seats, windows, etc. may be hardware identifiers.
  • the parameter type corresponding to 5°C, 28°C, etc. may be temperature.
  • the parameter type corresponding to 1 hour, 1 minute, etc. may be time.
  • the parameter types corresponding to position A, position B, etc. may be positions.
  • the parameter types corresponding to singer A, singer B, etc. may be singers.
  • the parameter types corresponding to song A, song B, etc. may be songs.
  • the parameter types corresponding to playlist A, playlist B, etc. may be playlists.
  • the parameter type corresponding to standard playback, high-quality playback, lossless playback, etc. may be an audio playback mode.
  • the parameter types corresponding to movie A, movie B, etc. may be movies.
  • the parameter types corresponding to TV series A and TV series B may be TV series.
  • the parameter types corresponding to actor A, actor B, etc. may be actors.
  • the parameter type corresponding to standard-definition playback, high-definition playback, ultra-high-definition playback, Blu-ray playback, etc. may be a video playback mode.
  • if the first device has obtained in advance that the first parameter corresponds to a first parameter type and the first intent also corresponds to the first parameter type, the accuracy of the first device's analysis of the first voice information may be relatively high. If the first device has obtained in advance that the first parameter corresponds to the first parameter type but the first intent does not correspond to the first parameter type, the accuracy of the first device's analysis of the first voice information may be relatively low. Establishing the relationship between multiple parameters and multiple intents through parameter types helps the first device judge more accurately whether it can reliably recognize the current voice information, which is conducive to balancing the accuracy and efficiency of voice recognition.
  • parameter types can be deduced from one another among multiple intents. This advantageously reduces the complexity of establishing relationships between parameters and intents.
  • the determining of the first semantic recognition result according to the first voice information includes: determining a third semantic recognition result according to the first voice information, the third semantic recognition result indicating a second intent and indicating the first parameter; and, in a case where the multiple intents preset by the first device do not include the second intent and the multiple parameters preset by the first device include the first parameter, correcting the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, where the first intent and the second intent are two different intents of the same type.
  • the first intent may be a local English-translation intent
  • the second intent may be a cloud English-translation intent. Both the first intent and the second intent may belong to English-translation type intents.
  • the first intent may be a local path planning intent
  • the second intent may be a cloud path planning intent. Both the first intent and the second intent may belong to a path planning type of intent.
  • the first intent may be an intent to play local audio
  • the second intent may be an intent to play cloud audio. Both the first intent and the second intent may belong to the play audio type intent.
  • the first intent may be an intent to play a local video
  • the second intent may be an intent to play a cloud video. Both the first intent and the second intent may belong to the intent of playing a video type.
  • the second intent does not belong to the multiple intents preset by the first device, which means that the first device may have relatively weak semantic recognition capability for the second intent.
  • the first device may learn the voice instructions related to the second intent multiple times, thereby gradually improving its semantic recognition capability for the second intent. That is to say, by correcting the second intent to the first intent, the first device can apply learned skills in a relatively unfamiliar field, which helps increase the applicable scenarios of end-side decision-making and thereby improves the efficiency of speech recognition.
  • the first semantic recognition result satisfies the first preset condition, including:
  • the first semantic recognition result includes a first indicator bit, and the first indicator bit indicates that the first semantic recognition result satisfies the first preset condition.
  • the first semantic recognition result does not meet the first preset condition, including: the first semantic recognition result includes a second indicator bit, and the second indicator bit indicates that the first semantic recognition result does not meet the first preset condition.
  • the determining of the first semantic recognition result according to the first voice information includes: determining a fourth semantic recognition result according to the first voice information, where the fourth semantic recognition result includes a first function and a first parameter; and, in a case where the first function belongs to the multiple functions preset by the first device, the first parameter belongs to the multiple parameters preset by the first device, and the first function and the first parameter correspond to the same parameter type, determining the first semantic recognition result, where the first semantic recognition result includes the first indicator bit.
  • the first device may have relatively weak semantic recognition capabilities for the first function.
  • the first device may learn the voice instructions related to the first function multiple times, thereby gradually improving its semantic recognition ability for the first function. That is to say, by carrying the first indicator bit in the semantic recognition result, the first device can apply learned skills in a relatively unfamiliar field, which helps increase the applicable scenarios of end-side decision-making and thereby improves the efficiency of speech recognition. A hypothetical construction of such a result is sketched below.
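  • A hypothetical construction of a semantic recognition result carrying the first indicator bit is sketched below; it reuses the invented FUNCTION_PARAM_TYPES and PARAM_TYPE_OF tables from the earlier sketch.

```python
def build_first_result(fourth_result, preset_functions, preset_parameters):
    # Reuses FUNCTION_PARAM_TYPES / PARAM_TYPE_OF from the earlier sketch.
    func = fourth_result["function"]
    param = fourth_result["parameter"]
    ok = (func in preset_functions
          and param in preset_parameters
          and PARAM_TYPE_OF.get(param) in FUNCTION_PARAM_TYPES.get(func, set()))
    # The first indicator bit records that the first preset condition is met,
    # so later stages can read the bit instead of re-deriving the checks.
    return {**fourth_result, "indicator_bit": 1 if ok else 0}
```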
  • the determining of the first semantic recognition result according to the first voice information includes: determining a fifth semantic recognition result according to the first voice information, where the fifth semantic recognition result includes a first intent and a first parameter; and, in a case where the first intent belongs to the multiple intents preset by the first device, the first parameter belongs to the multiple parameters preset by the first device, and the first intent and the first parameter correspond to the same parameter type, determining the first semantic recognition result, where the first semantic recognition result includes the first indicator bit.
  • the first device may have relatively weak semantic recognition capability for the first intent.
  • the first device may learn the voice instructions related to the first intent multiple times, thereby gradually improving its semantic recognition capability for the first intent. That is to say, by carrying the first indicator bit in the semantic recognition result, the first device can apply learned skills in a relatively unfamiliar field, which helps increase the applicable scenarios of end-side decision-making and thereby improves the efficiency of speech recognition.
  • the method further includes:
  • the sixth semantic recognition result from the second device is discarded.
  • the determining to execute the second operation instructed by the second device according to the first semantic recognition result and the first preset condition includes: in a case where the first semantic recognition result does not satisfy the first preset condition, determining to execute the second operation instructed by the second device.
  • the first preset condition is helpful for the first device to judge whether the first device can relatively accurately recognize the current voice information, and further facilitates taking into account the accuracy and efficiency of the voice recognition.
  • the method further includes:
  • the first device can learn new parameters. This helps increase the applicable scenarios for end-side decision-making, which in turn helps improve the efficiency of speech recognition. A hypothetical store for such learned associations is sketched below.
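  • A minimal sketch of a store for such learned associations, assuming the device records a parameter and its parameter type (e.g. after feedback from the second device); class and method names are invented.

```python
class LearnedParameterStore:
    """Hypothetical store for parameter/parameter-type associations that the
    first device learns, e.g. from feedback provided by the second device."""

    def __init__(self):
        self._type_of = {}

    def learn(self, parameter, parameter_type):
        # Record the association so later utterances containing this parameter
        # can be handled by end-side decision-making.
        self._type_of[parameter] = parameter_type

    def type_of(self, parameter):
        return self._type_of.get(parameter)

store = LearnedParameterStore()
store.learn("song B", "song")   # e.g. learned after the cloud resolves "song B"
print(store.type_of("song B"))  # "song"
```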
  • the method further includes: acquiring second voice information; determining a seventh semantic recognition result according to the second voice information; and, according to the seventh semantic recognition result and a second preset condition, determining to perform the operation indicated by the first voice information, or determining to perform a third operation determined by the first device according to the seventh semantic recognition result, or determining to perform a fourth operation indicated by the second device.
  • the user and the first device can conduct voice dialogue with respect to a special scenario or a special field.
  • the user may not be able to completely achieve the purpose of voice control through one voice command.
  • two adjacent rounds of voice interaction are usually related.
  • user responses may be somewhat random.
  • the user's reply may not be related to the first device's query or the voice information to be acquired. If the first device completely follows the user's reply, the content of the previous voice interaction may be invalidated, which may increase the number of manipulations the user must perform on the first device. If the first device completely ignores the user's reply, the first device may fail to respond to the user's instruction in some special scenarios, so that the user's voice instruction becomes ineffective.
  • the second preset condition is used to indicate whether the first device ends the multiple rounds of voice interaction, which is helpful for the first device to relatively appropriately select whether to jump out of the multiple rounds of voice interaction.
  • the determining, according to the seventh semantic recognition result and the second preset condition, to execute the operation indicated by the first voice information, or to execute the third operation determined by the first device according to the seventh semantic recognition result, or to perform the fourth operation instructed by the second device, includes:
  • in a case where the seventh semantic recognition result does not satisfy the second preset condition, determining to execute the operation corresponding to the first semantic recognition result.
  • the new end-side voice interaction may end the previous round of end-side voice interaction.
  • the new terminal-side voice interaction may end the previous round of cloud-side voice interaction.
  • the new cloud-side voice interaction may end the previous round of terminal-side voice interaction.
  • the new cloud-side voice interaction may end the previous round of cloud-side voice interaction.
  • the first device can comprehensively judge the first preset condition and the second preset condition, which helps the first device relatively appropriately decide whether to jump out of multiple rounds of voice interaction, and is conducive to balancing the accuracy and efficiency of voice recognition. A hypothetical sketch of this decision follows.
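  • Purely as an illustration, the jump-out decision could be organized as below; satisfies_second_preset_condition is a stand-in for the second preset condition (a hypothetical implementation is sketched after the priority list that follows), and local_nlu is an assumed recognizer object.

```python
def handle_second_voice_information(second_info, first_result, local_nlu):
    # Seventh semantic recognition result for the new (second) utterance.
    seventh_result = local_nlu.recognize(second_info)
    if satisfies_second_preset_condition(seventh_result, first_result):
        # Jump out of the current multi-round interaction and perform the
        # third operation determined from the seventh result.
        return ("jump_out", seventh_result)
    # Otherwise continue the interaction started by the first utterance.
    return ("continue", first_result)
```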
  • the seventh semantic recognition result satisfies the second preset condition, including:
  • the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result.
  • the user can end the current multiple rounds of voice interaction through a high-priority voice command.
  • the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result
  • the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result
  • the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • Functions, intentions, and parameters can better reflect the current voice interaction scene.
  • the priority of the function, the priority of the intent, and the priority of the parameter help the first device determine relatively accurately whether to jump out of the current multi-round voice interaction; a hypothetical comparison is sketched below.
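  • Under the assumption that priorities are small integer ranks (the patent does not specify a representation), a hypothetical priority comparison might look like this; the tables are invented for the example.

```python
# Hypothetical priority tables; larger numbers mean higher priority.
PRIORITY = {
    "function": {"vehicle_control": 3, "navigation": 2, "audio": 1, "video": 1},
    "intent": {"enable_hardware": 3, "plan_route": 2, "play_audio": 1},
    "parameter": {},  # parameter priorities could be filled in the same way
}

def satisfies_second_preset_condition(new_result, old_result):
    # The new result outranks the old one if any of its indicated
    # function/intent/parameter carries a strictly higher priority.
    return any(
        PRIORITY[k].get(new_result.get(k), 0) > PRIORITY[k].get(old_result.get(k), 0)
        for k in ("function", "intent", "parameter")
    )
```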
  • the determining to execute the second operation instructed by the second device according to the first semantic recognition result and the first preset condition includes:
  • the seventh semantic recognition result satisfies the second preset condition, including:
  • the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • the sixth semantic recognition result is indicated by the second device, and the sixth semantic recognition result may have relatively higher accuracy.
  • By comparing the priority of the seventh semantic recognition result with the priority of the sixth semantic recognition result, it is easier for the first device to relatively appropriately decide whether to jump out of multiple rounds of voice interaction.
  • the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result
  • the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result
  • the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • Functions, intentions, and parameters can better reflect the current voice interaction scene.
  • the priority of the function, the priority of the intent, and the priority of the parameter help the first device determine relatively accurately whether to jump out of the current multi-round voice interaction.
  • the method further includes:
  • the determining, according to the seventh semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or to perform the fourth operation indicated by the second device, includes:
  • acquiring an eighth semantic recognition result from the second device, and, according to the eighth semantic recognition result and the second preset condition, determining to execute the operation indicated by the first voice information, or determining to execute the fourth operation.
  • the seventh semantic recognition result recognized by the first device may be relatively inaccurate, and the eighth semantic recognition result recognized by the second device may be relatively accurate.
  • the first device can determine whether the eighth semantic recognition result satisfies the second preset condition, thereby helping to relatively accurately determine whether to end the current multiple rounds of voice interaction.
  • the first device may not determine the operation to be performed according to the seventh semantic recognition result.
  • the first semantic recognition result does not satisfy the first preset condition, it may mean that the current multiple rounds of voice interaction belong to cloud-side voice interaction.
  • the first device may acquire the eighth semantic recognition result from the second device to continue the current cloud-side multi-round voice interaction. This is conducive to maintaining multiple rounds of voice interaction on the cloud side.
  • the determining, according to the eighth semantic recognition result and the second preset condition, to execute the operation indicated by the first voice information or to execute the fourth operation includes:
  • in a case where the eighth semantic recognition result does not satisfy the second preset condition, determining to execute the operation indicated by the first voice information.
  • the eighth semantic recognition result is indicated by the second device, and the eighth semantic recognition result may have relatively higher accuracy. Judging by the priority of the eighth semantic recognition result makes it easier for the first device to relatively appropriately decide whether to jump out of multiple rounds of voice interaction.
  • the eighth semantic recognition result satisfies the second preset condition, including:
  • the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result.
  • the user can end the current multiple rounds of voice interaction through a high-priority voice command.
  • the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result
  • the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result
  • the priority of the fifth parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • Functions, intentions, and parameters can better reflect the current voice interaction scene.
  • the priority of the function, the priority of the intent, and the priority of the parameter help the first device determine relatively accurately whether to jump out of the current multi-round voice interaction.
  • the determining to execute the second operation instructed by the second device according to the first semantic recognition result and the first preset condition includes:
  • the eighth semantic recognition result satisfies the second preset condition, including:
  • the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • Both the sixth semantic recognition result and the eighth semantic recognition result are indicated by the second device, and both the sixth semantic recognition result and the eighth semantic recognition result may have relatively higher accuracy.
  • By comparing the priority of the eighth semantic recognition result with the priority of the sixth semantic recognition result, it is easier for the first device to relatively appropriately decide whether to jump out of multiple rounds of voice interaction.
  • the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result
  • the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result
  • the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • Functions, intentions, and parameters can better reflect the current voice interaction scene.
  • the priority of the function, the priority of the intent, and the priority of the parameter help the first device determine relatively accurately whether to jump out of the current multi-round voice interaction.
  • the second voice information is irrelevant to the operation indicated by the first voice information.
  • the correlation between the second voice information and the operation indicated by the first voice information is lower than the second preset threshold.
  • the first voice information is unrelated to the second voice information.
  • the degree of correlation between the first voice information and the second voice information is lower than the second preset threshold.
  • one or more of the functions, intents, and parameters indicated by the first voice information and the second voice information are different. A hypothetical relevance test is sketched below.
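  • The patent does not prescribe a similarity measure; as one hypothetical realization, embeddings of the two utterances could be compared by cosine similarity against the preset threshold. The function name and default threshold are invented for this sketch.

```python
import math

def is_unrelated(first_vec, second_vec, preset_threshold=0.3):
    # Hypothetical: embeddings of the two utterances are compared by cosine
    # similarity; below the preset threshold they are treated as unrelated.
    dot = sum(a * b for a, b in zip(first_vec, second_vec))
    norm1 = math.sqrt(sum(a * a for a in first_vec))
    norm2 = math.sqrt(sum(b * b for b in second_vec))
    if norm1 == 0.0 or norm2 == 0.0:
        return True
    return dot / (norm1 * norm2) < preset_threshold
```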
  • the method further includes:
  • in response to the user's input operation, a voice wake-up operation is performed.
  • a device for voice interaction including:
  • an acquisition unit for acquiring the first voice information from the voice sensor
  • a processing unit configured to determine a first semantic recognition result according to the first voice information
  • the processing unit is further configured to, according to the first semantic recognition result and the first preset condition, determine to execute the first operation determined by the first device according to the first semantic recognition result, or determine to execute the second operation indicated by the second device. A purely illustrative mirror of this unit structure is sketched below.
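  • The unit structure just described might be mirrored, purely illustratively, as follows; the class and method names (acquire, send, decide) are assumptions, not the patent's API.

```python
class VoiceInteractionApparatus:
    """Hypothetical mirror of the described apparatus: an acquisition unit,
    a processing unit and, optionally, a sending unit."""

    def __init__(self, acquisition_unit, processing_unit, sending_unit=None):
        self.acquisition_unit = acquisition_unit  # reads the voice sensor
        self.processing_unit = processing_unit    # recognition and decisions
        self.sending_unit = sending_unit          # forwards voice info to the second device

    def step(self):
        voice_info = self.acquisition_unit.acquire()
        if self.sending_unit is not None:
            # Cloud processing can run in parallel with local processing.
            self.sending_unit.send(voice_info)
        return self.processing_unit.decide(voice_info)
```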
  • the processing unit is specifically configured to:
  • the first device is preset with multiple functions, and the first semantic recognition result satisfies a first preset condition, including:
  • the first semantic recognition result indicates a first function, and the first function belongs to the plurality of functions.
  • the first device is preset with multiple intentions, and the first semantic recognition result satisfies a first preset condition, including:
  • the first semantic recognition result indicates a first intent belonging to the plurality of intents.
  • the first device is preset with multiple parameters, and the first semantic recognition result satisfies the first preset condition, including:
  • the first semantic recognition result indicates a first parameter, and the first parameter belongs to the plurality of parameters.
  • the first semantic recognition result indicates a first function and indicates a first parameter
  • the first semantic recognition result satisfies a first preset condition
  • the first function indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result.
  • the processing unit is specifically configured to:
  • the second semantic recognition result indicating a second function and indicating the first parameter
  • in a case where the multiple functions preset by the first device do not include the second function and the multiple parameters preset by the first device include the first parameter, correcting the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, where the first function and the second function are two different functions of the same type.
  • the first semantic recognition result indicates a first intent and indicates a first parameter
  • the first semantic recognition result satisfies a first preset condition
  • the first intent indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result.
  • the processing unit is specifically configured to:
  • in a case where the multiple intents preset by the first device do not include the second intent and the multiple parameters preset by the first device include the first parameter, correcting the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, where the first intent and the second intent are two different intents of the same type.
  • the first semantic recognition result satisfies the first preset condition, including:
  • the first semantic recognition result includes a first indicator bit, and the first indicator bit indicates that the first semantic recognition result satisfies the first preset condition.
  • the processing unit is specifically used for:
  • the fourth semantic recognition result includes a first function and a first parameter
  • the first function belongs to a plurality of functions preset by the first device
  • the first parameter belongs to a plurality of parameters preset by the first device
  • in a case where the first function and the first parameter correspond to the same parameter type, the first semantic recognition result is determined, and the first semantic recognition result includes the first indicator bit.
  • the processing unit is specifically used for:
  • the fifth semantic recognition result includes a first intent and a first parameter
  • the first intent belongs to a plurality of intents preset by the first device
  • the first parameter belongs to a plurality of parameters preset by the first device
  • in a case where the first intent and the first parameter correspond to the same parameter type, the first semantic recognition result is determined, and the first semantic recognition result includes the first indicator bit.
  • the apparatus further includes:
  • a sending unit configured to send the first voice information to the second device
  • the processing unit is further configured to discard the sixth semantic recognition result from the second device.
  • the processing unit is specifically configured to:
  • the processing unit is further configured to:
  • the apparatus further includes a storage unit configured to store the association relationship between the second parameter and the second parameter type.
  • the obtaining unit is further configured to obtain second voice information from the voice sensor
  • the processing unit is further configured to, according to the second voice information, determine a seventh semantic recognition result
  • the processing unit is further configured to, according to the seventh semantic recognition result and the second preset condition, determine to execute the operation indicated by the first voice information, or determine to execute the third operation determined by the first device according to the seventh semantic recognition result, or determine to execute the fourth operation instructed by the second device.
  • the processing unit is specifically configured to:
  • the seventh semantic recognition result does not satisfy the second preset condition, it is determined to execute the operation corresponding to the first semantic recognition result.
  • the seventh semantic recognition result satisfies the second preset condition, including:
  • the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result.
  • the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result
  • the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result
  • the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • the processing unit is specifically configured to:
  • the seventh semantic recognition result satisfies the second preset condition, including:
  • the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result
  • the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result
  • the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • the apparatus further includes:
  • a sending unit configured to send second voice information to the second device
  • the processing unit is specifically used for:
  • according to the eighth semantic recognition result and the second preset condition, determine to execute the operation indicated by the first voice information, or determine to execute the fourth operation.
  • the processing unit is specifically configured to:
  • the eighth semantic recognition result does not satisfy the second preset condition, it is determined to execute the operation indicated by the first voice information.
  • the eighth semantic recognition result satisfies the second preset condition, including:
  • the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result.
  • the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result
  • the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result
  • the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • the processing unit is specifically configured to:
  • the eighth semantic recognition result satisfies the second preset condition, including:
  • the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result, including one or more of the following:
  • the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result
  • the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result
  • the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • the second voice information is irrelevant to the operation indicated by the first voice information.
  • the apparatus further includes:
  • the wake-up module is used to perform a voice wake-up operation in response to the user's input operation.
  • an apparatus for voice interaction includes a processor and a memory, where the processor is coupled with the memory, the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so as to cause the apparatus to perform the method described in any one of the possible implementations of the first aspect above.
  • a computer-readable medium stores program code for execution by a device, where the program code includes instructions for performing the method described in any one of the implementation manners of the first aspect above.
  • a computer program product containing instructions is provided; when the computer program product runs on a computer, it causes the computer to execute the method described in any one of the implementation manners of the first aspect above.
  • in a sixth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory, to execute the method described in any one of the implementations of the first aspect above.
  • the chip may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method described in any one of the implementation manners of the first aspect.
  • in a seventh aspect, a voice interaction system is provided, including the first device described in any possible implementation manner of the first aspect and the second device described in any possible implementation manner of the first aspect, where the first device is configured to execute the method described in any possible implementation manner of the first aspect.
  • the solution provided by the embodiment of the present application can facilitate the first device to determine whether it has the ability to independently recognize the user's voice command.
  • the first device can independently determine the operation corresponding to the user's voice command, thereby helping to reduce the response delay for the first device to execute the user's voice command and improve response efficiency.
  • the first device may choose to perform an operation instructed by other devices, which is beneficial to improve the accuracy of the first device in responding to the user's voice command.
  • the voice command collected by the sensor is processed by the local processor and is also sent to the cloud for processing, and the device adaptively chooses to execute the operation fed back by the local processor or by the cloud, which can balance response efficiency and response accuracy.
  • the solutions provided by the embodiments of the present application can enable the first device to continuously learn new voice commands, so as to broaden the voice interaction scenarios that the first device is relatively good at.
  • the solution provided by the present application is conducive to enabling the first device to appropriately choose whether to jump out of multiple rounds of voice interaction, thereby improving the effect of voice interaction.
  • FIG. 1 is a schematic diagram of a voice interaction system.
  • Figure 2 is a schematic diagram of a system architecture.
  • FIG. 3 is a schematic diagram of a voice interaction system.
  • FIG. 4 is a schematic diagram of a voice interaction system.
  • FIG. 5 is a schematic diagram of a voice interaction system provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an apparatus for voice interaction provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of an apparatus for voice interaction provided by an embodiment of the present application.
• Application Scenario 1: Intelligent Driving
  • the user can control the intelligent driving device through voice.
  • the user can issue voice commands to the voice assistant in the car to control the smart driving device.
• through voice, the user can adjust the inclination of the seat back, adjust the temperature of the air conditioner, turn the seat heating on or off, turn the lights on or off, open or close the windows, open or close the trunk, plan navigation routes, play personalized playlists, and so on.
  • voice interaction is conducive to providing users with a convenient driving environment.
• Application Scenario 2: Smart Home
  • users can control smart home devices through voice.
  • a user can issue a voice command to an IoT device (eg, a smart home device) or an IoT control device (eg, a mobile phone, etc.) to control the IoT device.
• through voice, the user can control the temperature of the smart air conditioner, control the smart TV to play the TV series specified by the user, control the smart cooking device to start at the time specified by the user, control the smart curtain to open or close, control the smart light fixture to adjust the color temperature, and so on.
  • voice interaction is beneficial to provide users with a comfortable home environment.
  • FIG. 1 is a schematic diagram of a voice interaction system 100 .
  • the execution device 110 may be a device with speech recognition capability, natural language understanding capability, and the like.
  • the execution device 110 may be, for example, a server.
• the execution device 110 may also cooperate with other computing devices, such as data storage devices, routers, and load balancers.
  • the execution device 110 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 110 may use the data in the data storage system 150 or invoke the program code in the data storage system 150 to implement at least one of functions such as speech recognition, machine learning, deep learning, and model training.
  • the data storage system 150 in FIG. 1 can be integrated on the execution device 110, and can also be set on the cloud or other network servers.
  • a user may operate respective local devices (eg, local device 101 and local device 102 ) to interact with execution device 110 .
  • the local device shown in FIG. 1 may, for example, represent various types of voice interaction terminals.
  • the user's local device can interact with the execution device 110 through a wired or wireless communication network.
  • the format or standard of the communication network is not limited, and can be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
  • the local device 101 may provide the execution device 110 with local data or feedback calculation results.
  • all or part of the functionality of the execution device 110 may be implemented by a local device.
• the local device 101 may implement the functions of the execution device 110 and provide services for its own users, or provide services for the users of the local device 102.
  • FIG. 2 is a schematic diagram of a system architecture 200 .
  • Data collection device 260 may be used to collect training data. Data collection device 260 may also be used to store training data in database 230 .
  • the training device 220 can obtain the target model/rule 201 by training based on the training data maintained in the database 230 .
  • the target model/rule 201 trained here can be used to execute the voice interaction method of the embodiment of the present application.
  • the training device 220 does not necessarily perform the training of the target model/rule 201 entirely based on the training data maintained by the database 230, and may also obtain training data from the cloud or other places for model training.
  • the above description should not be taken as a limitation on the embodiments of the present application.
  • the training data maintained in the database 230 are not necessarily all collected by the data collection device 260, and may also be received from other devices.
  • the training data in database 230 may be obtained by client device 240 , or may be obtained by execution device 210 .
  • Client device 240 may include, for example, various types of voice interactive terminals.
• the execution device 210 may be a device with speech recognition capability, natural language understanding capability, and the like. For example, voice information may be obtained through the data collection device 260 and processed to obtain training data such as the text features of the input text and the phonetic features of the target voice; such text features and phonetic features may also be obtained directly through the data collection device 260.
  • speech information can be directly used as training data.
  • the same account may be logged on multiple client devices 240 , and the data collected by the multiple client devices 240 may be maintained in the database 230 .
  • the above-mentioned training data may include, for example, one or more of data such as speech, corpus, and hot words.
  • Speech can refer to sounds that are loaded with a certain linguistic meaning.
• a corpus is language material; it can refer to text, together with its context, that describes language as it is used in the real world.
• hot words are trending terms; they can be a lexical phenomenon and can reflect the issues, topics, things, etc. that some people are relatively concerned about in a given period.
  • the above-mentioned training data may include, for example, input speech (for example, the input speech may be from a user, or may be a speech acquired by other devices).
  • the above-mentioned training data may include, for example, a feature vector of the input speech (such as phonetic symbol features, for example, the phonetic symbol features may reflect the phonetic symbols of the input speech).
  • the feature vector of the input speech can be obtained by performing feature extraction on the input speech.
  • the above-mentioned training data may include, for example, target text corresponding to the input speech, and the like.
  • the above-mentioned training data may include, for example, text features of the target text corresponding to the input speech.
  • the target text can be obtained by feature preprocessing on the input speech.
  • the text features of the target text can be obtained by performing feature extraction on the target text.
• the target text corresponding to "nǐ hǎo" can be "Hello".
• Feature extraction of "nǐ hǎo" can obtain the phonetic features of the input speech.
• Feature extraction of the target text "Hello" can obtain its text features.
  • the input speech can be sent by the client device 240 to the data acquisition device 260, or read from the storage device by the data acquisition device 260, and can also be acquired by real-time acquisition.
  • the data collection device 260 may determine the training data from the above phonetic features and/or text features.
  • the above feature preprocessing on the input speech may include normalization, word-to-speech conversion, prosodic pause prediction and other processes.
  • Normalization can refer to converting non-Chinese characters such as numbers and symbols in the text into Chinese characters according to semantics.
• phonetic-word conversion may refer to predicting the corresponding pinyin for each piece of speech and then generating the corresponding Chinese character text sequence.
  • Prosodic pause prediction may refer to predicting accent markers, prosodic phrases, intonation phrase markers, and the like.
  • Feature preprocessing can be performed by data acquisition device 260, or by client device 240 or other devices.
  • the data collection device 260 may perform feature preprocessing and feature extraction on the input speech to obtain text features of the target text.
  • the client device 240 may perform feature preprocessing on the input speech to obtain target text; the data collection device 260 may perform feature extraction on the target text.
• suppose the pronunciation of the input speech is "nǐmenhǎo"; the following phonetic features can be generated:
• "S" can be a sentence-start marker, that is, the beginning of the sentence;
• "E" can be a sentence-end marker, that is, the end of the sentence;
• the numbers "0", "1", "2", "3", and "4" can be tone marks;
• "SP0" and "SP1" can be marks for different pause levels;
• the initials and finals of Hanyu Pinyin can be used as phonemes, and different phonemes/marks can be separated by "_".
• suppose the target text corresponding to another input speech is "Hello everyone"; the following text features can be generated:
• "S" can be a sentence-start marker;
• "E" can be a sentence-end marker;
• the numbers "0", "1", "3", and "4" can be tone marks;
• "SP0" and "SP1" can be marks for different pause levels;
• the initials and finals of Hanyu Pinyin can be used as phonemes, and different phonemes/marks can be separated by "_".
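• to make the marker format just described concrete, the following minimal Python sketch assembles such a phonetic feature string from tone-annotated pinyin syllables; the function name, the syllable-to-phoneme table, and the exact ordering of tone and pause marks are illustrative assumptions, since the description above only specifies the markers themselves.

```python
# Minimal sketch: build a phonetic feature string using the markers described
# above ("S" = sentence start, "E" = sentence end, digits 0-4 = tone marks,
# "SP0"/"SP1" = pause levels, pinyin initials/finals = phonemes, "_" as the
# separator). The phoneme table and token ordering are illustrative assumptions.

SYLLABLE_TO_PHONEMES = {  # hypothetical initial/final split for this example
    "ni": ["n", "i"],
    "men": ["m", "en"],
    "hao": ["h", "ao"],
}

def build_phonetic_features(syllables, pauses=None):
    """syllables: list of (pinyin, tone) pairs, e.g. [("ni", 3), ("hao", 3)].
    pauses: optional dict mapping a syllable index to a pause mark inserted
    after that syllable, e.g. {1: "SP0"}."""
    pauses = pauses or {}
    tokens = ["S"]  # sentence-start marker
    for i, (pinyin, tone) in enumerate(syllables):
        tokens.extend(SYLLABLE_TO_PHONEMES[pinyin])  # phonemes of the syllable
        tokens.append(str(tone))                     # tone mark
        if i in pauses:
            tokens.append(pauses[i])                 # pause-level mark
    tokens.append("E")  # sentence-end marker
    return "_".join(tokens)

# "nǐmenhǎo" with tones 3, 0 (neutral), 3 and a short pause after "men":
print(build_phonetic_features([("ni", 3), ("men", 0), ("hao", 3)], {1: "SP0"}))
# -> S_n_i_3_m_en_0_SP0_h_ao_3_E
```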
  • the training device 220 may input the acquired training data into the target model/rule 201 .
• the phonetic feature result output by the target model/rule 201 can be compared with the phonetic features corresponding to the current input voice, or the text feature result output by the target model/rule 201 can be compared with the text features corresponding to the current input voice, thereby completing the training of the target model/rule 201.
• the target model/rule 201 obtained by training with the training device 220 may be a model constructed based on a neural network.
• the neural network here may be a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a bidirectional long short-term memory network (BLSTM), a deep convolutional neural network (DCNN), and so on.
  • the target model/rule 201 may be implemented based on a self-attention neural network.
  • the type of the target model/rule 201 may, for example, belong to an automatic speech recognition (automatic speech recognition, ASR) model, a natural language processing (natural language processing, NLP) model, and the like.
  • the target model/rule 201 obtained by the above training device 220 can be applied to different systems or devices.
  • the execution device 210 may be configured with an input/output (I/O) interface 212 . Through the I/O interface 212 , the execution device 210 can perform data interaction with external devices of the execution device 210 .
  • a “user” may input data to I/O interface 212 through client device 240 .
  • the user can input the intermediate prediction result to the I/O interface 212 through the client device 240 , and then the client device 240 sends the intermediate prediction result obtained after certain processing to the execution device 210 through the I/O interface 212 .
  • the intermediate prediction result may be, for example, the target text corresponding to the input speech or the like.
• the training device 220 can generate corresponding target models/rules 201 based on different training data for different goals or different tasks, and the corresponding target models/rules 201 can be used to achieve the above-mentioned goals or complete the above-mentioned tasks, so as to provide the user with the desired result.
  • the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
• the execution device 210 may also divide the target model/rule 201 obtained by the training device 220 into sub-models/sub-rules, and deploy the obtained sub-models/sub-rules on the client device 240 and the execution device 210 respectively.
  • execution device 210 may send the personalized sub-model of target model/rule 201 to client device 240, which deploys it within the device.
  • the general sub-model of the target model/rule 201 has no parameters updated during the training process, and therefore does not change.
  • training device 220 may obtain training data through database 230 .
  • the training device 220 can train the training data to obtain a speech model.
  • the training device 220 may send the speech model obtained by training to the execution device 210, and the execution device 210 divides the speech model to obtain the personalized speech sub-model and the general speech sub-model.
  • the training device 220 may first divide the speech model obtained by training to obtain the personalized speech sub-model and the general speech sub-model, and send the personalized speech sub-model and the general speech sub-model to the execution device 210 .
  • the target model/rule 201 may be obtained by training on the basis of the basic speech model. During the training process, a part of the target model/rule 201 may be updated, and another part of the target model/rule 201 may not be updated. The updated portion of the target model/rule 201 may correspond to the personalized speech sub-model. The non-updated portion of the target model/rule 201 may correspond to a generic speech submodel.
  • the basic speech model may be pre-trained by the training device 220 using the speech, corpus, etc. of multiple people, or may be an existing speech model.
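• as a rough illustration of this split, the sketch below partitions a trained model's parameters into a personalized (updated) part and a general (unchanged) part by comparing them with the basic speech model; the flat-dictionary representation, the names, and the equality criterion are all illustrative assumptions.

```python
# Minimal sketch: split a trained model into a personalized sub-model (the
# parameters updated during training) and a general sub-model (the parameters
# left unchanged relative to the basic speech model). A model is represented
# here as a flat dict of named parameters; this is an illustrative assumption.

def split_model(trained_params, base_params):
    personalized, general = {}, {}
    for name, value in trained_params.items():
        if base_params.get(name) == value:
            general[name] = value        # not updated, so it does not change
        else:
            personalized[name] = value   # updated during training
    return personalized, general

base = {"encoder.w": [0.1, 0.2], "decoder.w": [0.5]}
trained = {"encoder.w": [0.1, 0.2], "decoder.w": [0.7]}  # decoder was fine-tuned

personalized, general = split_model(trained, base)
print(personalized)  # {'decoder.w': [0.7]}      -> deployed on the client device
print(general)       # {'encoder.w': [0.1, 0.2]} -> remains unchanged
```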
  • Client device 240 and computing module 211 may work together.
• the client device 240 and the computing module 211 may process the data input to the client device 240 and/or the data input to the execution device 210 (eg, the intermediate prediction results from the client device 240) according to the above-mentioned personalized speech sub-model and general speech sub-model.
  • the client device 240 may process the input user voice to obtain phonetic features or text features corresponding to the user voice; then, the client device 240 may input the phonetic features or text features to the computing module 211 .
• the preprocessing module 213 of the execution device 210 may receive the input speech from the I/O interface 212, and perform feature preprocessing and feature extraction on the input speech to obtain text features of the target text.
  • the preprocessing module 213 can input the text features of the target text into the calculation module 211.
  • the calculation module 211 can input the phonetic features or text features into the target model/rule 201, so as to obtain the output results of speech recognition (such as semantic recognition results, and operations corresponding to voice commands, etc.).
  • the computing module 211 can input the output result to the client device 240, so that the client device 240 can perform corresponding operations in response to the user's voice instruction.
• the I/O interface 212 can send the input data to the corresponding module of the execution device 210, and can also return the output result to the client device 240 to provide it to the user.
  • the I/O interface 212 can send the intermediate prediction result corresponding to the input speech to the computing module 211 , and can also return the result obtained after recognizing the speech to the client device 240 .
  • the user can input data such as speech, corpus, etc. into the client device 240, and can view the results output by the execution device 210 on the client device 240.
• the specific presentation form can be sound, or a combination of sound and display, and so on.
  • the client device 240 can also act as a data collection terminal to store the collected data such as speech and corpus in the database 230 .
  • the client device 240 may not be used for collection, but the user's voice, corpus and other data and the output result of the I/O interface 212 may be stored in the database 230 as new sample data by other devices.
  • the execution device 210 and the data storage system 250 may be integrated in different devices.
• the execution device 210 and the data storage system 250 can be integrated in the client device 240; and when the data processing capability of the client device 240 is not very strong, the execution device 210 and the data storage system 250 can be integrated in specialized data processing equipment.
• the database 230, the training device 220, and the data collection device 260 in FIG. 2 can be integrated in specialized data processing equipment, can be set on the cloud or on other servers on the network, or can be set on the client device 240 and the data processing equipment respectively.
  • FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in FIG. 2 does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210 , in other cases, the data storage system 250 may also be placed in the execution device 210 .
  • execution device 210 may reside in client device 240 .
• the general speech sub-model of the target model/rule 201 may be the speech model already present on the client device 240. After the client device 240 leaves the factory, the personalized speech sub-model of the target model/rule 201 can be updated according to the data collected by the client device 240.
• FIG. 3 and FIG. 4 are schematic diagrams of two voice interaction systems.
  • the voice interaction system 300 shown in FIG. 3 may include at least one first device.
  • the first device may be various types of voice interactive terminals.
  • the first device may acquire the user's voice instruction through the interactive interface.
  • the first device can recognize the voice command and obtain the recognized result.
  • the first device may perform a corresponding operation according to the recognized result to respond to the user's voice command.
  • the first device may also perform related processing such as machine learning, deep learning, model training, etc. through a memory for storing data and a processor for data processing.
  • the voice interaction system 300 shown in FIG. 3 may have a certain corresponding relationship with the voice interaction system 100 shown in FIG. 1 .
  • the first apparatus may be equivalent to the local device 101 or the local device 102 shown in FIG. 1 , for example.
  • the voice interaction system 300 shown in FIG. 3 may have a certain corresponding relationship with the system architecture 200 shown in FIG. 2 .
  • the first apparatus may be equivalent to the client device 240 shown in FIG. 2 , for example.
  • the first apparatus shown in FIG. 3 may have some or all of the functions of the execution device 210 shown in FIG. 2 .
  • voice recognition may mainly rely on the first device.
  • the first device has relatively strong speech recognition ability, natural language understanding ability, etc.
• the first device can usually respond quickly to the user's voice command. This means that the voice interaction system 300 shown in FIG. 3 places relatively high requirements on the processing capability of the first device. If the voice recognition capability, natural language understanding capability, etc. of the first device are relatively weak, the first device cannot accurately respond to the user's voice command, which may degrade the user's voice interaction experience.
  • the voice interaction system 400 shown in FIG. 4 may include at least one first device and at least one second device.
  • the first device may be various types of voice interactive terminals.
  • the second device may be a cloud device with speech recognition capability, natural language understanding capability, etc., such as a server.
  • the first device may receive or acquire the user's voice, and the voice may contain the user's voice instruction.
  • the first device may forward the user's voice to the second device.
  • the first device may obtain the result obtained after recognizing the speech from the second device.
  • the first device may perform an operation according to the result obtained after recognizing the voice, so as to respond to the user's voice command.
  • the second device can acquire the voice from the first device through the interactive interface, and perform voice recognition on the voice.
  • the second device may also forward the result obtained after recognizing the speech to the first device.
  • the memory shown in FIG. 4 may be a general term that includes local storage as well as a database that stores historical data.
  • the database in FIG. 4 can be on the second device or on other devices.
  • both the first device and the second device may perform related processing such as machine learning, deep learning, model training, and speech recognition through a memory for storing data and a processor for data processing.
  • the voice interaction system 400 shown in FIG. 4 may have a certain corresponding relationship with the voice interaction system 100 shown in FIG. 1 .
  • the first device may be equivalent to the local device 101 or the local device 102 shown in FIG. 1 , for example; the second device may be equivalent to the execution device 110 shown in FIG. 1 , for example.
  • the voice interaction system 400 shown in FIG. 4 may have a certain corresponding relationship with the system architecture 200 shown in FIG. 2 .
  • the first device may be equivalent to the client device 240 shown in FIG. 2 , for example; the second device may be equivalent to the execution device 210 shown in FIG. 2 , for example.
  • the voice recognition may mainly rely on the second device.
  • the first device may not process the speech, or only perform simple preprocessing.
  • the voice interaction system 400 shown in FIG. 4 is beneficial to reduce the requirement on the processing capability of the first device.
• the interaction between the second device and the first device may introduce a delay. This may prevent the first device from responding quickly to the user's voice command, which may further degrade the user's voice interaction experience.
  • FIG. 5 is a voice interaction system 500 provided by an embodiment of the present application.
  • the voice interaction system 500 may include a first device and a second device.
• the first device may be any of various types of voice interaction equipment, such as a car, a car machine, an on-board computer (on-board PC), a chip (such as an on-board chip or a voice processing chip), a processor, a mobile phone, a personal computer, a smart bracelet, a tablet, a smart camera, a set-top box, a game console, a voice-enabled in-car device, a smart car, a media consumption device, a smart home device, a smart voice assistant on a wearable device capable of speech, a smart speaker, or any other kind of machine or equipment capable of dialogue.
  • the first apparatus may be, for example, the local device 101 or the local device 102 shown in FIG. 1 , or a unit or module of the local device 101 or the local device 102 .
  • the first apparatus may be the client device 240 shown in FIG. 2 , or a unit or module of the client device 240 .
  • the first device may include, for example, a semantic recognition module, an operation decision module, and a transceiver module.
  • the semantic recognition module can be used to recognize the speech information from the speech sensor.
  • the voice information may be, for example, audio directly obtained by a voice sensor, or may be a processed signal that carries the content of the audio.
  • the semantic recognition module can output the semantic recognition result obtained after the semantic recognition.
  • the semantic recognition result can be structured information that can reflect the user's speech content.
  • the semantic recognition module can store, for example, a voice interaction model, such as an ASR model, an NLP model, and the like.
  • the semantic recognition module can perform semantic recognition on speech information through the speech interaction model.
  • the first transceiver module can be used to forward the voice information from the voice sensor to the second device, and obtain the voice analysis result indicated by the second device from the second device, and the voice analysis result can include, for example, a semantic recognition result and/or operation information .
  • the action decision module may be used to determine actions in response to the user.
  • the operation decision module can obtain the semantic recognition result from the semantic recognition module; the operation decision module can determine to execute the corresponding operation according to the semantic recognition result.
  • the operation decision module may obtain the speech analysis result indicated by the second device from the transceiver module, and then determine to perform the operation indicated by the second device.
  • the operation decision module may determine whether the operation in response to the speech information is instructed by the semantic recognition module or by the second device.
  • the second device may be independent of the first device.
  • the second device may be, for example, a remote service platform or a server.
  • the second means may be implemented by one or more servers.
  • the server may include, for example, one or more of a network server, an application server, a management server, a cloud server, an edge server, and a virtual server (a server virtualized by using multiple physical resources).
  • the second device may also be other devices with speech recognition capabilities, natural language understanding capabilities, etc., such as car machines, in-vehicle computers, chips (such as in-vehicle chips, voice processing chips, etc.), processors, mobile phones, personal computers, or tablet computers and other equipment with data processing functions.
  • the second device may be, for example, the execution device 110 shown in FIG. 1 , or a unit or module of the execution device 110 .
  • the second apparatus may be the execution device 210 shown in FIG. 2 , or a unit or module of the execution device 210 .
  • the second apparatus may include, for example, a second transceiver module and a voice analysis module.
  • the second transceiver module may be used to acquire voice information from the first device.
  • the second transceiver module may also be configured to send the speech analysis result output by the speech analysis module to the first device.
  • the speech analysis module may be configured to acquire speech information from the first device from the second transceiver module, and may perform speech analysis on the speech information.
• the speech analysis module can store speech interaction models, such as ASR models, NLP models, and the like.
  • the speech analysis module can perform speech processing on speech information through the speech interaction model.
  • the speech analysis module can output the semantic recognition result obtained after the semantic recognition.
  • the speech analysis module may analyze the semantic recognition result to obtain operation information corresponding to the speech information.
  • the operation information may be used to indicate an operation performed by the first device.
  • the representation form of the operation information may be, for example, an instruction.
  • FIG. 6 is a schematic flowchart of a voice interaction method 600 provided by an embodiment of the present application.
  • the method 600 shown in FIG. 6 can be applied to, for example, the voice interaction system 500 shown in FIG. 5 .
  • the first device performs a voice wake-up operation in response to a user's input operation.
  • the user may speak the wake word to the first device.
  • the first device can detect the wake-up word input by the user, and then wake up the voice interaction function of the first device.
  • the user may press a wake button to input an operation to the first device.
  • the first device can detect the user's input operation on the wake-up button, and then wake up the voice interaction function of the first device.
  • the first device may include a wake-up module, and the wake-up module may be used to detect a user's input operation.
  • the first device acquires first voice information from a voice sensor.
  • Voice sensors can be used to pick up the user's voice.
  • the voice sensor may be a device with recording capabilities.
  • the voice sensor may include, for example, a microphone.
  • the voice sensor may be a device with recording function and data processing capability.
  • the voice sensor can also perform related voice processing on the user's voice, such as noise reduction processing, amplification processing, coding and modulation processing, and the like.
  • the speech processing described above may also be performed by other modules.
  • the first voice information may be, for example, audio directly obtained by a voice sensor, or may be a processed signal that carries the content of the audio.
  • the user speaks a first voice instruction
  • the first voice instruction is used to instruct the first device to perform the target operation.
• through the voice sensor, the first voice instruction can be converted into the first voice information, so that the first voice information can be used to indicate the target operation.
  • the first device determines to execute the target operation indicated by the first voice information.
  • the first device can perceive or learn the specific meaning of the first voice instruction, and then can determine the target operation indicated by the first voice information.
  • the first device may perform the target operation in response to the user's first voice instruction.
• the first device determining, according to the first voice information, to execute the target operation indicated by the first voice information includes: the first device determining a first semantic recognition result according to the first voice information; and the first device determining to perform the first operation according to the first semantic recognition result.
  • the first operation may be an operation determined by the first device according to the first voice information.
  • the first operation may correspond to the above-mentioned target operation.
  • the process of converting the first voice information into the first semantic recognition result may include, for example: converting the first voice information including audio content into the first text information; and performing semantic extraction on the first text information to obtain a structured first semantic recognition result.
  • the first semantic recognition result may include one or more of the following structured information: function, intent, parameter.
  • the function of the first semantic recognition result may be a specific value of the function result.
  • the function of the first semantic recognition result may represent a class of policies.
  • the function of the first semantic recognition result may also be referred to as a domain.
  • An intent can be a concrete value for the result of the intent.
  • the parameter can be a concrete value of the slot result.
  • the user may speak a first voice instruction to the first device.
  • the first voice instruction may be converted into first text information, and the first text information may be, for example, "turn on device A".
  • Device A may be, for example, an air conditioner.
• the natural meaning of the first text information may indicate the first operation. However, based only on the first text information, the first device usually cannot directly determine the natural meaning of the first voice command, nor the operation indicated by the first voice command.
  • the first text information may be converted into a first semantic recognition result, and the first semantic recognition result may indicate a first operation.
  • the first semantic recognition result may include a first function, a first intent, and a first parameter; the first function may be, for example, "car control function"; the first intent may be "open”; the first parameter may be "Device A”.
  • the first device may determine according to the first semantic recognition result: the first operation may be an operation within the vehicle control function; the intention of the first operation may be to open or open; the object to be opened or opened may be device A.
• the first semantic recognition result may include a first function, a first intent, and a first parameter; the first function may be, for example, "car control function"; the first intent may be "open device A"; the first parameter can be "null".
• the first device may determine, according to the first semantic recognition result: the first operation may be an operation within the vehicle control function; the intention of the first operation may be to turn on the device A, or to put the device A in an on state; the first semantic recognition result may not include specific parameters of the on state of device A (such as temperature parameters, timing parameters, mode parameters, etc.).
  • the first semantic recognition result may include a first function and a first parameter; the first function may be, for example, a "vehicle control function"; and the first parameter may be "device A".
  • the first device may determine according to the first semantic recognition result: the first operation may be an operation within the vehicle control function; the object of the first operation may be the device A.
• the first semantic recognition result may include a first intent and a first parameter; the first intent may be, for example, "open"; the first parameter may be "device A".
  • the first apparatus may determine according to the first semantic recognition result: the intent of the first operation may be to open or open; the object to be opened or opened may be the device A.
  • the first semantic recognition result may include a first intent; the first intent may be "turn on device A".
  • the first device may determine according to the first semantic recognition result: the intention of the first operation may be to turn on the device A, or to make the device A in the on state; the first semantic recognition result may not include the specific parameters of the on state of the device A (such as temperature parameters, timing parameters, mode parameters, etc.).
  • the first semantic recognition result may include a first parameter; the first parameter may be "device A".
  • the first apparatus may determine, according to the first semantic recognition result, that the object of the first operation may be the device A.
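• the structured form used in the examples above can be pictured as a record with optional fields; the following minimal Python sketch (class and field names are illustrative assumptions) expresses the "turn on device A" variants in that form.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of the structured semantic recognition result described
# above: each of function (domain), intent, and parameter (slot value) may
# be present or absent. Names are illustrative assumptions.

@dataclass
class SemanticRecognitionResult:
    function: Optional[str] = None   # e.g. "car control function" (a domain)
    intent: Optional[str] = None     # e.g. "open" or "turn on device A"
    parameter: Optional[str] = None  # e.g. "device A" (a slot value)

# The variants discussed above for the first text information "turn on device A":
r1 = SemanticRecognitionResult("car control function", "open", "device A")
r2 = SemanticRecognitionResult("car control function", "open device A")
r3 = SemanticRecognitionResult(function="car control function", parameter="device A")
r4 = SemanticRecognitionResult(intent="open", parameter="device A")
r5 = SemanticRecognitionResult(intent="turn on device A")
r6 = SemanticRecognitionResult(parameter="device A")
```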
• the first device determining, according to the first voice information, to execute the target operation indicated by the first voice information includes: the first device determining a first semantic recognition result according to the first voice information; and the first device determining, according to the first semantic recognition result and a first preset condition, to execute the first operation determined by the first device according to the first semantic recognition result.
  • the first device may determine whether the first preset condition is satisfied, and then, according to the determination result, determine whether to execute the first operation determined by the first device according to the first semantic recognition result. For example, in the case that the first semantic recognition result satisfies the first preset condition, the first apparatus may execute 603a.
  • the first device sends the first voice information to the second device.
  • the first device is a mobile phone
  • the second device is a server. That is, the mobile phone can send the first voice information to the server.
  • the first device is a mobile phone
  • the second device is a mobile phone. That is, the mobile phone can send the first voice information to the mobile phone.
  • a possible way is that the first device sends the first voice information to the second device through another device.
  • Other devices may be, for example, mobile phones and the like.
• 603b and 603a may be performed synchronously, or sequentially within a certain period of time. The embodiment of the present application does not limit the execution order of 603b and 603a.
  • the first apparatus may further execute 603b.
  • Both the first device and the second device can perform semantic recognition on the first voice information.
  • the second device may send the sixth semantic recognition result to the first device according to the first voice information.
  • the second device may feed back the recognition result to the first device for the first voice information.
  • the first device and the second device may obtain the same or different semantic recognition results.
  • the first device can determine the first semantic recognition result according to the first voice information, and the first semantic recognition result can be used to indicate the first operation
• the second device can determine the sixth semantic recognition result according to the first voice information, and the sixth semantic recognition result can be used to indicate the second operation; at least one of the first operation and the second operation can correspond to the above target operation.
• the content of 603a describes some possible examples in which the first device determines the first semantic recognition result according to the first voice information; for the method by which the second device determines the sixth semantic recognition result according to the first voice information, refer to 603a, which will not be described in detail here.
  • the second device may also indicate the target operation to the first device by using the target operation information.
  • the target operation information can be represented as operation signaling, for example.
  • the first device may determine the first operation in response to the first voice instruction according to the first semantic recognition result analyzed by the first device itself.
  • the first device may acquire the sixth semantic recognition result and/or target operation information from the second device.
  • the sixth semantic recognition result and/or the target operation information may be determined by the second device according to the first voice information.
• the sixth semantic recognition result can be used to reflect the meaning of the first voice information and to indirectly instruct the first device to perform the second operation; the target operation information can be used to directly instruct the first device to perform the second operation. Since the sixth semantic recognition result and/or target operation information fed back by the second device may be delayed, the first device may acquire them after performing the first operation.
  • the first device can adjust its own semantic recognition model, voice control model, etc. according to the sixth semantic recognition result and/or the target operation information, so as to help improve the accuracy of the output semantic recognition result of the first device and optimize the response to the user's voice command. suitability for operation, etc.
  • the first device may discard the sixth semantic recognition result from the second device.
• the first device may discard the sixth semantic recognition result after receiving it. Alternatively, the first device may simply not receive the sixth semantic recognition result. That is, even if the first device can determine the operation to be performed according to the first voice information on its own, the second device may still provide feedback for the first voice information.
  • the first device may not determine whether it has the ability to recognize the voice information. In other words, the first device may not be able to determine the target operation indicated by the first voice information according to the first voice information.
  • the first device may send the first voice information to the second device in advance.
• the second device can recognize and process the first voice information to obtain a sixth semantic recognition result and/or target operation information; the sixth semantic recognition result can be used to reflect the meaning of the first voice information and to indirectly instruct the first device to perform the target operation; the target operation information can be used to directly instruct the first device to perform the target operation.
• if the first device can determine the target operation according to the first voice information, the first device can skip or discard the result fed back by the second device for the first voice information. If the first device cannot determine the target operation according to the first voice information, then, since the first device sent the first voice information to the second device in advance, the first device can obtain the sixth semantic recognition result and/or target operation information relatively quickly. This is beneficial for shortening the time for the first device to respond to the user's instruction.
  • only one of 603b and 603a is performed by the first device.
  • the first device can determine that there is one and only one of the following to be executed: the first device determines the first operation to be executed according to the first voice information; the first device determines the to-be-executed operation according to the result fed back by the second device Second operation.
  • the first device may determine the operation corresponding to the user's voice instruction without relying on the information provided by the second device. Therefore, it is beneficial to improve the efficiency of the first device in responding to the user's voice command. Additionally, the first device may choose not to send the first voice information to the second device. This is beneficial to reduce the number of signaling transmissions of the first device. In a voice interaction scenario in which the first device is relatively unfamiliar, the first device may determine the operation corresponding to the user's voice command according to the information provided by the second device. Therefore, it is beneficial to improve the accuracy with which the first device responds to the user's voice command.
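• one way to picture the interplay of 603a and 603b is the sketch below: the first device starts local recognition and, in parallel, forwards the voice information to the second device; if the local result is usable it discards the remote feedback, otherwise it waits for the second device. The concurrency scheme, the delays, and all function names are illustrative assumptions.

```python
import asyncio

# Minimal sketch: race on-device recognition (603a) against the second
# device's feedback (603b). recognize_locally / recognize_in_cloud are
# hypothetical stand-ins; the delays merely model processing vs. network time.

async def recognize_locally(voice_info):
    await asyncio.sleep(0.01)                # on-device processing delay
    if voice_info == "turn on device A":     # a command the device can handle
        return {"intent": "open", "parameter": "device A"}
    return None                              # cannot determine the operation

async def recognize_in_cloud(voice_info):
    await asyncio.sleep(0.2)                 # network + server delay
    return {"intent": "cloud-resolved", "parameter": voice_info}

async def respond(voice_info):
    cloud_task = asyncio.create_task(recognize_in_cloud(voice_info))  # 603b
    local = await recognize_locally(voice_info)                       # 603a
    if local is not None:
        cloud_task.cancel()    # skip/discard the second device's feedback
        return local
    return await cloud_task    # execute the operation the second device indicates

print(asyncio.run(respond("turn on device A")))   # resolved locally, low latency
print(asyncio.run(respond("translate this")))     # falls back to the second device
```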
• the first device determines a first semantic recognition result according to the first voice information; the first device determines, according to the first semantic recognition result and a first preset condition, to execute the second operation indicated by the second device.
  • the first device may determine whether the first preset condition is satisfied, and then, according to the determination result, determine whether to execute the second operation instructed by the second device. For example, if the first semantic recognition result does not satisfy the first preset condition, the first device may execute 603b and not execute 603a; if the first semantic recognition result satisfies the first preset condition, the first device may 603a is executed, and 603b is not executed.
• the following describes, through some examples, how the first device determines, according to the first semantic recognition result and the first preset condition, whether to perform the first operation determined by the first device according to the first semantic recognition result or the second operation indicated by the second device.
• if the first semantic recognition result satisfies the first preset condition, the first device may determine to perform the first operation determined by the first device according to the first semantic recognition result.
  • the first device may determine to perform a corresponding operation according to the first semantic recognition result, so that the first device may respond to the user's voice instruction.
• if the first semantic recognition result does not satisfy the first preset condition, the first device determines to execute the second operation indicated by the second device.
  • the first device may determine the operation to be performed according to the instruction of the second device, so that the first device may respond to the user's voice instruction.
  • the second device may indicate the second operation to the first device through the semantic recognition result and/or the operation information.
• determining to perform the second operation indicated by the second device according to the first semantic recognition result and the first preset condition includes: when the first semantic recognition result does not satisfy the first preset condition, determining to perform the second operation indicated by the second device.
  • the second device may recognize the first voice information to obtain a sixth semantic recognition result.
  • the second device may indirectly indicate the second operation to be performed by the first device to the first device through the sixth semantic recognition result, so that the first device can respond to the user's voice instruction. That is, the first device may determine, according to the sixth semantic recognition result, that the operation to be performed by the first device is the second operation.
  • the fact that the first semantic recognition result satisfies the first preset condition means that the function included in (corresponding to or indicated by) the first semantic recognition result is a function with a priority processing level preset by the first device.
  • the first device may preset multiple functions, and when the first function included in (corresponding to or indicated by) the first semantic recognition result belongs to the multiple functions, the first semantic recognition result satisfies the first preset condition.
• the method further includes: the first device obtains a first semantic recognition list, where the first semantic recognition list includes a plurality of functions; the first semantic recognition result satisfying the first preset condition includes: the first semantic recognition result includes a first function, and the first function belongs to the plurality of functions.
  • the fact that the first semantic recognition result does not satisfy the first preset condition means that the function included in (corresponding to or indicated by) the first semantic recognition result is not a function with a priority processing level preset by the first device.
  • the first device may preset multiple functions, and when the first function included (corresponding to or indicated by) in the first semantic recognition result does not belong to the multiple functions, the first semantic recognition result does not satisfy the first preset condition .
• the method further includes: the first device obtains a first semantic recognition list, where the first semantic recognition list includes a plurality of functions; the first semantic recognition result not satisfying the first preset condition includes: the first semantic recognition result includes a first function, and the first function does not belong to the plurality of functions.
  • the first device can perform semantic recognition on the first speech information to obtain a first semantic recognition result, and the first semantic recognition result can include the first function; the first device can search for the first function in the first semantic recognition list .
• if the first semantic recognition list includes the first function, the first device may determine that the first semantic recognition result satisfies the first preset condition, and then the first device may determine to execute the first operation determined by the first device according to the first semantic recognition result. If the first semantic recognition list does not include the first function, the first device may determine that the first semantic recognition result does not satisfy the first preset condition, and then the first device may determine to execute the second operation indicated by the second device.
  • the first semantic recognition list may be, for example, a list pre-stored by the first device.
  • the first semantic recognition list may include, for example, a plurality of functions supported by the first device.
  • the first device may have relatively higher semantic recognition capabilities for the plurality of functions in the first semantic recognition list.
• the first device may have relatively lower semantic recognition capabilities for functions outside the first semantic recognition list. If the first function is included in the first semantic recognition list, it may mean that the accuracy of the first semantic recognition result is relatively high. If the first function is not included in the first semantic recognition list, it may mean that the accuracy of the first semantic recognition result is relatively low.
  • the plurality of functions in the first semantic recognition list may include "car control functions". That is to say, for the "vehicle control function", the semantic recognition ability of the first device may be relatively good.
  • the first text message may be, for example, "open device A”.
  • the first text information may indicate "car control function”.
  • the first semantic recognition result may include a first function, and the first function may be a "vehicle control function”.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition according to the first semantic recognition list and the first semantic recognition result, and determine to execute the first operation determined by the first device according to the first semantic recognition result.
• the plurality of functions in the first semantic recognition list may include a "local translation function" but not a "cloud translation function". That is to say, for the "local translation function", the semantic recognition ability of the first device may be relatively good; for the "cloud translation function", the semantic recognition ability of the first device may be relatively poor (for example, the first device may not be able to learn current hot words in time, may be unable to translate all foreign languages, etc.).
  • the first text information may be, for example, "translate the following content in foreign language A", wherein "foreign language A" may not belong to the foreign language that can be translated by the first device.
  • the first semantic recognition result may include a first function, and the first function may be a "cloud translation function".
  • the first device may determine, according to the first semantic recognition list and the first semantic recognition result, that the first semantic recognition result does not satisfy the first preset condition.
  • the first device may determine to perform the second operation instructed by the second device.
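• a minimal sketch of this check follows; the list contents and names are illustrative assumptions. The first preset condition is treated as satisfied exactly when the function of the first semantic recognition result appears in the first semantic recognition list.

```python
# Minimal sketch: choose between the locally determined first operation and
# the second operation indicated by the second device, based on whether the
# first function belongs to the first semantic recognition list. The list
# contents are illustrative assumptions.

FIRST_SEMANTIC_RECOGNITION_LIST = {"car control function", "local translation function"}

def choose_operation(first_semantic_recognition_result):
    function = first_semantic_recognition_result.get("function")
    if function in FIRST_SEMANTIC_RECOGNITION_LIST:
        return "execute the first operation determined by the first device"
    return "execute the second operation indicated by the second device"

print(choose_operation({"function": "car control function"}))        # handled locally
print(choose_operation({"function": "cloud translation function"}))  # deferred to cloud
```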
  • the fact that the first semantic recognition result satisfies the first preset condition means that the intent included in (corresponding to or indicated by) the first semantic recognition result is an intent with a priority processing level preset by the first device.
  • the first device may preset multiple intents, and when the first intent included (corresponding to or indicated by) in the first semantic recognition result belongs to the multiple intents, the first semantic recognition result satisfies the first preset condition.
• the method further includes: the first device obtains a second semantic recognition list, where the second semantic recognition list includes a plurality of intents; the first semantic recognition result satisfying the first preset condition includes: the first semantic recognition result includes a first intent, and the first intent belongs to the plurality of intents.
  • the fact that the first semantic recognition result does not satisfy the first preset condition means that the intent included (corresponding to or indicated by) in the first semantic recognition result is not the intent that the first device presets to have a priority processing level.
  • the first device may preset multiple intents, and when the first intent included (corresponding to or indicated by) in the first semantic recognition result does not belong to the multiple intents, the first semantic recognition result does not satisfy the first preset condition .
• the method further includes: the first device obtains a second semantic recognition list, where the second semantic recognition list includes a plurality of intents; the first semantic recognition result not satisfying the first preset condition includes: the first semantic recognition result includes a first intent, and the first intent does not belong to the plurality of intents.
  • the first device may perform semantic recognition on the first speech information to obtain a first semantic recognition result, and the first semantic recognition result may include the first intent; the first device may search for the first intent in the second semantic recognition list.
• if the second semantic recognition list includes the first intent, the first device may determine that the first semantic recognition result satisfies the first preset condition, and then the first device may determine to execute the first operation determined by the first device according to the first semantic recognition result. If the second semantic recognition list does not include the first intent, the first device may determine that the first semantic recognition result does not satisfy the first preset condition, and then the first device may determine to execute the second operation indicated by the second device.
  • the second semantic recognition list may be, for example, a list pre-stored by the first device.
  • the second semantic recognition list may include, for example, a plurality of intents supported by the first device.
  • the first device may have relatively higher semantic recognition capabilities for the plurality of intents in the second semantic recognition list.
• the first device may have relatively lower semantic recognition capabilities for intents outside the second semantic recognition list. If the first intent is included in the second semantic recognition list, it may mean that the accuracy of the first semantic recognition result is relatively high. If the first intent is not included in the second semantic recognition list, it may mean that the accuracy of the first semantic recognition result is relatively low.
  • the plurality of intents in the second semantic recognition list may include "turn on device A.” That is to say, for the intention of "opening device A", the semantic recognition ability of the first device may be relatively good.
  • the first text message may be, for example, "open device A”.
  • the first text message may indicate "turn on device A”.
  • the first semantic recognition result may include a first intent, and the first intent may be "turn on device A”.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition according to the second semantic recognition list and the first semantic recognition result, and determine to execute the first operation determined by the first device according to the first semantic recognition result.
  • the multiple intents in the second semantic recognition list may include the "local audio playback intent" but not the "cloud audio playback intent". That is to say, for the "local audio playback intent", the semantic recognition capability of the first device may be relatively good (for example, the first device may recognize the audio resources indicated in the voice command according to the locally stored audio data); for the "cloud audio playback intent", the semantic recognition capability of the first device may be relatively poor (for example, the first device may not support recognition of cloud audio data, etc.).
  • the first text information may be, for example, "play from 1 minute of song A", where “song A” may belong to cloud audio data.
  • the first semantic recognition result may include a first intent, and the first intent may be "play song A”.
  • the first device may determine that the first semantic recognition result does not satisfy the first preset condition according to the second semantic recognition list and the first semantic recognition result.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first semantic recognition result satisfying the first preset condition means that the parameter included in (corresponding to or indicated by) the first semantic recognition result indicates a function, intent, scene, device, or location that the first device has preset as having a priority processing level.
  • the first device may preset multiple parameters, and when the first parameter included in the first semantic recognition result belongs to the multiple parameters, the first semantic recognition result satisfies the first preset condition.
  • the method further includes: the first device obtains a third semantic recognition list, where the third semantic recognition list includes a plurality of parameters; the first semantic recognition result satisfying the first preset condition includes: the first semantic recognition result includes a first parameter, and the first parameter belongs to the plurality of parameters.
  • the first semantic recognition result not satisfying the first preset condition means that the parameter included in (corresponding to or indicated by) the first semantic recognition result indicates a function, intent, scene, device, or location that the first device has not preset as having a priority processing level.
  • the first device may preset multiple parameters, and when the first parameter included in the first semantic recognition result does not belong to the multiple parameters, the first semantic recognition result does not satisfy the first preset condition.
  • the method further includes: the first device obtains a third semantic recognition list, where the third semantic recognition list includes a plurality of parameters; the first semantic recognition result not satisfying the first preset condition includes: the first semantic recognition result includes a first parameter, and the first parameter does not belong to the plurality of parameters.
  • the first device may perform semantic recognition on the first speech information to obtain a first semantic recognition result, and the first semantic recognition result may include the first parameter; the first device may search the third semantic recognition list for the first parameter.
  • if the third semantic recognition list includes the first parameter, the first device may determine that the first semantic recognition result satisfies the first preset condition, and then the first device may determine to execute the first operation determined by the first device according to the first semantic recognition result.
  • if the third semantic recognition list does not include the first parameter, the first device may determine that the first semantic recognition result does not satisfy the first preset condition, and then the first device may determine to execute the second operation instructed by the second device.
  • the third semantic recognition list may be, for example, a list pre-stored by the first device.
  • the third semantic recognition list may include, for example, a plurality of parameters supported by the first device.
  • the first device may have a relatively higher semantic recognition capability for the plurality of parameters in the third semantic recognition list.
  • the first device may have relatively lower semantic recognition capabilities for parameters other than the third semantic recognition list. If the first parameter is included in the third semantic recognition list, it may mean that the accuracy of the first semantic recognition result is relatively high. If the first parameter is not included in the third semantic recognition list, it may mean that the accuracy of the first semantic recognition result is relatively low.
  • the plurality of parameters in the third semantic recognition list may include "device A”. That is to say, for the parameter "device A", the semantic recognition ability of the first device may be relatively good.
  • the first text message may be, for example, "open device A”.
  • the first text information may indicate "device A”.
  • the first semantic recognition result may include a first parameter, and the first parameter may be "device A”.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition according to the third semantic recognition list and the first semantic recognition result, and determine to execute the first operation determined by the first device according to the first semantic recognition result.
  • the plurality of parameters in the third semantic recognition list may not include "position A”. That is to say, for the parameter "Location A", the semantic recognition ability of the first device may be relatively poor.
  • the first text information may be, for example, "navigate to location B, and approach location A”.
  • the first semantic recognition result may include a first parameter, and the first parameter may be "position A”.
  • the first device may determine that the first semantic recognition result does not satisfy the first preset condition according to the third semantic recognition list and the first semantic recognition result.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first semantic recognition result satisfying the first preset condition may include at least two of the following: the first semantic recognition result includes a first function, and the first function belongs to the first semantic recognition list; The first semantic recognition result includes a first intent, and the first intent belongs to the second semantic recognition list; the first semantic recognition result includes a first parameter, and the first parameter belongs to the third semantic recognition list.
  • the first function of the first semantic recognition result is "car control function”
  • the first intent of the first semantic recognition result is "open”
  • the first parameter of the first semantic recognition result is "device A”.
  • in this case, the first semantic recognition result can satisfy the first preset condition, as sketched below.
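A compact way to read the "at least two of the following" condition is as a vote over three membership tests. The sketch below is illustrative only; the list contents and the `required_hits` threshold are assumptions, not the patent's implementation.

```python
# Hypothetical contents of the three semantic recognition lists.
FIRST_LIST_FUNCTIONS = {"car control function"}
SECOND_LIST_INTENTS = {"open"}
THIRD_LIST_PARAMETERS = {"device A"}

def satisfies_first_preset_condition(function, intent, parameter, required_hits=2):
    """Count how many of the three membership tests pass."""
    hits = sum([
        function in FIRST_LIST_FUNCTIONS,
        intent in SECOND_LIST_INTENTS,
        parameter in THIRD_LIST_PARAMETERS,
    ])
    return hits >= required_hits

# "car control function" / "open" / "device A" all hit, so the first
# preset condition is satisfied and the first operation runs locally.
print(satisfies_first_preset_condition("car control function", "open", "device A"))
```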
  • the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list may be, for example, three mutually independent lists.
  • the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list may be from the same list.
  • the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list may belong to one general list.
  • the general list may include, for example, a plurality of sublists, and the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list may be, for example, three sublists of the general list.
  • the first semantic recognition list further includes multiple intents corresponding to the first function, and the first semantic recognition result satisfying the first preset condition further includes: the first semantic recognition result further includes a first intent belonging to the plurality of intents.
  • the first semantic recognition list further includes multiple intents corresponding to the first function, and the first semantic recognition result not satisfying the first preset condition further includes: the first semantic recognition result further includes a first intent that does not belong to the plurality of intents.
  • the intents corresponding to the first function are generally limited.
  • the navigation function may correspond to intents such as the route planning intent and the voice packet intent; the navigation function generally does not correspond to the hardware-on intent or the hardware-off intent.
  • the audio function may, for example, correspond to the intent of playing audio, the intent of lyrics, and the like; the audio function usually does not correspond to intents such as path planning.
  • the first apparatus may pre-record the correspondence between the multiple functions and the multiple intents in the first semantic recognition list.
  • if the first function and the first intent of the first semantic recognition result both belong to the first semantic recognition list, and the first function corresponds to the first intent in the first semantic recognition list, the first semantic recognition result may satisfy the first preset condition.
  • if the first function and the first intent belong to the first semantic recognition list but do not correspond to each other in it, the first semantic recognition result can be judged not to meet the first preset condition.
  • if the first function and the first intent of the first semantic recognition result do not belong to the first semantic recognition list, it may be determined that the first semantic recognition result does not satisfy the first preset condition.
  • the first function of the first semantic recognition result is "car control function", and the first intent of the first semantic recognition result is "open device A"; in the case that both "car control function" and "open device A" belong to the first semantic recognition list, and "open device A" corresponds to "car control function" in the first semantic recognition list, the first semantic recognition result may satisfy the first preset condition.
  • the first device may determine to perform the first operation determined by the first device according to the first semantic recognition result.
  • the first function of the first semantic recognition result is "navigation function”
  • the first intention of the first semantic recognition result is "play song A”.
  • the title of "Song A” and the place name of "Location A” may be the same in text. That is, the same words can represent different meanings.
  • the plurality of functions of the first semantic recognition list may include the "navigation function"; the plurality of intents of the first semantic recognition list may include "play song A".
  • "navigation function" does not correspond to "play song A”.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first function of the first semantic recognition result is "car control function", and the first intent of the first semantic recognition result is "open device B".
  • the multiple functions of the first semantic recognition list may include "vehicle control function”; the multiple intentions of the first semantic recognition list may not include “open device B" (for example, the device B is not set in the vehicle). Thereby, it can be determined that the first semantic recognition result does not satisfy the first preset condition.
  • the first device may determine to perform the second operation instructed by the second device.
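The function-to-intent correspondence described above can be pictured as a mapping from each function to the intents recorded for it; membership alone is not enough, the pair must also correspond. A minimal sketch with hypothetical list contents:

```python
# Hypothetical first semantic recognition list with function-to-intent
# correspondences recorded in advance, as described in the bullets above.
FIRST_SEMANTIC_RECOGNITION_LIST = {
    "car control function": {"open device A", "turn on the air conditioner"},
    "navigation function": {"route planning intent"},
    "audio function": {"play song A"},
}

def condition_met(function: str, intent: str) -> bool:
    """The first intent must be recorded as corresponding to the first function."""
    return intent in FIRST_SEMANTIC_RECOGNITION_LIST.get(function, set())

print(condition_met("car control function", "open device A"))  # True: handle locally
print(condition_met("navigation function", "play song A"))     # False: defer to cloud
```

The second call mirrors the "song A" / "location A" homonym example: both the function and the intent appear somewhere in the list, but they do not correspond, so the second device handles the command.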
  • the first semantic recognition list further includes a plurality of parameters corresponding to the first function, and the first semantic recognition result satisfying the first preset condition further includes: the first semantic recognition result further includes a first parameter, and the first parameter belongs to the plurality of parameters.
  • the first semantic recognition list further includes a plurality of parameters corresponding to the first function, and the first semantic recognition result not satisfying the first preset condition further includes: the first semantic recognition result further includes a first parameter that does not belong to the plurality of parameters.
  • the parameters corresponding to the first function are usually limited.
  • the navigation function may correspond to parameters such as location; the navigation function usually does not correspond to parameters such as audio playback mode.
  • the audio function may correspond to parameters such as singers and songs, for example; the audio function usually does not correspond to parameters such as temperature.
  • the first device may pre-record the correspondence between the plurality of functions and the plurality of parameters in the first semantic recognition list.
  • if the first function and the first parameter of the first semantic recognition result both belong to the first semantic recognition list, and the first function corresponds to the first parameter in the first semantic recognition list, the first semantic recognition result may satisfy the first preset condition.
  • if the first function and the first parameter belong to the first semantic recognition list but do not correspond to each other in it, the first semantic recognition result can be judged not to meet the first preset condition.
  • if the first function and the first parameter of the first semantic recognition result do not belong to the first semantic recognition list, it may be determined that the first semantic recognition result does not satisfy the first preset condition.
  • the first function of the first semantic recognition result is "temperature control function", and the first parameter of the first semantic recognition result is "28°C"; in the case that both "temperature control function" and "28°C" belong to the first semantic recognition list, and "temperature control function" corresponds to "28°C" in the first semantic recognition list, the first semantic recognition result can satisfy the first preset condition.
  • the first device may determine to perform the first operation determined by the first device according to the first semantic recognition result.
  • the first function of the first semantic recognition result is "audio function”
  • the first parameter of the first semantic recognition result is "HD playback”.
  • the plurality of functions of the first semantic identification list may include “audio function”; the plurality of parameters of the first semantic identification list may include “high definition playback”.
  • "audio function” does not correspond to "HD playback”.
  • "audio function” can correspond to "standard playback”, “high-quality playback”, and “lossless playback”; in the first semantic recognition list, “audio function” can correspond to "standard definition playback", “High-definition playback”, “Ultra-HD playback”, “Blu-ray playback”, etc.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first function of the first semantic recognition result is "playing function", and the first parameter of the first semantic recognition result is "singer A".
  • the plurality of functions of the first semantic recognition list may include a "play function”; the plurality of parameters of the first semantic recognition list may not include "Singer A” (eg, the first device has not previously played a song by singer A). Thereby, it can be determined that the first semantic recognition result does not satisfy the first preset condition.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first semantic recognition result satisfies a first preset condition, further comprising: in the first semantic recognition list, the first parameter corresponds to the first intent.
  • the first semantic recognition result does not meet the first preset condition, and further includes: in the first semantic recognition list, the first parameter does not correspond to the first intent.
  • the first function and the first intent may have a corresponding relationship;
  • the first function and the first parameter may have a corresponding relationship;
  • the first intent and the first parameter may also have a corresponding relationship.
  • the first function of the first semantic recognition result is "car control function”
  • the first intent of the first semantic recognition result is “turn on the air conditioner”
  • the first parameter of the first semantic recognition result is "28°C”
  • if "car control function", "turn on the air conditioner", and "28°C" all belong to the first semantic recognition list, and in the first semantic recognition list "car control function" corresponds to "turn on the air conditioner" and "turn on the air conditioner" corresponds to "28°C", then the first semantic recognition result may satisfy the first preset condition.
  • the first device may determine to perform the first operation determined by the first device according to the first semantic recognition result.
  • the first function of the first semantic recognition result is "car control function"
  • the first intention of the first semantic recognition result is “turn on the air conditioner”
  • the first parameter of the first semantic recognition result is "5°C”.
  • the plurality of functions of the first semantic recognition list may include “vehicle control functions”.
  • the plurality of intents of the first semantic recognition list may include "turn on the air conditioner".
  • the plurality of parameters of the first semantic recognition list may include "5°C”.
  • "vehicle control function” may correspond to "turn on the air conditioner”
  • “vehicle control function” may correspond to "5°C”.
  • "turn on the air conditioner” and "5°C" may not correspond.
  • the first device may determine to perform the second operation instructed by the second device.
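The air-conditioner example can be read as a chain of correspondences: the function must correspond to the intent, and the intent to the parameter. The sketch below assumes hypothetical list contents (including the "16°C" entry) purely for illustration.

```python
# Hypothetical correspondences from the first semantic recognition list:
# function -> intents, and intent -> parameters it may legally carry.
FUNCTION_TO_INTENTS = {"car control function": {"turn on the air conditioner"}}
INTENT_TO_PARAMETERS = {"turn on the air conditioner": {"16°C", "28°C"}}  # no "5°C"

def condition_met(function: str, intent: str, parameter: str) -> bool:
    """Every link of the function-intent-parameter chain must correspond."""
    return (intent in FUNCTION_TO_INTENTS.get(function, set())
            and parameter in INTENT_TO_PARAMETERS.get(intent, set()))

print(condition_met("car control function", "turn on the air conditioner", "28°C"))  # True
print(condition_met("car control function", "turn on the air conditioner", "5°C"))   # False
```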
  • the second semantic recognition list further includes a plurality of parameters corresponding to the first intent, and the first semantic recognition result satisfying the first preset condition further includes: the first semantic recognition result further includes a first parameter, and the first parameter belongs to the plurality of parameters.
  • the second semantic recognition list further includes a plurality of parameters corresponding to the first intent, and the first semantic recognition result not satisfying the first preset condition further includes: the first semantic recognition result further includes a first parameter that does not belong to the plurality of parameters.
  • the parameters corresponding to the first intent are usually limited.
  • the hardware-on intent may correspond to parameters such as hardware identifiers; the hardware-on intent usually does not correspond to parameters such as location.
  • the route planning intent may correspond to parameters such as location, for example; the route planning intent usually does not correspond to parameters such as songs.
  • the first apparatus may pre-record the correspondence between the multiple intents and the multiple parameters in the second semantic recognition list.
  • if the first intent and the first parameter of the first semantic recognition result both belong to the second semantic recognition list, and the first intent corresponds to the first parameter in the second semantic recognition list, the first semantic recognition result may satisfy the first preset condition.
  • if the first intent and the first parameter belong to the second semantic recognition list but do not correspond to each other in it, the first semantic recognition result can be judged not to meet the first preset condition.
  • if the first intent and the first parameter of the first semantic recognition result do not belong to the second semantic recognition list, it may be determined that the first semantic recognition result does not satisfy the first preset condition.
  • the first intent of the first semantic recognition result is "open device A", and the first parameter of the first semantic recognition result is "1 hour"; in the case that both "open device A" and "1 hour" belong to the second semantic recognition list, and "open device A" corresponds to "1 hour" in the second semantic recognition list, the first semantic recognition result can satisfy the first preset condition.
  • the first device may determine to perform the first operation determined by the first device according to the first semantic recognition result.
  • the first intent of the first semantic recognition result is "play audio”
  • the first parameter of the first semantic recognition result is "photo A”.
  • the plurality of intents of the second semantic recognition list may include "play audio".
  • the plurality of parameters of the second semantic recognition list may include "Photo A”.
  • “play audio” and “photo A” may not correspond.
  • “playing audio” may, for example, correspond to parameters such as “singer”, “song”, and “playlist”; in the second semantic recognition list, “photo A” "Upload” and other intents correspond.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first intention of the first semantic recognition result is "play video”
  • the first parameter of the first semantic recognition result is "actor A”.
  • the multiple intents of the second semantic identification list may include "play video”; the multiple parameters of the second semantic identification list may not include "actor A” (eg, the first device has not previously played actor A's film and television work). Thereby, it can be determined that the first semantic recognition result does not satisfy the first preset condition.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first semantic recognition result may indicate at least two of the first function, the first intent, and the first parameter.
  • the first semantic recognition result may indicate the first function and the first parameter; then, when the first function indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result, the first semantic recognition result satisfies the first preset condition; when the first function indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to different parameter types, the first semantic recognition result does not satisfy the first preset condition.
  • the first semantic recognition result may indicate the first intent and the first parameter, then when the first intent indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result, the first The semantic recognition result satisfies the first preset condition; when the first intent indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to different parameter types, the first semantic recognition result does not satisfy the first preset condition.
  • the third semantic recognition list is further used to indicate that the first parameter corresponds to a first parameter type
  • the method further includes: acquiring a first semantic recognition list, where the first semantic recognition list includes multiple functions and a plurality of parameter types corresponding to the plurality of functions; the first semantic recognition result satisfying the first preset condition further includes: the first semantic recognition result further includes a first function, the first function belongs to the plurality of functions, and in the first semantic recognition list, the first function corresponds to the first parameter type.
  • the third semantic recognition list is further used to indicate that the first parameter corresponds to a first parameter type
  • the method further includes: acquiring a first semantic recognition list, where the first semantic recognition list includes multiple functions and a plurality of parameter types corresponding to the plurality of functions; the first semantic recognition result not satisfying the first preset condition further includes: the first semantic recognition result further includes a first function, the first function belongs to the plurality of functions, and in the first semantic recognition list, the first function does not correspond to the first parameter type.
  • the first device may record the function and the corresponding one or more parameter types in advance.
  • the correspondence (or association) of the plurality of functions and the plurality of parameter types may be stored in the first semantic recognition list.
  • the parameter type corresponding to the vehicle control function may be time, temperature, hardware identification, etc.
  • the parameter type corresponding to the temperature control function may be temperature, etc.
  • the parameter type corresponding to the navigation function may be location, time, and the like.
  • the parameter type corresponding to the audio function may be singer, song, playlist, time, audio playback mode, and the like.
  • the parameter type corresponding to the video function may be a movie, a TV series, an actor, a time, a video playback mode, and the like.
  • the first device may store parameters and corresponding one or more parameter types in advance.
  • the correspondence or association of the plurality of parameters and the plurality of parameter types may be stored in the third semantic recognition list.
  • the parameter types corresponding to air conditioners, cameras, seats, windows, etc. may be hardware identifiers.
  • the parameter type corresponding to 5°C, 28°C, etc. may be temperature.
  • the parameter type corresponding to 1 hour, 1 minute, etc. may be time.
  • the parameter types corresponding to position A, position B, etc. may be positions.
  • the parameter types corresponding to singer A, singer B, etc. may be singers.
  • the parameter types corresponding to song A, song B, etc. may be songs.
  • the parameter types corresponding to playlist A, playlist B, etc. may be playlists.
  • the parameter type corresponding to standard playback, high-quality playback, lossless playback, etc. may be an audio playback mode.
  • the parameter types corresponding to movie A, movie B, etc. may be movies.
  • the parameter types corresponding to TV series A and TV series B may be TV series.
  • the parameter types corresponding to actor A, actor B, etc. may be actors.
  • the parameter type corresponding to standard definition playback, high definition playback, ultra-HD playback, Blu-ray playback, etc. may be a video playback mode.
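The parameter-to-parameter-type associations enumerated above amount to a lookup table. A minimal sketch restating some of the example entries (the table contents are illustrative, not exhaustive):

```python
# A sketch of the third semantic recognition list as a
# parameter-to-parameter-type table (entries taken from the bullets above).
PARAMETER_TYPES = {
    "air conditioner": "hardware identifier",
    "5°C": "temperature", "28°C": "temperature",
    "1 hour": "time", "1 minute": "time",
    "position A": "position", "position B": "position",
    "singer A": "singer",
    "song A": "song",
    "playlist A": "playlist",
    "lossless playback": "audio playback mode",
    "movie A": "movie",
    "TV series A": "TV series",
    "actor A": "actor",
    "Blu-ray playback": "video playback mode",
}

print(PARAMETER_TYPES["28°C"])    # temperature
print(PARAMETER_TYPES["song A"])  # song
```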
  • the user can implement a certain type of function of the first device through a variety of voice information.
  • the slots in the voice information are usually filled with parameters of a limited set of types.
  • the number of parameters corresponding to one parameter type can be any number.
  • the user may instruct the first device to perform operations related to the navigation function through a voice command.
  • the voice instructions can be converted into semantic recognition results related to the navigation function.
  • the parameter type corresponding to the slot of the semantic recognition result may be, for example, parameter types such as position and time.
  • the parameter type corresponding to the slot of the semantic recognition result may not be temperature, song, playlist, audio playback mode, movie, TV series, video playback mode, or the like.
  • the user may instruct the first device to perform operations related to the vehicle control function through a voice command.
  • the voice commands can be converted into semantic recognition results related to vehicle control functions.
  • the parameter type corresponding to the slot of the semantic recognition result may be, for example, parameter types such as time, temperature, and hardware identification.
  • the parameter type corresponding to the slot of the semantic recognition result may not be location, singer, song, playlist, audio playback mode, movie, TV series, actor, video playback mode, or the like.
  • the first semantic recognition result may include a first function and a first parameter.
  • the first device may determine a corresponding operation according to the first semantic recognition result obtained by itself.
  • the first semantic recognition list may include a navigation function
  • the parameter type corresponding to the navigation function may include position
  • the parameter type corresponding to position A can be position.
  • the user may instruct the first device to perform an operation related to the navigation function through a voice instruction "navigate to location A".
  • the voice instruction may be converted into a first semantic recognition result
  • the first function of the first semantic recognition result may be a navigation function
  • the first parameter of the first semantic recognition result may be location A.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition according to the first semantic recognition list and the third semantic recognition list.
  • the first device may determine the first operation to be performed by the first device according to the first semantic recognition result.
  • if the first device has obtained in advance that the first parameter corresponds to the first parameter type but the first function does not correspond to the first parameter type, the accuracy of the first device's analysis of the first voice information may be relatively low, and the specific meaning of the first parameter may be recognized wrongly. In this case, if the first device only determines the corresponding operation according to the first semantic recognition result obtained by itself, a response error may occur. The first device may therefore choose to perform the operation instructed by the second device in response to the user's voice command.
  • the first semantic recognition list may include a navigation function, and in the first semantic recognition list, the parameter type corresponding to the navigation function may include location;
  • the third semantic recognition list may include song A, and in the third semantic recognition list, the parameter type corresponding to song A may be song.
  • song A may have the same name as position A, but position A is not included in the third semantic recognition list.
  • the user may instruct the first device to perform an operation related to the navigation function through a voice instruction "navigate to location A".
  • since location A has the same name as song A, the first device may identify location A as song A.
  • the first function of the first semantic recognition result may be a navigation function; the first parameter of the first semantic recognition result may be song A.
  • the first device may determine that the first semantic recognition result does not satisfy the first preset condition according to the first semantic recognition list and the third semantic recognition list.
  • the first device may determine to perform the second operation instructed by the second device.
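The "location A" / "song A" homonym shows why the type check matters: the two names may be textually identical, but their recorded parameter types differ. A minimal sketch under the assumptions of the example (only "song A" is stored locally; all names are illustrative):

```python
# Hypothetical lists: the first list maps functions to the parameter
# types they accept; the third list maps parameters to their types.
FUNCTION_PARAM_TYPES = {"navigation function": {"position", "time"}}
PARAMETER_TYPES = {"song A": "song"}  # "location A" itself is not stored locally

def condition_met(function: str, parameter: str) -> bool:
    """Satisfied only if the parameter's recorded type is one the function accepts."""
    param_type = PARAMETER_TYPES.get(parameter)
    return param_type in FUNCTION_PARAM_TYPES.get(function, set())

# "navigate to location A" is misrecognized as the homonymous "song A";
# the type mismatch (song vs. position) fails the first preset condition,
# so the command is routed to the second device.
print(condition_met("navigation function", "song A"))  # False
```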
  • the first device may revise the semantic recognition result: the first device determines a semantic recognition result (referred to as the second semantic recognition result for distinction) according to the first voice information, where the second semantic recognition result indicates the second function and the second parameter (which may be the same parameter as the first parameter above); in the case that the preset multiple functions do not include the second function and the preset multiple parameters include the second parameter, the first device modifies the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, where the first function and the second function are two different functions of the same type.
  • the determining a first semantic recognition result according to the first voice information includes: determining a second semantic recognition result according to the first voice information, where the second semantic recognition result includes the second function and the first parameter; if the first semantic recognition list does not include the second function, modifying, according to the third semantic recognition list, the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, where the first function and the second function are two different functions of the same type.
  • the first device may recognize the first voice information to obtain a second semantic recognition result.
  • the second semantic recognition result may be, for example, an initial result obtained after semantic recognition.
  • if the second function of the second semantic recognition result does not belong to the first semantic recognition list, it means that the first device may have a relatively weak semantic recognition capability for the second function.
  • the first device may have learning, training capabilities, for example.
  • the first device may learn the voice instructions related to the second function for many times, thereby gradually improving the semantic recognition ability for the second function.
  • some voice commands related to the second function may not be recognized by the first device at first, but as the first device learns and trains on the voice commands related to the second function, those voice commands can gradually be recognized by the first device.
  • the number of voice commands related to the second function that can be accurately recognized by the first device may gradually increase.
  • the parameter types corresponding to the second function may be relatively diverse, and the first device may not be able to completely store all parameters corresponding to the second function. That is, the probability that the first voice information indicating the second function can be accurately recognized by the first device may be relatively low.
  • if the first parameter belongs to the third semantic recognition list, it means that the first device may have previously learned the first parameter. Then, the first device can modify the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result.
  • the first semantic recognition result may be a revised semantic recognition result.
  • the first function and the second function should be two different functions of the same type.
  • the first function may be a local translation function
  • the second function may be a cloud translation function. Both the first function and the second function may belong to translation type functions.
  • the first function may be a local navigation function
  • the second function may be a cloud navigation function. Both the first function and the second function may be functions of the navigation type.
  • the first function may be a local audio function
  • the second function may be a cloud audio function. Both the first function and the second function may belong to audio playback type functions.
  • the first function may be a local video function
  • the second function may be a cloud video function. Both the first function and the second function may belong to video playback type functions.
  • the first semantic recognition list may include the local navigation function but not the cloud navigation function, and in the first semantic recognition list, the parameter type corresponding to the local navigation function may include location; it is assumed that the third semantic recognition list may include location A, and in the third semantic recognition list, the parameter type corresponding to location A may be location.
  • the user may instruct the first device to perform an operation related to the navigation function through a voice instruction "navigate to location A".
  • the voice instruction may be converted into a second semantic recognition result, and the second function of the second semantic recognition result may be a cloud navigation function; the first parameter of the second semantic recognition result may be location A.
  • the first device can modify the second function of the second semantic recognition result to the local navigation function to obtain the first semantic recognition result, where the first function of the first semantic recognition result is the local navigation function.
  • Both the local navigation function and the cloud navigation function belong to the navigation type, and the local navigation function and the cloud navigation function can be two different functions.
  • the first device may determine whether the first semantic recognition result satisfies the first preset condition according to the first semantic recognition list and the third semantic recognition list. Since the local navigation function belongs to the first semantic recognition list, the parameter type corresponding to the local navigation function in the first semantic recognition list includes location, and the parameter type corresponding to location A in the third semantic recognition list is location. Therefore, the first device can determine that the first semantic recognition result satisfies the first preset condition.
  • the first device may determine the first operation to be performed by the first device according to the first semantic recognition result.
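The revision step described above swaps an unknown (for example, cloud-side) function for a locally supported function of the same type when the parameter itself is well known. All names below are hypothetical; this is a sketch of the rule, not the patent's implementation.

```python
# Hypothetical revision step: if the recognized function is unknown locally
# but the parameter is well known, swap in the same-type local function.
FIRST_LIST = {"local navigation function": {"position"}}   # function -> parameter types
THIRD_LIST = {"location A": "position"}                    # parameter -> parameter type
SAME_TYPE_LOCAL_FUNCTION = {"cloud navigation function": "local navigation function"}

def revise_function(function: str, parameter: str) -> str:
    """Return the (possibly revised) function of the first semantic result."""
    if function not in FIRST_LIST and parameter in THIRD_LIST:
        return SAME_TYPE_LOCAL_FUNCTION.get(function, function)
    return function

# "navigate to location A" was initially parsed as a cloud navigation
# function; because "location A" is a known parameter, the function is
# revised to the same-type local navigation function.
print(revise_function("cloud navigation function", "location A"))
```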
  • the first semantic recognition result may indicate the first intent and the first parameter; when the first intent indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result, the first semantic recognition result satisfies the first preset condition.
  • the third semantic recognition list is further used to indicate that the first parameter corresponds to a first parameter type, and the method further includes: acquiring a second semantic recognition list, where the second semantic recognition list includes a plurality of intents and a plurality of parameter types corresponding to the plurality of intents; the first semantic recognition result satisfying the first preset condition further includes: the first semantic recognition result further includes a first intent, the first intent belongs to the plurality of intents, and in the second semantic recognition list, the first intent corresponds to the first parameter type.
  • the third semantic recognition list is further used to indicate that the first parameter corresponds to a first parameter type
  • the method further includes: acquiring a second semantic recognition list, where the second semantic recognition list includes a plurality of intents and a plurality of parameter types corresponding to the plurality of intents; the first semantic recognition result not satisfying the first preset condition further includes: the first semantic recognition result further includes a first intent, the first intent belongs to the plurality of intents, and in the second semantic recognition list, the first intent does not correspond to the first parameter type.
  • the first device may store the intent and the corresponding parameter type in advance.
  • the correspondences (or associations) of the plurality of intents and the plurality of parameter types may be stored in the second semantic recognition list.
  • the parameter type corresponding to the hardware startup intent may be time, temperature, hardware identification, and the like.
  • the parameter type corresponding to the route planning intent may be location, time, and the like.
  • the parameter type corresponding to the intent to play audio may be singer, song, playlist, time, audio playback mode, and the like.
  • the parameter type corresponding to the video playback intent may be a movie, a TV series, an actor, a time, a video playback mode, and the like.
  • the first device may record parameters and corresponding parameter types in advance.
  • the correspondence or association of the plurality of parameters and the plurality of parameter types may be stored in the third semantic recognition list.
  • the correspondence between multiple parameters and multiple parameter types has been described above, and will not be described in detail here.
  • the user can realize a certain type of intention of the first device through a variety of voice information.
  • the slots in the voice information are usually filled with parameters of a limited set of types.
  • the number of parameters corresponding to one parameter type can be any number.
  • the user instructs the first device to perform an operation related to the path planning intent through a voice instruction.
  • the speech instructions can be converted into semantic recognition results related to the path planning intent.
  • the parameter type corresponding to the slot of the semantic recognition result may be, for example, parameter types such as position and time.
  • the parameter type corresponding to the slot of the semantic recognition result may not be temperature, singer, song, playlist, audio playback mode, movie, TV series, actor, video playback mode, and so on.
  • the user instructs the first device to perform an operation related to the hardware activation intention through a voice instruction.
  • the voice command can be converted into a semantic recognition result related to the hardware-on intent.
  • the parameter type corresponding to the slot of the semantic recognition result may be, for example, parameter types such as time, temperature, and hardware identification.
  • the parameter type corresponding to the slot of the semantic recognition result may not be location, singer, song, playlist, audio playback mode, movie, TV series, actor, video playback mode, or the like.
  • the first semantic recognition result may include a first intent and a first parameter.
  • the first device may determine a corresponding operation according to the first semantic recognition result obtained by itself.
  • the second semantic recognition list may include the path planning intent, and in the second semantic recognition list, the parameter type corresponding to the path planning intent may include location; the third semantic recognition list may include location A and location B, and in the third semantic recognition list, the parameter types corresponding to location A and location B may both be location.
  • the user may instruct the first device to perform an operation related to the route planning intention through a voice instruction "navigate to location A, and approach location B".
  • the voice instruction may be converted into a first semantic recognition result, and the first intent of the first semantic recognition result may be a path planning intent; the first parameter of the first semantic recognition result may include position A and position B.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition according to the second semantic recognition list and the third semantic recognition list.
  • the first device may determine the first operation to be performed by the first device according to the first semantic recognition result.
  • if the first device has obtained in advance that the first parameter corresponds to the first parameter type but the first intent does not correspond to the first parameter type, the accuracy of the first device's analysis of the first voice information may be relatively low.
  • the specific meaning of the first parameter may be recognized wrongly.
  • if the first device only determines the corresponding operation according to the first semantic recognition result obtained by itself, an error in the voice response may occur.
  • the first device may choose to perform the operation instructed by the second device in response to the user's voice command.
  • the second semantic recognition list may include the intent to play audio
  • in the second semantic recognition list, the parameter type corresponding to the play audio intent may include singer
  • the third semantic recognition list may include actor A
  • the parameter type corresponding to actor A can be actor.
  • actor A may also have the identity of a singer; that is, actor A is not only an actor but also a singer.
  • however, in the third semantic recognition list, the parameter type corresponding to actor A does not include singer.
  • the user may instruct the first device to perform an operation related to the intention of playing the audio through the voice instruction "play actor A's song".
  • the first device may identify actor A as an actor.
  • the first intent of the first semantic recognition result may be the intent to play audio; the first parameter of the first semantic recognition result may be actor A.
  • the first device may determine that the first semantic recognition result does not satisfy the first preset condition according to the second semantic recognition list and the third semantic recognition list.
  • the first device may determine to perform the second operation instructed by the second device.
  • the first device can also revise the semantic recognition result: the first device determines a semantic recognition result (referred to as the third semantic recognition result for distinction) according to the first voice information, where the third semantic recognition result indicates the second intent and the third parameter (which may be the same parameter as the first parameter above); in the case that the preset multiple intents do not include the second intent and the preset multiple parameters include the third parameter, the first device modifies the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, where the first intent and the second intent are two different intents of the same type.
  • the determining a first semantic recognition result according to the first voice information includes: determining a third semantic recognition result according to the first voice information, where the third semantic recognition result includes the second intent and the first parameter; if the second semantic recognition list does not include the second intent, modifying, according to the third semantic recognition list, the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, where the first intent and the second intent are two different intents of the same type.
  • the first device may recognize the first voice information to obtain a third semantic recognition result.
  • the third semantic recognition result may be, for example, an initial result obtained after semantic recognition.
  • if the second intent of the third semantic recognition result does not belong to the second semantic recognition list, it means that the first device may have a relatively weak semantic recognition capability for the second intent.
  • the first device may have learning, training capabilities, for example.
  • the first device may learn the voice instructions related to the second intent multiple times, thereby gradually improving the semantic recognition capability for the second intent.
  • some voice instructions related to the second intent may not be recognized by the first device at first, but as the first device learns and trains on the voice instructions related to the second intent, those voice instructions can gradually be recognized by the first device.
  • the number of voice instructions related to the second intent that can be accurately recognized by the first device may gradually increase.
  • the types of parameters corresponding to the second intent may be relatively diverse, and the first device may not be able to completely store all parameters corresponding to the second intent. That is, the probability that the first voice information indicating the second intention can be accurately recognized by the first device may be relatively low.
  • if the first parameter belongs to the third semantic recognition list, it means that the first device may have previously learned the first parameter. Then, the first device may modify the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result.
  • the first semantic recognition result may be a revised semantic recognition result.
  • the first intent and the second intent should be two different intents of the same type.
  • the first intent may be a local translation English intent
  • the second intent may be a cloud translation English intent. Both the first intent and the second intent may belong to the intent of translating English.
  • the first intent may be a local path planning intent
  • the second intent may be a cloud path planning intent. Both the first intent and the second intent may belong to a path planning type of intent.
  • the first intent may be an intent to play local audio
  • the second intent may be an intent to play cloud audio. Both the first intent and the second intent may belong to the play audio type intent.
  • the first intent may be an intent to play a local video
  • the second intent may be an intent to play a cloud video. Both the first intent and the second intent may belong to playing video type intents.
  • the second semantic recognition list may include the play local audio intent but not the play cloud audio intent, and in the second semantic recognition list, the parameter type corresponding to the play local audio intent may include singer; it is assumed that the third semantic recognition list may include singer A, and in the third semantic recognition list, the parameter type corresponding to singer A may be singer.
  • the user may instruct the first device to perform an operation related to the intention of playing the audio through the voice instruction "play singer A's song".
  • the voice instruction may be converted into a third semantic recognition result, and the second intention of the third semantic recognition result may be the intention to play cloud audio; the first parameter of the third semantic recognition result may be singer A.
  • the first device can modify the second intent of the third semantic recognition result to the play local audio intent to obtain the first semantic recognition result, where the first intent of the first semantic recognition result is the play local audio intent.
  • both the play local audio intent and the play cloud audio intent belong to the play audio type, and they can be two different intents.
  • the first device may determine whether the first semantic recognition result satisfies the first preset condition according to the second semantic recognition list and the third semantic recognition list.
  • the first device can determine that the first semantic recognition result satisfies the first preset condition.
  • the first device may determine the first operation to be performed by the first device according to the first semantic recognition result.
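The intent revision mirrors the function revision sketched earlier, now keyed on the second semantic recognition list; the names and list contents below are again illustrative assumptions.

```python
# Hypothetical intent revision: swap an unknown intent for the
# same-type local intent when the parameter itself is well known.
SECOND_LIST = {"play local audio intent": {"singer"}}   # intent -> parameter types
THIRD_LIST = {"singer A": "singer"}                     # parameter -> parameter type
SAME_TYPE_LOCAL_INTENT = {"play cloud audio intent": "play local audio intent"}

def revise_intent(intent: str, parameter: str) -> str:
    """Return the (possibly revised) intent of the first semantic result."""
    if intent not in SECOND_LIST and parameter in THIRD_LIST:
        return SAME_TYPE_LOCAL_INTENT.get(intent, intent)
    return intent

print(revise_intent("play cloud audio intent", "singer A"))  # play local audio intent
```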
  • the first semantic recognition result satisfying the first preset condition includes: the first semantic recognition result includes a first indicator bit, and the first indicator bit indicates that the first semantic recognition result satisfies the first preset condition.
  • the first semantic recognition result does not meet the first preset condition, comprising: the first semantic recognition result includes a second indication bit, and the second indication bit indicates the first semantic recognition result The first preset condition is not met.
  • the first device may determine to perform the first operation determined by the first device according to the first semantic recognition result.
  • the specific content of the first indication bit may include information such as "local”, “end-side”, etc., to instruct the first apparatus itself to determine the operation to be performed.
  • the first device may determine to perform the second operation indicated by the second device.
  • the specific content of the second indication bit may include information such as "cloud”, “cloud side”, etc., to instruct the first device to perform an operation according to the instruction of the second device.
  • if the first semantic recognition result includes the second indication bit, the first device may determine to execute the second operation indicated by the second device.
  • if the first semantic recognition result includes the first indication bit, the first device may determine to execute the first operation determined by the first device according to the first semantic recognition result.
  • the fourth semantic recognition result may indicate at least two of the first function, the first intent, and the first parameter.
  • the fourth semantic recognition result may indicate the first function and the first parameter; then, when the first function indicated by the fourth semantic recognition result corresponds to the same parameter type as the first parameter indicated by the fourth semantic recognition result, the first semantic recognition result carrying the first indication bit can be determined; when the first function indicated by the fourth semantic recognition result and the first parameter indicated by the fourth semantic recognition result correspond to different parameter types, the first semantic recognition result carrying the second indication bit can be determined.
  • the fourth semantic recognition result may indicate the first intent and the first parameter; then, when the first intent indicated by the fourth semantic recognition result corresponds to the same parameter type as the first parameter indicated by the fourth semantic recognition result, the first semantic recognition result carrying the first indication bit can be determined; when the first intent indicated by the fourth semantic recognition result and the first parameter indicated by the fourth semantic recognition result correspond to different parameter types, the first semantic recognition result carrying the second indication bit can be determined.
  • the determining a first semantic recognition result according to the first voice information includes: determining a fourth semantic recognition result according to the first voice information, where the fourth semantic recognition result includes the first function and the first parameter; if the fourth semantic recognition list includes the first function, the fourth semantic recognition list indicates that the first function corresponds to the first parameter type, the third semantic recognition list includes the first parameter, and the third semantic recognition list indicates that the first parameter corresponds to the first parameter type, the first semantic recognition result is determined, where the first semantic recognition result includes the first indication bit.
  • the first device may recognize the first voice information to obtain a fourth semantic recognition result.
  • the fourth semantic recognition result may be, for example, an initial result obtained after semantic recognition.
  • if the first function of the fourth semantic recognition result does not belong to the first semantic recognition list, it means that the first device may have a relatively weak semantic recognition capability for the first function.
  • the first device may have learning, training capabilities, for example.
  • the first device may learn the voice instructions related to the first function multiple times, thereby gradually improving the semantic recognition ability for the first function.
  • some voice instructions related to the first function may not be recognized by the first device at first, but as the first device learns and trains on the voice instructions related to the first function, those voice instructions can gradually be recognized by the first device.
  • the number of voice instructions related to the first function that can be accurately recognized by the first device may gradually increase.
  • the fourth semantic recognition list may be used, for example, to record the functions newly learned (or trained, updated, etc.) of the first device.
  • the parameter types corresponding to the first function may be relatively diverse, and the first device may not be able to completely store all parameters corresponding to the first function. That is, the probability that the first voice information indicating the first function can be accurately recognized by the first device may be relatively low.
  • if the first parameter belongs to the third semantic recognition list, it means that the first device may have previously learned the first parameter.
  • if the first function corresponds to the first parameter type and the first parameter corresponds to the first parameter type, the probability that the first device accurately recognizes the first voice information is relatively higher.
  • the first device may determine the first semantic recognition result, and indicate that the first semantic recognition result satisfies the first preset condition through the first indication bit in the first semantic recognition result.
  • the first semantic recognition result may be a revised semantic recognition result.
  • the first semantic recognition list may not include a navigation function; the fourth semantic recognition list may include a navigation function, and in the fourth semantic recognition list, the parameter type corresponding to the navigation function may include location; it is assumed that the third semantic recognition list may include location A, and in the third semantic recognition list, the parameter type corresponding to location A may be location.
  • the user may instruct the first device to perform an operation related to the navigation function through a voice instruction "navigate to location A".
  • the voice instruction may be converted into a fourth semantic recognition result, and the first function of the fourth semantic recognition result may be a navigation function; the first parameter of the fourth semantic recognition result may be location A.
  • the first device can determine the first semantic recognition result according to the fourth semantic recognition result, the first semantic recognition list, the third semantic recognition list, and the fourth semantic recognition list, and the first semantic recognition result may include the first indication bit.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition.
  • the first device may determine the first operation to be performed by the first device according to the first semantic recognition result.
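  • for illustration only, the list-based check described above can be sketched in Python as follows; the data structures, names, and values are hypothetical, since the patent does not prescribe an implementation, and the same shape applies to the intent-based variant described next.

```python
# Hypothetical sketch: a newly learned function satisfies the first preset
# condition only when the parameter type the fourth semantic recognition
# list expects for the function matches the parameter type the third
# semantic recognition list records for the parameter.

def check_first_preset_condition(result, fourth_list, third_list):
    expected_type = fourth_list.get(result["function"])   # type the function needs
    recorded_type = third_list.get(result["parameter"])   # type learned for the parameter
    satisfied = expected_type is not None and expected_type == recorded_type
    # indication bit 1: condition met; indication bit 2: condition not met
    return {**result, "indication_bit": 1 if satisfied else 2}

fourth_list = {"navigation function": "location"}   # newly learned functions
third_list = {"location A": "location"}             # learned parameters
revised = check_first_preset_condition(
    {"function": "navigation function", "parameter": "location A"},
    fourth_list, third_list)
assert revised["indication_bit"] == 1  # first indication bit: condition met
```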
  • the method further includes: acquiring a fifth semantic recognition list, where the fifth semantic recognition list includes multiple intents and multiple parameter types corresponding to the multiple intents; acquiring a third semantic recognition list, where the third semantic recognition list includes multiple parameters and multiple parameter types corresponding to the multiple parameters; the determining the first semantic recognition result according to the first voice information includes: determining a fifth semantic recognition result according to the first voice information, where the fifth semantic recognition result includes a first intent and a first parameter; when the fifth semantic recognition list includes the first intent, the fifth semantic recognition list indicates that the first intent corresponds to the first parameter type, the third semantic recognition list includes the first parameter, and the third semantic recognition list indicates that the first parameter corresponds to the first parameter type, the first semantic recognition result is determined, and the first semantic recognition result includes the first indication bit.
  • the first device may recognize the first voice information to obtain a fifth semantic recognition result.
  • the fifth semantic recognition result may be, for example, an initial result obtained after semantic recognition.
  • if the first intent of the fifth semantic recognition result does not belong to the second semantic recognition list, it means that the first device may have a relatively weak semantic recognition capability for the first intent.
  • the first device may have learning and training capabilities, for example.
  • the first device may learn the voice instructions related to the first intent multiple times, thereby gradually improving the semantic recognition capability for the first intent.
  • some voice instructions related to the first intent may not be recognized by the first device at first, but as the first device learns and trains on the voice instructions related to the first intent, these voice instructions can gradually be recognized by the first device.
  • the number of voice instructions related to the first intent that the first device recognizes, and recognizes accurately, may gradually increase.
  • the fifth semantic recognition list may, for example, be used to record the intents newly learned (or trained, updated, etc.) by the first device.
  • the types of parameters corresponding to the first intent may be relatively diverse, and the first device may not be able to completely store all parameters corresponding to the first intent. That is, the probability that the first voice information indicating the first intention can be accurately recognized by the first device may be relatively low.
  • if the first parameter belongs to the third semantic recognition list, it means that the first device may have previously learned the first parameter.
  • when the first intent corresponds to the first parameter type, and the first parameter also corresponds to the first parameter type, the probability that the first device accurately recognizes the first voice information is relatively higher.
  • the first device may determine the first semantic recognition result, and indicate that the first semantic recognition result satisfies the first preset condition through the first indication bit in the first semantic recognition result.
  • the first semantic recognition result may be a revised semantic recognition result.
  • the second semantic recognition list may not include the intent to play audio; the fifth semantic recognition list may include the intent to play audio, and in the fifth semantic recognition list, the parameter type corresponding to the intent to play audio may include singer; it is assumed that the third semantic recognition list may include singer A, and in the third semantic recognition list, the parameter type corresponding to singer A may be singer.
  • the user may instruct the first device to perform an operation related to the intent of playing audio through a voice instruction such as "play singer A".
  • the voice instruction may be converted into a fifth semantic recognition result, and the first intent of the fifth semantic recognition result may be an intent to play audio; the first parameter of the fifth semantic recognition result may be singer A.
  • the first device can determine the first semantic recognition result according to the fifth semantic recognition result, the second semantic recognition list, the third semantic recognition list, and the fifth semantic recognition list, and the first semantic recognition result may include the first indication bit.
  • the first device may determine that the first semantic recognition result satisfies the first preset condition.
  • the first device may determine the first operation to be performed by the first device according to the first semantic recognition result.
  • the third semantic recognition list may include the first parameter, and in the third semantic recognition list, the first parameter may correspond to the first parameter type.
  • if the third semantic recognition list does not include the first parameter, or if, in the third semantic recognition list, the first parameter corresponds to a parameter type other than the first parameter type, the first semantic recognition result recognized by the first device for the first voice information may be relatively inaccurate.
  • the first device may acquire a sixth semantic recognition result from the second device, and the sixth semantic recognition result may be determined by the second device according to the first voice information.
  • the first device may determine the second parameter and the second parameter type according to the sixth semantic recognition result indicated by the second device, and save the association relationship between the second parameter and the second parameter type.
  • the method further includes: determining a second parameter and a second parameter type according to the sixth semantic recognition result; and recording the association relationship between the second parameter and the second parameter type in the third semantic recognition list.
  • the sixth semantic recognition result may include the second parameter and the second parameter type.
  • the first apparatus may record the association relationship between the second parameter and the second parameter type in the third semantic recognition list.
  • when the first device encounters a voice instruction related to the second parameter and the second parameter type again, since the third semantic recognition list includes the association relationship between the second parameter and the second parameter type, the first device can more easily and accurately recognize the user's voice instruction.
  • the method further includes: acquiring N semantic recognition results from the second device; and the recording the association relationship between the second parameter and the second parameter type in the third semantic recognition list includes: when each of the N semantic recognition results includes the second parameter, the second parameter of each semantic recognition result corresponds to the second parameter type, and N is greater than the first preset threshold, recording the association relationship between the second parameter and the second parameter type in the third semantic recognition list.
  • the first device may record the association relationship between the second parameter and the second parameter type in the third semantic recognition list only after learning it a relatively large number of times. For the relationship between a parameter and a parameter type, a single speech recognition result may not be very representative. By learning the association relationship between the second parameter and the second parameter type multiple times, the recorded association relationship can be relatively accurate or relatively important, thereby helping to improve the accuracy with which the first device recognizes voice instructions.
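  • as an illustration, this threshold-gated learning might look like the following sketch; the class, the threshold value, and the data layout are assumptions for illustration, not the claimed method.

```python
from collections import Counter

FIRST_PRESET_THRESHOLD = 3  # hypothetical value of the first preset threshold

class AssociationLearner:
    """Records a parameter/parameter-type association in the third semantic
    recognition list only after it appears in more than N cloud-side results."""

    def __init__(self):
        self.counts = Counter()
        self.third_list = {}  # parameter -> parameter type

    def observe(self, parameter, parameter_type):
        self.counts[(parameter, parameter_type)] += 1
        if self.counts[(parameter, parameter_type)] > FIRST_PRESET_THRESHOLD:
            self.third_list[parameter] = parameter_type

learner = AssociationLearner()
for _ in range(4):  # four consistent cloud-side results for the same pair
    learner.observe("location A", "location")
assert learner.third_list == {"location A": "location"}
```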
  • FIG. 7 is a schematic flowchart of a voice interaction method 700 provided by an embodiment of the present application.
  • the method 700 shown in FIG. 7 can be applied to, for example, the voice interaction system 500 shown in FIG. 5 .
  • the first device performs a voice wake-up operation in response to a user's input operation.
  • the first device acquires first voice information from a voice sensor.
  • the first voice information may be indicated by the user in the first round of voice interaction.
  • the first round of voice interaction may be the initial round or a later round in multiple rounds of voice interaction, but the first round of voice interaction may not be the last round in the multiple rounds of voice interaction.
  • the first device determines a first semantic recognition result according to the first voice information.
  • the first device determines to execute the first operation determined by the first device according to the first semantic recognition result, or determines to execute the second operation indicated by the second device.
  • for specific implementations of 701 to 704, reference may be made to 601, 602, 603a, and 603b shown in FIG. 6; details are not described herein again.
  • the first device may perform the first operation or the second operation. That is, the execution of the first operation or the second operation by the first device may mean the end of the first round of voice interaction, or the start of the next round of voice interaction.
  • the first device acquires the second voice information from the voice sensor.
  • the second voice information may be indicated by the user in the second round of voice interaction.
  • the second round of voice interaction may be the next round of voice interaction after the first round of voice interaction.
  • the first device determines a seventh semantic recognition result according to the second speech information.
  • Multi-round voice interaction can be applied to scenarios with more voice interaction information.
  • the user and the first device can conduct voice dialogue with respect to a special scenario or a special field.
  • the user may not be able to completely achieve the purpose of voice control through one voice command.
  • the user can control the first device to purchase a plane ticket from city A to city B.
  • the number of flights from city A to city B can be very high.
  • the user may interact with the first device multiple times by voice to finally determine the flight information to be purchased.
  • the first device may ask the user which time period of the flight can be preferentially purchased, and the user may reply to the first device for a certain time period.
  • the first device may ask the user which singer's work is preferred to listen to, and the user may reply to the first device the name of a certain singer.
  • the second voice information may be related to the operation indicated by the first voice information. In this case, the first device may determine to perform the operation indicated by the second voice information to continue the current multiple rounds of voice interaction.
  • user responses may be somewhat random.
  • the user's reply may not be related to the first device query or the voice information to be acquired.
  • the first device asks the user which time period flight can be purchased first, but the user replies to the first device the name of a certain singer.
  • the first device asks the user in which time period the flight can be purchased first, but the user replies to the first device that a car accident is currently occurring.
  • if the content of the previous voice interaction is invalidated, the number of times the user has to manipulate the first device may increase.
  • the user may have indicated to the first device the flight information to be purchased, the user's identity information, etc., and may then utter an irrelevant voice message. If the first device ends the multiple rounds of voice interaction about purchasing plane tickets and responds to the user's irrelevant instruction, this not only fails to meet the user's expectation, but also invalidates the flight information and identity information previously indicated by the user. If the user wants to repurchase the plane ticket, the user needs to indicate the flight information, identity information, etc. to the first device again.
  • conversely, if the first device rigidly continues the current multiple rounds of voice interaction, the first device may fail to respond to the user's instruction in some special scenarios, so that the user's voice instruction becomes ineffective.
  • the first device can choose whether to end the multiple rounds of voice interaction according to the situation. For example, in the case that the first voice information and the second voice information are irrelevant, or their correlation is smaller than the second preset threshold, or in the case that the second voice information is not feedback for the first operation or the second operation, in order to satisfy the user's voice control experience, the first device may determine whether to currently end the multiple rounds of voice interaction according to the second preset condition.
  • the second preset condition may be a preset condition for ending multiple rounds of voice interaction. For example, when the second preset condition is satisfied, the first device can end multiple rounds of voice interaction and start a new voice interaction; when the second preset condition is not met, the first device can continue the current multi-round voice interaction.
  • the first device may determine whether the second voice information is related to the operation indicated by the first voice information. For example, the first operation is used to ask the user question A, and the second voice information is a reply to the question A.
  • the second voice information is related to the operation indicated by the first voice information, for example, it may include that the first voice information is related to the second voice information, or the degree of correlation between the first voice information and the second voice information is higher than a second preset threshold.
  • examples in which the second voice information is related to the operation indicated by the first voice information include: the same function, the same intent, or the same parameter.
  • the second voice information may have nothing to do with the operation indicated by the first voice information, or the correlation between the second voice information and the operation indicated by the first voice information may be lower than the second preset threshold; or the first voice information may be irrelevant to the second voice information, or the degree of correlation between the first voice information and the second voice information may be lower than the second preset threshold.
  • in this case, the first device may need to end the current multiple rounds of voice interaction, or may need to repeat the operation of the previous round of voice interaction to continue the current multiple rounds of voice interaction.
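  • one possible way to realize the relatedness comparison is sketched below; the correlation measure and the value of the second preset threshold are hypothetical, since the text leaves both unspecified.

```python
SECOND_PRESET_THRESHOLD = 0.5  # hypothetical threshold value

def correlation(first: dict, second: dict) -> float:
    """Fraction of function/intent/parameter fields on which two semantic
    recognition results agree (a stand-in for the unspecified measure)."""
    keys = ("function", "intent", "parameter")
    matches = sum(1 for k in keys if first.get(k) and first.get(k) == second.get(k))
    return matches / len(keys)

def is_related(first: dict, second: dict) -> bool:
    return correlation(first, second) >= SECOND_PRESET_THRESHOLD

prior = {"function": "cloud navigation function", "intent": "path planning"}
reply = {"function": "cloud navigation function", "intent": "path planning",
         "parameter": "position C-1"}
assert is_related(prior, reply)  # continue the current multi-round interaction
```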
  • the first device determines to perform the operation indicated by the first voice information according to the seventh semantic recognition result and the second preset condition.
  • the first device determines to perform the third operation determined by the first device according to the seventh semantic recognition result.
  • the first device determines to perform the fourth operation instructed by the second device according to the seventh semantic recognition result and the second preset condition.
  • the first device may still perform the operation in the previous round of voice interaction.
  • the operation indicated by the first voice information may be the first operation or the second operation.
  • the first apparatus may still determine to perform the first operation.
  • the current multi-round voice interaction may be the terminal-side multi-round voice interaction.
  • the operation performed by the first device in 704 is the second operation, the first device may still determine to perform the second operation in 707a.
  • the current multi-rounds of voice interaction may be cloud-side multi-rounds of voice interaction.
  • the first device obtains the second voice information, and may determine to perform the operation indicated by the second voice information.
  • the first device may determine to perform the third operation determined by the first device according to the seventh semantic recognition result; in 707c, the first device may determine to perform the fourth operation instructed by the second device.
  • for the third operation performed by the first apparatus, reference may be made to 603a shown in FIG. 6; details are not described herein again.
  • for the fourth operation performed by the first apparatus, reference may be made to 603b shown in FIG. 6; details are not described herein again.
  • the first device determines in 707b to perform the third operation.
  • the new device-side voice interaction can end the previous round of device-side voice interaction.
  • the first device determines in 707c to perform the fourth operation.
  • the new cloud-side voice interaction can end the previous round of terminal-side voice interaction.
  • the first device determines in 707b to perform the third operation.
  • the new device-side voice interaction can end the previous round of cloud-side voice interaction.
  • the first device determines in 707c to perform the fourth operation.
  • the new cloud-side voice interaction can end the previous round of cloud-side voice interaction.
  • the first preset condition may be used to indicate whether the first device determines a corresponding operation according to the semantic recognition result determined by itself;
  • the second preset condition may be used to indicate whether the first device ends multiple rounds of voice interaction.
  • the first device may determine an operation to be performed by the first device for the second voice information according to the seventh semantic recognition result, the first preset condition, and the second preset condition.
  • the seventh semantic recognition result may not satisfy the second preset condition, which means that the first device may not end multiple rounds of voice interaction, and determine to execute the operation indicated by the first voice information, that is, execute the above-mentioned 707a.
  • the first device may judge whether the seventh semantic recognition result satisfies the first preset condition, or may not judge whether the seventh semantic recognition result satisfies the first preset condition.
  • the seventh semantic recognition result may satisfy the second preset condition and may satisfy the first preset condition.
  • the seventh semantic recognition result satisfies the second preset condition, which means that the first device can end multiple rounds of voice interaction and determine to perform the operation indicated by the second voice information.
  • the seventh semantic recognition result satisfies the first preset condition, which means that the first device can determine to perform the operation determined by the first device according to the seventh semantic recognition result, that is, perform the above-mentioned 707b.
  • the seventh semantic recognition result may satisfy the second preset condition and may not satisfy the first preset condition.
  • the seventh semantic recognition result satisfies the second preset condition, which means that the first device can end multiple rounds of voice interaction and determine to perform the operation indicated by the second voice information.
  • the seventh semantic recognition result does not satisfy the first preset condition, which means that the first device can determine to perform the operation instructed by the second device, that is, perform the above-mentioned 707c.
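  • the branching among 707a, 707b, and 707c can be condensed as follows; boolean inputs stand in for the two condition checks, so this is a summary sketch rather than an implementation.

```python
def decide(meets_second_condition: bool, meets_first_condition: bool) -> str:
    """Chooses among the three branches described above."""
    if not meets_second_condition:
        return "707a"  # continue: repeat the operation indicated by the first voice information
    if meets_first_condition:
        return "707b"  # end: perform the third operation determined by the first device
    return "707c"      # end: perform the fourth operation instructed by the second device

assert decide(False, True) == "707a"
assert decide(True, True) == "707b"
assert decide(True, False) == "707c"
```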
  • the method further includes: sending the second voice information to the second device; and the determining to perform the fourth operation instructed by the second device according to the seventh semantic recognition result and the second preset condition includes: when the seventh semantic recognition result does not satisfy the first preset condition, or when the seventh semantic recognition result does not satisfy the second preset condition and the first semantic recognition result does not satisfy the first preset condition, obtaining the eighth semantic recognition result from the second device; and, according to the eighth semantic recognition result and the second preset condition, determining to execute the operation indicated by the first voice information, or determining to execute the fourth operation.
  • the seventh semantic recognition result recognized by the first device may be relatively inaccurate, and the eighth semantic recognition result recognized by the second device may be relatively accurate.
  • the first device may determine whether the eighth semantic recognition result satisfies the second preset condition, and then determine whether to end the current multiple rounds of voice interaction.
  • the first semantic recognition result may satisfy the first preset condition, which may mean that the current multi-round voice interaction may be terminal-side multi-round voice interaction. If the seventh semantic recognition result does not satisfy the first preset condition, and the eighth semantic recognition result satisfies the second preset condition, it may mean that the current multiple rounds of voice interaction on the device side can be ended, and the new voice interaction can be a cloud side voice interaction.
  • the first apparatus may not determine the operation to be performed according to the seventh semantic recognition result.
  • if the first semantic recognition result does not satisfy the first preset condition, it may mean that the current multiple rounds of voice interaction belong to cloud-side voice interaction.
  • the first device may acquire the eighth semantic recognition result from the second device to continue the current cloud-side multi-round voice interaction. The first device may determine whether the eighth semantic recognition result satisfies the second preset condition, and then determine whether to end the current cloud-side multi-round voice interaction.
  • the eighth semantic recognition result may not satisfy the second preset condition, which means that the first device may not end multiple rounds of voice interaction, that is, determine to execute the operation indicated by the first voice information, that is, execute the above-mentioned 707a.
  • the eighth semantic recognition result can satisfy the second preset condition, which means that the first device can end multiple rounds of voice interaction, that is, it is determined to perform the operation instructed by the second device, that is, the above-mentioned 707c is performed.
  • assume that the priority of a semantic recognition result related to the operation indicated by the first voice information is the first priority, and that the priority of a semantic recognition result unrelated to the operation indicated by the first voice information is the second priority.
  • if the priority of the seventh semantic recognition result is the first priority, it may be considered that the second voice information is related to the operation indicated by the first voice information, and the first device may determine to perform the operation indicated by the second voice information. If the priority of the seventh semantic recognition result is the second priority, it may be considered that the second voice information has nothing to do with the operation indicated by the first voice information. If the second priority is higher than the first priority, the first device may end the current multiple rounds of voice interaction, and may determine to perform the operation indicated by the second voice information. If the second priority is lower than the first priority, the first device may continue the current multiple rounds of voice interaction, and may determine to perform the operation indicated by the first voice information.
  • if the priority of the eighth semantic recognition result is the first priority, it may be considered that the second voice information is related to the operation indicated by the first voice information, and the first device may determine to perform the operation instructed by the second device. If the priority of the eighth semantic recognition result is the second priority, it may be considered that the second voice information is irrelevant to the operation indicated by the first voice information. If the second priority is higher than the first priority, the first device may end the current multiple rounds of voice interaction, and may determine to perform the operation instructed by the second device. If the second priority is lower than the first priority, the first device may continue the current multiple rounds of voice interaction, and may determine to perform the operation indicated by the first voice information.
  • the seventh semantic recognition result satisfying the second preset condition includes: the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result.
  • the first device may compare the priority of the seventh semantic recognition result with the priority of the first semantic recognition result, and determine whether to end the current multiple rounds of voice interaction.
  • the first apparatus may determine to perform the first operation, or may determine to perform the second operation.
  • the first apparatus may execute 707b or 707c.
  • the determining to perform the second operation instructed by the second device according to the first semantic recognition result and the first preset condition includes: when the first semantic recognition result does not satisfy the first preset condition, acquiring a sixth semantic recognition result from the second device, and determining to perform the second operation instructed by the second device.
  • the seventh semantic recognition result satisfies the second preset condition, including: the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • the first device may obtain a sixth semantic recognition result from the second device to determine to perform the second operation instructed by the second device.
  • the first device may compare the priority of the seventh semantic recognition result with the priority of the sixth semantic recognition result, and determine whether to end the current multiple rounds of voice interaction.
  • the eighth semantic recognition result satisfying the second preset condition includes: the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result.
  • the first device may compare the priority of the eighth semantic recognition result with the priority of the first semantic recognition result, and determine whether to end the current multiple rounds of voice interaction.
  • the first apparatus may determine to perform the first operation, or may determine to perform the second operation.
  • the determining to perform the second operation instructed by the second device according to the first semantic recognition result and the first preset condition includes: when the first semantic recognition result does not satisfy the first preset condition, acquiring a sixth semantic recognition result from the second device, and determining to perform the second operation instructed by the second device.
  • the eighth semantic recognition result satisfies the second preset condition, comprising: the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • the first device may obtain a sixth semantic recognition result from the second device to determine to perform the second operation instructed by the second device.
  • the first device may compare the priority of the eighth semantic recognition result with the priority of the sixth semantic recognition result, and determine whether to end the current multiple rounds of voice interaction.
  • the user can end the current multiple rounds of voice interaction through a high-priority voice command.
  • the first device may record some high-priority instructions, for example. If the seventh semantic recognition result or the eighth semantic recognition result corresponds to a high-priority instruction, the first device may end the current multi-round speech interaction.
  • the priority of the seventh semantic recognition result being higher than the priority of the first semantic recognition result includes one or more of the following: the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result; the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result; the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • the first device may compare the priority of the third function with the priority of the first function, and/or compare the priority of the third intent with the priority of the first intent, and/or compare the priority of the third parameter with the priority of the first parameter, to determine whether to end the current multiple rounds of voice interaction. That is, the first device may record some high-priority functions, and/or high-priority intents, and/or high-priority parameters. If the first device identifies one or more of a high-priority function, a high-priority intent, and a high-priority parameter, the first device may end the current multiple rounds of voice interaction and determine to perform the operation indicated by the second voice information.
  • the seventh semantic recognition result includes a third function, and the third function may be a "safety control function".
  • the first semantic recognition result includes a first function, and the first function may be a "navigation function”.
  • the "safety control function" may take precedence over other functions, such as the "navigation function".
  • the first device can determine that the seventh semantic recognition result satisfies the second preset condition according to the priority of the third function and the priority of the first function, and then determine to execute the operation indicated by the second voice information.
  • the seventh semantic recognition result includes a third intent, and the third intent may be an "accident mode intent".
  • the first semantic recognition result includes a first intent, and the first intent may be "play audio intent”.
  • an "accident mode intent" may refer to a user's intent to enable the accident mode. Due to the urgency of accident safety, the "accident mode intent" can take precedence over other intents, such as the "play audio intent".
  • the first device may determine that the seventh semantic recognition result satisfies the second preset condition according to the priority of the third intent and the priority of the first intent, and then determine to execute the operation indicated by the second voice information.
  • the seventh semantic recognition result includes a third parameter, and the third parameter may be "door lock”.
  • the first semantic recognition result includes a first parameter, and the first parameter may be "song A”.
  • the user may instruct the first device to perform an operation related to "door lock” through the second voice information. Since the opening or closing of "door lock” relatively easily affects the driving safety and parking safety of the vehicle, the priority of "door lock” can be higher than other parameters, for example, higher than "song A”.
  • the first device can determine that the seventh semantic recognition result satisfies the second preset condition according to the priority of the third parameter and the priority of the first parameter, and then determine to execute the operation indicated by the second voice information.
  • the priority of the seventh semantic recognition result being higher than the priority of the sixth semantic recognition result includes one or more of the following: the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result; the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result; the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • the priority of the eighth semantic recognition result being higher than the priority of the first semantic recognition result includes one or more of the following: the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result; the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result; the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • the priority of the eighth semantic recognition result being higher than the priority of the sixth semantic recognition result includes one or more of the following: the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result; the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result; the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
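  • the field-by-field priority comparison can be sketched as follows; the priority tables and their numeric values are hypothetical, chosen only to mirror the examples above.

```python
# Hypothetical priority tables (larger number = higher priority).
FUNCTION_PRIORITY = {"safety control function": 2, "navigation function": 1}
INTENT_PRIORITY = {"accident mode intent": 2, "play audio intent": 1}
PARAMETER_PRIORITY = {"door lock": 2, "song A": 1}

def outranks(a: dict, b: dict) -> bool:
    """True if result a is higher priority than result b on any of the
    function, intent, or parameter fields."""
    tables = (("function", FUNCTION_PRIORITY),
              ("intent", INTENT_PRIORITY),
              ("parameter", PARAMETER_PRIORITY))
    return any(t.get(a.get(k), 0) > t.get(b.get(k), 0) for k, t in tables)

seventh = {"function": "safety control function"}
first = {"function": "navigation function"}
assert outranks(seventh, first)  # end the current multi-round voice interaction
```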
  • FIG. 8 is a schematic flowchart of a voice interaction method 800 provided by an embodiment of the present application.
  • the method 800 shown in FIG. 8 can be applied to, for example, the voice interaction system 500 shown in FIG. 5 .
  • the first device performs a voice wake-up operation in response to a user's input operation.
  • the first device acquires voice information a from a voice sensor.
  • the voice information a may correspond to the above-mentioned first voice information, for example.
  • the voice information a may be, for example, "turn on device A".
  • the first device sends voice information a to the second device.
  • the first device determines a semantic recognition result a according to the speech information a.
  • the semantic recognition result a may, for example, correspond to the first semantic recognition result above.
  • the semantic recognition result a may include, for example, a function a, an intent a, and a parameter a; wherein, the function a may be "car control function"; the intent a may be "open"; and the parameter a may be "device A".
  • the function a may correspond to the first function above, for example.
  • the intent a may, for example, correspond to the first intent above.
  • the parameter a may correspond to the first parameter above, for example.
  • the first device determines, according to the semantic recognition result a and the first preset condition, to execute the operation a determined by the first device according to the semantic recognition result a.
  • operation a may, for example, correspond to the first operation above.
  • the first device stores a semantic recognition list a (optionally, the semantic recognition list a may, for example, correspond to the first semantic recognition list above, or correspond to the general table above, and the general table may contain the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list above).
  • the multiple functions of the semantic recognition list a may include "car control function", the multiple intents of the semantic recognition list a may include "open", and the multiple parameters of the semantic recognition list a may include "device A".
  • the function "vehicle control function” may correspond to the intent "open”
  • the intent "open” may correspond to the parameter "device A".
  • the first device may determine that the semantic recognition result a satisfies the first preset condition according to the semantic recognition result a and the semantic recognition list a. Since the semantic recognition result a satisfies the first preset condition, the first device may determine to perform the operation a determined by the first device according to the semantic recognition result a. Operation a may be, for example, turning on device A.
  • the second device sends the semantic recognition result b to the first device.
  • the semantic recognition result b may correspond to the sixth semantic recognition result above, for example.
  • the semantic recognition result b may be the same as or similar to the semantic recognition result a, for example.
  • the semantic recognition result b may include, for example, function b, intent b, and parameter b; wherein, function b may be "car control function"; intent b may be "open”; parameter b may be "device A".
  • the parameter b may correspond to the second parameter above, for example.
  • the function b may correspond to the fourth function above, for example.
  • Intent b may, for example, correspond to the fourth intent above.
  • the parameter b may correspond to the fourth parameter above, for example.
  • the first device discards the semantic recognition result b from the second device.
  • since the first device may determine to perform the operation a determined by the first device according to the semantic recognition result a, the first device may ignore the instruction of the second device for the voice information a.
  • 807 may be an optional step.
  • the first device may determine, according to the existing semantic recognition list, whether the currently recognized semantic recognition result has a relatively high accuracy rate. When the currently recognized semantic recognition result is likely to have a relatively high accuracy rate, the first device may choose to determine the operation to be performed by itself, which is conducive to quickly and accurately responding to the user's voice instruction in scenarios that the first device is good at.
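  • the arbitration in method 800 might be sketched as follows, assuming local recognition and the cloud request run in parallel; all names are hypothetical illustrations rather than the claimed procedure.

```python
def arbitrate(local_result, cloud_result, satisfies_first_condition):
    """If the local result satisfies the first preset condition, act on it
    and ignore the cloud reply (805-807); otherwise defer to the cloud."""
    if satisfies_first_condition(local_result):
        return ("local", local_result)   # cloud reply will be discarded
    return ("cloud", cloud_result)

choice, result = arbitrate(
    {"function": "car control function", "intent": "open", "parameter": "device A"},
    None,  # the cloud reply may not have arrived yet
    lambda r: r["function"] == "car control function")
assert choice == "local"
```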
  • FIG. 9 is a schematic flowchart of a voice interaction method 900 provided by an embodiment of the present application.
  • the method 900 shown in FIG. 9 can be applied to, for example, the voice interaction system 500 shown in FIG. 5 .
  • the first device performs a voice wake-up operation in response to a user's input operation.
  • the first device acquires voice information b from a voice sensor.
  • the voice information b may correspond to the above-mentioned first voice information, for example.
  • the voice information b may be, for example, "navigate to location A".
  • the first device sends voice information b to the second device.
  • the first device determines a semantic recognition result c according to the speech information b.
  • the semantic recognition result c may, for example, correspond to the first semantic recognition result above.
  • the semantic recognition result c may include, for example, a function c, an intent c, and a parameter c; wherein, the function c may be "cloud navigation function"; the intent c may be "path planning"; and the parameter c may be "location B". "Location A" and "location B" may be the same or different.
  • the function c may, for example, correspond to the first function above.
  • the intent c may, for example, correspond to the first intent above.
  • the parameter c may, for example, correspond to the first parameter above.
  • the semantic recognition result c may include, for example, indication information a, where the indication information a is used to indicate that semantic recognition is not supported.
  • the first device determines to execute the operation b instructed by the second device according to the semantic recognition result c and the first preset condition.
  • operation b may, for example, correspond to the second operation above.
  • the first device stores a semantic recognition list b (optionally, the semantic recognition list b may, for example, correspond to the first semantic recognition list above, or correspond to the general table above, and the general table may contain the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list above).
  • the multiple functions of the semantic recognition list b may not include "cloud navigation function", and/or the multiple parameters of the semantic recognition list b may not include "location B", and/or, in the semantic recognition list b, the function "cloud navigation function" may not correspond to the parameter "location B".
  • the first device may determine, according to the semantic recognition result c and the semantic recognition list b, that the semantic recognition result c does not satisfy the first preset condition. Since the semantic recognition result c does not satisfy the first preset condition, the first device may determine to perform the operation b instructed by the second device.
  • the second device sends the semantic recognition result d for the speech information b to the first device.
  • the semantic recognition result d may correspond to the sixth semantic recognition result above, for example.
  • the semantic recognition result d may include, for example, a function d, an intent d, and a parameter d; wherein, the function d may be "cloud navigation function"; the intent d may be "path planning"; and the parameter d may be "location A".
  • the parameter d may correspond to the second parameter above, for example.
  • the function d may correspond to the fourth function above, for example.
  • Intent d may, for example, correspond to the fourth intent above.
  • the parameter d may correspond to the fourth parameter above, for example.
  • the first device performs operation b according to the semantic recognition result d.
  • Operation b may be, for example, a path plan to location A.
  • the example shown in FIG. 9 may further include the following steps.
  • the first device determines the parameter d and the parameter type a according to the semantic recognition result d, and records the association relationship between the parameter d and the parameter type a in the semantic recognition list b.
  • the parameter d may, for example, correspond to the second parameter above.
  • the parameter type a may, for example, correspond to the second parameter type above.
  • the semantic recognition list b may correspond to the third semantic recognition list above, for example.
  • the parameter type a may be "location".
  • the updated semantic recognition list b may include the following association relationship: "Location A"-"Location”.
  • the first device acquires the voice information c from the voice sensor.
  • the voice information c may correspond to the above-mentioned first voice information.
  • the voice information c may be, for example, "navigate to location A".
  • the first device sends voice information c to the second device.
  • the first device determines a semantic recognition result e according to the speech information c, where the semantic recognition result e includes a function c and a parameter d.
  • the semantic recognition result e may correspond to the second semantic recognition result above, for example.
  • Function c may, for example, correspond to the second function above.
  • the parameter d may, for example, correspond to the first parameter above.
  • the semantic recognition result e may include, for example, the function c, the intent c, and the parameter d; wherein, the function c may be "cloud navigation function"; the intent c may be "path planning"; and the parameter d may be "location A".
  • the first device modifies the function c in the semantic recognition result e to the function d according to the semantic recognition result e and the semantic recognition list b, and obtains the semantic recognition result f.
  • the function d and the function c are two different functions of the same type.
  • the function d may correspond to the first function above, for example.
  • the semantic recognition result f may, for example, correspond to the first semantic recognition result above.
  • the first device may modify the "cloud navigation function" in the semantic recognition result e to "local navigation function" to obtain the semantic recognition result f.
  • the semantic recognition result f may include, for example, a function d, an intent c, and a parameter d.
  • Function d may be "local navigation function”; intent c may be "path planning”; parameter d may be "location A”. Both "cloud navigation function” and “local navigation function” are navigation functions, and "cloud navigation function” and “local navigation function” are two different functions.
  • the function d may correspond to the first function above, for example.
  • the intent c may, for example, correspond to the first intent above.
  • the parameter d may, for example, correspond to the first parameter above.
  • the first device determines, according to the semantic recognition result f and the first preset condition, to execute the operation c determined by the first device according to the semantic recognition result f.
  • operation c may correspond to the first operation above.
  • the first device stores a semantic recognition list b (optionally, the semantic recognition list b may, for example, correspond to the first semantic recognition list above, or correspond to the general table above, and the general table may contain the first semantic recognition list, the second semantic recognition list, and the third semantic recognition list above).
  • the multiple functions of the semantic recognition list b may include a "local navigation function"; in the semantic recognition list b, the function "local navigation function" may correspond to the parameter type "location"; and, in the semantic recognition list c, the parameter "location A" may correspond to the parameter type "location".
  • the first device may determine that the semantic recognition result f satisfies the first preset condition according to the semantic recognition result f, the semantic recognition list b, and the semantic recognition list c. Since the semantic recognition result f satisfies the first preset condition, the first device may determine to perform the operation c determined by the first device according to the semantic recognition result f. Operation c may be, for example, route planning to location A.
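  • the revision of the semantic recognition result e into the result f might look like the sketch below; the mapping from cloud-only functions to local counterparts is an assumption used for illustration.

```python
# Hypothetical mapping between cloud-only functions and local counterparts
# of the same type.
LOCAL_COUNTERPART = {"cloud navigation function": "local navigation function"}

def revise(result: dict, third_list: dict) -> dict:
    """Rewrites a cloud-only function to its local counterpart once the
    parameter has been learned into the third semantic recognition list."""
    local_fn = LOCAL_COUNTERPART.get(result["function"])
    if local_fn and result["parameter"] in third_list:
        return {**result, "function": local_fn}
    return result

third_list = {"location A": "location"}
result_e = {"function": "cloud navigation function",
            "intent": "path planning", "parameter": "location A"}
result_f = revise(result_e, third_list)
assert result_f["function"] == "local navigation function"
```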
  • the first device discards the semantic recognition result g for the speech information c from the second device.
  • since the first device can determine to perform the operation c determined by the first device according to the semantic recognition result f, the first device can ignore the instruction of the second device for the voice information c.
  • 914 may be an optional step.
  • the first device may determine whether the currently recognized semantic recognition result has a relatively high accuracy rate according to the existing semantic recognition list. In the case that the currently recognized semantic recognition result may not have a relatively high accuracy rate, the first device may choose to perform an operation according to the instructions of other devices, thereby facilitating the first device to accurately respond to the user's voice instruction. In addition, the first device can also learn the user's voice command under the instruction of other devices, thereby helping to broaden the scenarios in which the first device can respond to the user's voice command by itself.
  • FIG. 10 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application. The method shown in FIG. 10 can be applied to, for example, the voice interaction system 500 shown in FIG. 5 .
  • the first device performs a voice wake-up operation in response to a user's input operation.
  • the first device acquires voice information d from a voice sensor.
  • the voice information d may correspond to the above-mentioned first voice information, for example.
  • the voice information d may be, for example, "navigate to location C".
  • the first device sends voice information d to the second device.
  • the first device determines a semantic recognition result h according to the speech information d.
  • the semantic recognition result h may correspond to the first semantic recognition result above, for example.
  • the first device determines to execute the operation d indicated by the second device according to the semantic recognition result h and the first preset condition.
  • operation d may correspond to the second operation above.
  • the second device sends the semantic recognition result i for the speech information d to the first device.
  • the semantic recognition result i may, for example, correspond to the sixth semantic recognition result above.
  • the first device performs operation d according to the semantic recognition result i.
  • operation d may be, for example, to broadcast the following content: multiple relevant destinations have been found; please select from position C-1, position C-2, and position C-3. Operation d can be used to query the exact navigation destination.
  • 1001 to 1007 may be the first round of cloud-side voice interaction.
  • the first device acquires the voice information e from the voice sensor.
  • the voice information e may correspond to the second voice information above, for example.
  • the voice information e may be irrelevant to the operation d.
  • the voice information e may be, for example, "turn on device A".
  • the first device sends voice information e to the second device.
  • the first device determines a semantic recognition result j according to the speech information e.
  • the semantic recognition result j may correspond to the seventh semantic recognition result above, for example.
  • the semantic recognition result j may indicate "open device A", for example.
  • the first device acquires the semantic recognition result k for the speech information e from the second device.
  • the semantic recognition result k may, for example, correspond to the eighth semantic recognition result above.
  • the first apparatus may determine not to perform the operation indicated by the semantic recognition result j.
  • the first device may acquire the semantic recognition result of the second device for the voice information e, so as to determine whether to perform the operation indicated by the voice information e.
  • the priority of the semantic recognition result j may be lower than the priority of the semantic recognition result h or the semantic recognition result i.
  • the first apparatus determines to perform operation d repeatedly.
  • the first apparatus may determine not to perform the operation indicated by the semantic recognition result k.
  • the first apparatus may repeatedly perform operation d, so that the first round of cloud-side voice interaction may be continuously performed.
  • the semantic recognition result k may have a lower priority than the semantic recognition result h or the semantic recognition result i.
  • the first device acquires voice information f from the voice sensor.
  • the voice information f may correspond to the second voice information above.
  • the voice information f and the operation d may be related.
  • the voice information f may be, for example, "position C-1, route position D".
  • the first device sends voice information f to the second device.
  • the first device determines a semantic recognition result m according to the speech information f.
  • the semantic recognition result m may, for example, correspond to the seventh semantic recognition result above.
  • the semantic recognition result m may, for example, indicate that the first preset condition is not satisfied.
  • the first device acquires a semantic recognition result n for the speech information f from the second device.
  • the semantic recognition result n may, for example, correspond to the eighth semantic recognition result above.
  • the first device may determine to perform the operation indicated by the second device, or repeat the previous round of operations.
  • the first device may acquire the semantic recognition result n of the voice information f by the second device, so as to determine whether to perform the operation indicated by the voice information f.
  • the semantic recognition result n being related to the operation d can mean that the speech information f is a reply to the previous round of voice interaction.
  • the first device may determine to perform the operation e instructed by the second device.
  • 1013 to 1017 may be the second round of cloud-side voice interaction.
  • the first round of cloud-side voice interaction and the second round of cloud-side voice interaction may be two rounds belonging to the same multi-round voice interaction.
  • the operation e may be, for example, to broadcast the following content: multiple relevant waypoints have been found; please select from position D-1, position D-2, and position D-3. Operation e can be used to inquire about the exact navigation route.
  • the first device acquires voice information g from the voice sensor.
  • the voice information g may correspond to the second voice information above.
  • the voice information g may be, for example, "the vehicle has an accident”.
  • the first device sends voice information g to the second device.
  • the first device determines a semantic recognition result p according to the speech information g.
  • the semantic recognition result p may, for example, correspond to the seventh semantic recognition result above.
  • the semantic recognition result p may include a function e and an intent e; wherein the function e may be "safety control function", and the intent e may be "enable accident mode".
  • operation f may correspond to the third operation above.
  • the semantic recognition result p satisfies the second preset condition, which means that the first device can end the current multi-round voice interaction on the cloud side.
  • the first apparatus may determine to perform the operation indicated by the semantic recognition result p.
  • the first device may acquire the semantic recognition result of the second device for the voice information g, so as to determine whether to perform the operation indicated by the voice information g.
  • the priority of the semantic recognition result p may be higher than the priority of the semantic recognition result n or the semantic recognition result m.
  • the second device sends the semantic recognition result q for the speech information g to the first device.
  • the semantic recognition result q may, for example, indicate that the speech information g does not match the operation d.
  • the first device discards the semantic recognition result q from the second device.
  • the first device may determine to perform the operation f determined by the first device according to the semantic recognition result p, and may ignore the instruction of the second device for the voice information g.
  • the first device may determine whether to end the current multiple rounds of voice interaction according to preset conditions related to multiple rounds of voice interaction. This is conducive to adaptively retaining the advantages of multiple rounds of voice interaction, and is conducive to responding to the user's relatively urgent and relatively important voice commands in a timely manner under special circumstances.
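  • The exit rule described in the preceding steps can be pictured as a small priority arbitration. The following is a minimal sketch, not the patent's implementation; the priority table, field names, and values are invented for illustration.

    # Minimal sketch of the multi-round exit decision (second preset condition).
    # Priority table and result fields are illustrative assumptions only.
    FUNCTION_PRIORITY = {"safety control": 2, "navigation": 1}  # higher = more urgent

    def should_end_multi_round(new_result: dict, pending_result: dict) -> bool:
        """True if a new utterance (e.g. 'the vehicle has an accident') should
        end the current multi-round interaction (e.g. waypoint selection)."""
        new_p = FUNCTION_PRIORITY.get(new_result.get("function"), 0)
        pending_p = FUNCTION_PRIORITY.get(pending_result.get("function"), 0)
        return new_p > pending_p  # strictly higher priority ends the dialogue

    # Mirrors the text: result p ("enable accident mode") vs result n (route reply).
    result_p = {"function": "safety control", "intent": "enable accident mode"}
    result_n = {"function": "navigation", "intent": "choose waypoint"}
    print(should_end_multi_round(result_p, result_n))  # True -> perform operation f at once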
  • FIG. 11 is a schematic structural diagram of an apparatus 1100 for voice interaction provided by an embodiment of the present application.
  • the apparatus 1100 includes an acquisition unit 1101 and a processing unit 1102 .
  • the apparatus 1100 may be used to perform each step of the voice interaction method provided by the embodiments of the present application.
  • the acquiring unit 1101 can be used to execute 602 in the method 600 shown in FIG. 6
  • the processing unit 1102 can be used to execute 603 a in the method 600 shown in FIG. 6
  • the apparatus 1100 further includes a sending unit, and the sending unit may be configured to execute 603b in the method 600 shown in FIG. 6 .
  • the acquiring unit 1101 may be configured to perform steps 702 and 705 in the method 700 shown in FIG. 7
  • the processing unit 1102 may be configured to perform steps 703 , 704 , 706 , and 707 in the method 700 shown in FIG. 7 .
  • the obtaining unit 1101 is configured to obtain the first voice information from the voice sensor.
  • the processing unit 1102 is configured to determine, according to the first voice information, to execute the target operation indicated by the first voice information.
  • the processing unit 1102 may include, for example, the semantic recognition module and the operation decision module in the example shown in FIG. 5 .
  • the processing unit 1102 is specifically configured to: determine a first semantic recognition result according to the first voice information; and determine, according to the first semantic recognition result and a first preset condition, to perform a first operation determined by the first device according to the first semantic recognition result, or determine to perform a second operation instructed by the second device.
  • the processing unit 1102 is specifically configured to: determine to execute the first operation when the first semantic recognition result satisfies the first preset condition.
  • the first device is preset with multiple functions, and the first semantic recognition result satisfies a first preset condition, including: the first semantic recognition result indicates a first function, The first function belongs to the plurality of functions.
  • the first device is preset with multiple intents, and the first semantic recognition result satisfies a first preset condition, including: the first semantic recognition result indicates a first intent, so The first intent belongs to the plurality of intents.
  • the first device is preset with a plurality of parameters, and the first semantic recognition result satisfies a first preset condition, including: the first semantic recognition result indicates the first parameter, the The first parameter belongs to the plurality of parameters.
  • the first semantic recognition result indicates a first function and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes: the first function indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result.
  • the processing unit 1102 is specifically configured to: determine a second semantic recognition result according to the first voice information, where the second semantic recognition result indicates a second function and indicates the first parameter; and, when the multiple functions preset by the first device do not include the second function and the multiple parameters preset by the first device include the first parameter, correct the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, where the first function and the second function are two different functions of the same type.
  • the first semantic recognition result indicates a first intent and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes: the first intent indicated by the first semantic recognition result corresponds to the same parameter type as the first parameter indicated by the first semantic recognition result.
  • the processing unit 1102 is specifically configured to: determine a third semantic recognition result according to the first voice information, where the third semantic recognition result includes a second intent and indicates the first parameter; and, when the multiple intents preset by the first device do not include the second intent and the multiple parameters preset by the first device include the first parameter, correct the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, where the first intent and the second intent are two different intents of the same type.
  • the first semantic recognition result satisfying the first preset condition includes: the first semantic recognition result includes a first indicator bit, and the first indicator bit indicates that the first semantic recognition result satisfies the first preset condition.
  • the processing unit 1102 is specifically configured to: determine a fourth semantic recognition result according to the first voice information, where the fourth semantic recognition result includes a first function and a first parameter; and, when the first function belongs to the multiple functions preset by the first device, the first parameter belongs to the multiple parameters preset by the first device, and the first function and the first parameter correspond to the same parameter type, determine the first semantic recognition result, where the semantic recognition result includes the first indicator bit.
  • the processing unit 1102 is specifically configured to: determine a fifth semantic recognition result according to the first voice information, where the fifth semantic recognition result includes a first intent and a first parameter; and, when the first intent belongs to the multiple intents preset by the first device, the first parameter belongs to the multiple parameters preset by the first device, and the first intent and the first parameter correspond to the same parameter type, determine the first semantic recognition result, where the semantic recognition result includes the first indicator bit.
  • the device further includes: a sending unit, configured to send the first voice information to the second device; the processing unit 1102 is further configured to discard the sixth semantic recognition result from the second device.
  • the processing unit 1102 is specifically configured to: in the case that the first semantic recognition result does not satisfy the first preset condition, obtain the sixth semantic recognition result from the second device, where the sixth semantic recognition result is used to indicate the second operation; and determine, according to the sixth semantic recognition result, to perform the second operation.
  • the processing unit 1102 is further configured to: determine the second parameter and the second parameter type according to the sixth semantic recognition result; the apparatus further includes: a storage unit, configured to save the association between the second parameter and the second parameter type.
  • the obtaining unit 1101 is further configured to obtain second voice information from the voice sensor; the processing unit 1102 is further configured to determine a seventh semantic recognition result according to the second voice information; the processing unit 1102 is further configured to determine, according to the seventh semantic recognition result and a second preset condition, to perform the operation indicated by the first voice information, or determine to perform a third operation determined by the first device according to the seventh semantic recognition result, or determine to perform a fourth operation instructed by the second device.
  • the processing unit 1102 is specifically configured to: when the seventh semantic recognition result satisfies the first preset condition and satisfies the second preset condition, determine to perform the third operation; when the seventh semantic recognition result does not satisfy the first preset condition and satisfies the second preset condition, determine to perform the fourth operation; and when the seventh semantic recognition result does not satisfy the second preset condition, determine to perform the operation corresponding to the first semantic recognition result. A sketch of these three branches follows.
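  • As a reading aid for the branch logic above, here is a minimal Python sketch under assumed names; the function decide, the condition predicates, and the operation labels are ours, not the patent's.

    # Sketch of the three-branch decision over the seventh semantic recognition result.
    def decide(seventh, first_cond, second_cond, third_op, fourth_op, pending_op):
        """first_cond/second_cond are predicates over a semantic recognition result."""
        if second_cond(seventh):
            # high-priority result: break out of the current multi-round interaction
            return third_op if first_cond(seventh) else fourth_op
        # otherwise stay in the flow started by the first voice information
        return pending_op

    op = decide(
        {"function": "safety control"},
        first_cond=lambda r: r["function"] in {"vehicle control", "safety control"},
        second_cond=lambda r: r["function"] == "safety control",
        third_op="device-side operation", fourth_op="cloud-side operation",
        pending_op="continue previous round",
    )
    print(op)  # 'device-side operation'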
  • the seventh semantic recognition result satisfying the second preset condition includes: the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result.
  • that the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result includes one or more of the following: the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result; the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result; the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • the processing unit is specifically configured to: in the case that the first semantic recognition result does not satisfy the first preset condition, obtain a sixth semantic recognition result from the second device, where the sixth semantic recognition result is used to indicate the second operation; and determine, according to the sixth semantic recognition result, to perform the second operation; and that the seventh semantic recognition result satisfies the second preset condition includes: the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • that the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result includes one or more of the following: the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result; the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result; the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • the device further includes: a sending unit, configured to send the second voice information to the second device; the processing unit 1102 is specifically configured to: when the seventh semantic recognition result does not satisfy the first preset condition, or when the seventh semantic recognition result does not satisfy the second preset condition and the first semantic recognition result does not satisfy the first preset condition, obtain an eighth semantic recognition result from the second device; and determine, according to the eighth semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or determine to perform the fourth operation.
  • the sending unit may be, for example, the transceiver module in the example shown in FIG. 5 .
  • the processing unit 1102 is specifically configured to: in the case that the eighth semantic recognition result satisfies the second preset condition, determine to perform the fourth operation; in the case that the eighth semantic recognition result does not satisfy the second preset condition, determine to perform the operation indicated by the first voice information.
  • the eighth semantic recognition result satisfying the second preset condition includes: the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result.
  • that the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result includes one or more of the following: the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result; the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result; the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.
  • the processing unit 1102 is specifically configured to: in the case that the first semantic recognition result does not satisfy the first preset condition, obtain a sixth semantic recognition result from the second device, where the sixth semantic recognition result is used to indicate the second operation; and determine, according to the sixth semantic recognition result, to perform the second operation; and that the eighth semantic recognition result satisfies the second preset condition includes: the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result.
  • that the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result includes one or more of the following: the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result; the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result; the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.
  • the second voice information is irrelevant to the operation indicated by the first voice information.
  • the apparatus further includes: a wake-up module, configured to perform a voice wake-up operation in response to a user's input operation.
  • FIG. 12 is a schematic structural diagram of an apparatus 1200 for voice interaction provided by an embodiment of the present application.
  • the apparatus 1200 may include at least one processor 1202 and a communication interface 1203 .
  • the apparatus 1200 may further include one or more of the memory 1201 and the bus 1204 .
  • any two or all three of the memory 1201 , the processor 1202 and the communication interface 1203 can be connected to each other through the bus 1204 for communication.
  • the memory 1201 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 1201 may store a program.
  • the processor 1202 and the communication interface 1203 are used to execute various steps of the voice interaction method provided by the embodiments of the present application. That is, the processor 1202 may obtain the stored instructions from the memory 1201 through the communication interface 1203, so as to execute various steps of the voice interaction method provided by the embodiments of the present application.
  • the memory 1201 may have the function of the memory 152 shown in FIG. 1 to realize the above-mentioned function of storing programs.
  • the processor 1202 may adopt a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits for executing relevant programs, to implement the functions required to be performed by the units in the apparatus for voice interaction provided by the embodiments of the present application, or to execute various steps of the methods for voice interaction provided by the embodiments of the present application.
  • the processor 1202 may have the function of the processor 151 shown in FIG. 1 , so as to realize the above-mentioned function of executing related programs.
  • the processor 1202 may also be an integrated circuit chip with signal processing capability.
  • each step of the voice interaction method provided by the embodiments of the present application may be completed by an integrated logic circuit of hardware in a processor or an instruction in the form of software.
  • the above-mentioned processor 1202 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, completes the functions required to be performed by the units included in the apparatus for voice interaction according to the embodiments of the present application, or performs each step of the voice interaction method provided by the embodiments of the present application.
  • the communication interface 1203 may use a transceiver device such as, but not limited to, a transceiver to implement communication between the device and other devices or a communication network.
  • the communication interface 1203 can also be, for example, an interface circuit.
  • the bus 1204 may include pathways for transferring information between various components of the device (e.g., memory, processor, communication interface).
  • Embodiments of the present application further provide a computer program product including instructions, which, when executed by a computer, enable the computer to implement the voice interaction method provided by the above method embodiments.
  • An embodiment of the present application further provides a terminal device, where the terminal device includes any of the above apparatuses for voice interaction, for example, the apparatus for voice interaction shown in FIG. 11 or FIG. 12 .
  • the terminal may be a vehicle.
  • the terminal may also be a terminal for remotely controlling the vehicle.
  • the above voice interaction apparatus may be installed on the vehicle or be independent of the vehicle; for example, the target vehicle may be controlled through an unmanned aerial vehicle, another vehicle, a robot, or the like.
  • Computer readable media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, or magnetic tapes), optical discs (e.g., compact discs (CDs), digital versatile discs (DVDs)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, or key drives).
  • the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information.
  • the term "machine-readable medium” may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
  • when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) may be integrated in the processor.
  • a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device may be components.
  • One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • a component may communicate by means of local and/or remote processes, for example according to a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems by means of the signal).
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • in essence, the technical solutions of the present application, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

The present application provides a voice interaction method and apparatus. A first apparatus acquires, through a voice sensor, first voice information containing a user's voice instruction. The first apparatus may determine, according to the first voice information, to perform a target operation indicated by the first voice information. The first apparatus may also send the first voice information to a second apparatus. The first apparatus may determine, according to a first preset condition, to perform a first operation determined by the first apparatus according to the first voice information, or may determine, according to the first preset condition, to perform a second operation instructed by the second apparatus. The solution provided by the present application helps balance the accuracy and efficiency of speech recognition, and helps the apparatus make appropriate and quick responses to the user.

Description

Voice interaction method and apparatus

Technical field

The present application relates to the field of electronic devices, and more specifically, to a voice interaction method and apparatus.

Background

A user and an electronic device can interact by voice. The user can speak a voice instruction to the electronic device. The electronic device can acquire the voice instruction and perform the operation indicated by the voice instruction.

The electronic device can recognize voice instructions by itself, but its ability to recognize voice instructions may be relatively limited. If a voice instruction is recognized only by the electronic device, the recognition result may be inaccurate, and the electronic device may then be unable to give the user an appropriate response.

The electronic device can also upload voice information related to the voice instruction to the cloud; the cloud can recognize the voice information and feed the recognition result back to the electronic device. The speech recognition capability, natural language understanding capability, and the like of the cloud can be relatively strong. However, the interaction between the electronic device and the cloud may depend on the current network state of the electronic device; that is, it may incur a relatively long delay. If a voice instruction is recognized through the cloud, the electronic device may be unable to obtain the recognition result from the cloud in time, and thus unable to respond to the user quickly.

Summary

The present application provides a voice interaction method and apparatus, with the aim of balancing the accuracy and efficiency of speech recognition, which in turn helps make appropriate and quick responses to the user.
According to a first aspect, a voice interaction method is provided, applied to a first apparatus, the method including:

acquiring first voice information from a voice sensor;

determining a first semantic recognition result according to the first voice information;

determining, according to the first semantic recognition result and a first preset condition, to perform a first operation determined by the first apparatus according to the first semantic recognition result, or determining to perform a second operation instructed by a second apparatus.

In one possible example, the second apparatus may indicate the second operation to the first apparatus by sending one or more of a semantic recognition result and operation information to the first apparatus.

In voice interaction scenarios at which the first apparatus is relatively good, the first apparatus can determine the operation corresponding to the user's voice instruction without relying on information provided by the second apparatus. This helps reduce the response delay with which the first apparatus executes the user's voice instruction and improves response efficiency. In voice interaction scenarios at which the first apparatus is relatively poor, the first apparatus can determine the operation corresponding to the user's voice instruction according to information provided by the second apparatus, which helps improve the accuracy with which the first apparatus responds to the user's voice instruction. With the above solution, the processing manner of a voice instruction can be flexibly selected according to the voice interaction scenarios at which the first apparatus is good, balancing response delay against response accuracy.

Optionally, the method further includes: the first apparatus sends the first voice information to the second apparatus.

When the first apparatus sends the first voice information to the second apparatus, the second apparatus can send the first apparatus feedback for the first voice information. If the first apparatus performs the first operation, the first apparatus can adjust its own semantic recognition model, voice control model, and the like according to the feedback of the second apparatus, which helps improve the accuracy of the semantic recognition results output by the first apparatus and optimize the applicability of the operations that respond to the user's voice instructions; alternatively, the first apparatus can ignore the feedback of the second apparatus, for example when that feedback arrives with a large delay relative to the first apparatus's own speech recognition result. If the first apparatus performs the second operation, the first apparatus can also obtain the feedback for the first voice information from the second apparatus relatively quickly. This helps shorten the time for the first apparatus to respond to the user's instruction.

With reference to the first aspect, in some implementations of the first aspect, the determining, according to the first semantic recognition result and the first preset condition, to perform the first operation determined by the first apparatus according to the first semantic recognition result includes:

determining to perform the first operation when the first semantic recognition result satisfies the first preset condition.

Optionally, when the first semantic recognition result does not satisfy the first preset condition, it is determined to perform the second operation instructed by the second apparatus.

The first preset condition helps the first apparatus judge whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition. A minimal sketch of one way such a check could look is given below.
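As a non-normative illustration only, one way such a first-preset-condition check could look is sketched below in Python; the preset sets and field names are assumptions, not part of the claims.

    # Sketch: the result's function/intent/parameter must all be known to the device.
    PRESET_FUNCTIONS = {"vehicle control", "navigation", "audio", "video"}
    PRESET_INTENTS = {"open hardware", "plan route", "play audio", "play video"}
    PRESET_PARAMETERS = {"device A", "position D", "song A"}

    def satisfies_first_preset_condition(result: dict) -> bool:
        func_ok = result["function"] in PRESET_FUNCTIONS if "function" in result else True
        intent_ok = result["intent"] in PRESET_INTENTS if "intent" in result else True
        param_ok = result["parameter"] in PRESET_PARAMETERS if "parameter" in result else True
        return func_ok and intent_ok and param_ok

    print(satisfies_first_preset_condition(
        {"function": "vehicle control", "intent": "open hardware", "parameter": "device A"}))  # True
    print(satisfies_first_preset_condition({"function": "cloud translation"}))  # False -> defer to cloud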
With reference to the first aspect, in some implementations of the first aspect, the first apparatus is preset with multiple functions, and that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result indicates a first function, and the first function belongs to the multiple functions.

Optionally, the first apparatus is preset with multiple functions, and that the first semantic recognition result does not satisfy the first preset condition includes: the first semantic recognition result indicates a first function, and the first function does not belong to the multiple functions.

Optionally, the multiple functions include one or more of the following: a vehicle control function, a navigation function, an audio function, a video function.

The multiple functions preset by the first apparatus may include, for example, the functions supported by the first apparatus. The first apparatus may have relatively higher semantic recognition capability for the functions it has preset, and relatively lower semantic recognition capability for other functions it has not preset. The first apparatus can judge, according to the multiple preset functions and the first function, whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the first apparatus is preset with multiple intents, and that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result indicates a first intent, and the first intent belongs to the multiple intents.

Optionally, the method further includes: the first apparatus is preset with multiple intents, and that the first semantic recognition result does not satisfy the first preset condition includes: the first semantic recognition result indicates a first intent, and the first intent does not belong to the multiple intents.

Optionally, the multiple intents include one or more of the following: a hardware enabling intent, a route planning intent, an audio playing intent, a video playing intent.

The multiple intents preset by the first apparatus may include, for example, the intents supported by the first apparatus. The first apparatus may have relatively higher semantic recognition capability for the intents it has preset, and relatively lower semantic recognition capability for other intents it has not preset. The first apparatus can judge, according to the multiple preset intents and the first intent, whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the first apparatus is preset with multiple parameters, and that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result includes a first parameter, and the first parameter belongs to the multiple parameters.

Optionally, the first apparatus is preset with multiple parameters, and that the first semantic recognition result does not satisfy the first preset condition includes: the first semantic recognition result indicates a first parameter, and the first parameter does not belong to the multiple parameters.

The multiple parameters preset by the first apparatus may include, for example, the parameters supported by the first apparatus. The first apparatus may have relatively higher semantic recognition capability for the parameters it has preset, and relatively lower semantic recognition capability for other parameters it has not preset. The first apparatus can judge, according to the multiple preset parameters and the first parameter, whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the first apparatus is preset with multiple intents corresponding to the first function, and that the first semantic recognition result satisfies the first preset condition further includes:

the first semantic recognition result further indicates a first intent, and the first intent belongs to the multiple intents.

Optionally, the first apparatus is preset with multiple intents corresponding to the first function, and that the first semantic recognition result does not satisfy the first preset condition further includes: the first semantic recognition result further indicates a first intent, and the first intent does not belong to the multiple intents.

The intents corresponding to the first function are usually not unlimited. Establishing correspondences between multiple functions and multiple intents helps the first apparatus judge relatively more accurately whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the first apparatus is preset with multiple parameters corresponding to the first function, and that the first semantic recognition result satisfies the first preset condition further includes:

the first semantic recognition result further indicates a first parameter, and the first parameter belongs to the multiple parameters.

Optionally, the first apparatus is preset with multiple parameters corresponding to the first function, and that the first semantic recognition result does not satisfy the first preset condition further includes: the first semantic recognition result further indicates a first parameter, and the first parameter does not belong to the multiple parameters.

Optionally, the multiple parameter types include one or more of the following parameter types: hardware identifier, time, temperature, location, singer, song, playlist, audio playing mode, movie, TV series, actor, video playing mode.

The parameters corresponding to the first function are usually not unlimited. Establishing correspondences between multiple functions and multiple parameters helps the first apparatus judge relatively more accurately whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the first apparatus is preset with multiple parameters corresponding to the first intent, and that the first semantic recognition result satisfies the first preset condition further includes:

the first semantic recognition result further indicates a first parameter, and the first parameter belongs to the multiple parameters.

Optionally, the first apparatus is preset with multiple parameters corresponding to the first intent, and that the first semantic recognition result does not satisfy the first preset condition further includes: the first semantic recognition result further indicates a first parameter, and the first parameter does not belong to the multiple parameters.

The parameters corresponding to the first intent are usually not unlimited. Establishing correspondences between multiple intents and multiple parameters helps the first apparatus judge relatively more accurately whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the first semantic recognition result indicates a first function and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes:

the first function indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to the same parameter type.

Optionally, the first semantic recognition result indicates a first function and indicates a first parameter, and that the first semantic recognition result does not satisfy the first preset condition further includes: the first function indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to different parameter types.

The following examples illustrate the parameter types to which functions can correspond.

For example, the parameter types corresponding to a vehicle control function can be time, temperature, hardware identifier, and the like.

For another example, the parameter type corresponding to a temperature control function can be temperature and the like.

For another example, the parameter types corresponding to a navigation function can be location, time, and the like.

For another example, the parameter types corresponding to an audio function can be singer, song, playlist, time, audio playing mode, and the like.

For another example, the parameter types corresponding to a video function can be movie, TV series, actor, time, video playing mode, and the like.

The following examples illustrate the parameter types to which parameters can correspond.

For example, the parameter type corresponding to air conditioner, camera, seat, window, and the like can be hardware identifier.

For another example, the parameter type corresponding to 5°C, 28°C, and the like can be temperature.

For another example, the parameter type corresponding to 1 hour, 1 minute, and the like can be time.

For another example, the parameter type corresponding to position A, position B, and the like can be location.

For another example, the parameter type corresponding to singer A, singer B, and the like can be singer.

For another example, the parameter type corresponding to song A, song B, and the like can be song.

For another example, the parameter type corresponding to playlist A, playlist B, and the like can be playlist.

For another example, the parameter type corresponding to standard playback, high-quality playback, lossless playback, and the like can be audio playing mode.

For another example, the parameter type corresponding to movie A, movie B, and the like can be movie.

For another example, the parameter type corresponding to TV series A, TV series B, and the like can be TV series.

For another example, the parameter type corresponding to actor A, actor B, and the like can be actor.

For another example, the parameter type corresponding to standard-definition playback, high-definition playback, ultra-high-definition playback, Blu-ray playback, and the like can be video playing mode.

If the first apparatus has obtained in advance that the first parameter corresponds to the first parameter type and that the first function corresponds to the first parameter type, the accuracy of the first apparatus's recognition of the first voice information may be relatively high. If the first apparatus has obtained in advance that the first parameter corresponds to the first parameter type, but the first function does not correspond to the first parameter type, the accuracy of the first apparatus's analysis of the first voice information may be relatively low. Establishing relationships between multiple parameters and multiple functions through parameter types helps the first apparatus judge relatively more accurately whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

In addition, if multiple functions all correspond to the same type of parameter, parameters of that type can be carried over among those functions. This helps reduce the complexity of establishing relationships between parameters and functions. A sketch of such a type-compatibility check follows.
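The following Python sketch illustrates the parameter-type correspondence check just described; the tables below are invented examples based on the text, not an authoritative mapping.

    # Sketch of the function/parameter type-compatibility check.
    FUNCTION_PARAM_TYPES = {
        "vehicle control": {"time", "temperature", "hardware id"},
        "navigation": {"location", "time"},
        "audio": {"singer", "song", "playlist", "time", "audio mode"},
    }
    PARAM_TYPE = {"air conditioner": "hardware id", "28C": "temperature", "position A": "location"}

    def same_parameter_type(function: str, parameter: str) -> bool:
        """True if the function and the parameter correspond to the same parameter type."""
        return PARAM_TYPE.get(parameter) in FUNCTION_PARAM_TYPES.get(function, set())

    print(same_parameter_type("vehicle control", "air conditioner"))  # True
    print(same_parameter_type("navigation", "air conditioner"))       # False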
With reference to the first aspect, in some implementations of the first aspect, the determining a first semantic recognition result according to the first voice information includes:

determining a second semantic recognition result according to the first voice information, the second semantic recognition result indicating a second function and indicating the first parameter;

when the multiple functions preset by the first apparatus do not include the second function, and the multiple parameters preset by the first apparatus include the first parameter, correcting the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, the first function and the second function being two different functions of the same type.

For example, the first function can be a local translation function, and the second function can be a cloud translation function. Both the first function and the second function can belong to translation-type functions.

For another example, the first function can be a local navigation function, and the second function can be a cloud navigation function. Both can belong to navigation-type functions.

For another example, the first function can be a local audio function, and the second function can be a cloud audio function. Both can belong to audio-playing-type functions.

For another example, the first function can be a local video function, and the second function can be a cloud video function. Both can belong to video-playing-type functions.

That the second function does not belong to the multiple functions preset by the first apparatus means that the first apparatus may have relatively weak semantic recognition capability for the second function. The first apparatus can, for example, learn voice instructions related to the second function multiple times, thereby gradually improving its semantic recognition capability for the second function. In other words, by modifying the second function to the first function, the first apparatus can apply learned skills in fields at which it is relatively poor, which helps increase the applicable scenarios of device-side decision-making and in turn helps improve the efficiency of speech recognition. A sketch of this correction follows.
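The correction step above can be pictured as follows; this is a hedged sketch in Python, and the SAME_TYPE pairs and preset sets are assumed for illustration.

    # Sketch of correcting a second function to a first function of the same type.
    SAME_TYPE = {"cloud translation": "local translation",
                 "cloud navigation": "local navigation"}
    PRESET_FUNCTIONS = {"local translation", "local navigation"}
    PRESET_PARAMETERS = {"hello", "position A"}

    def correct_function(result: dict) -> dict:
        func, param = result.get("function"), result.get("parameter")
        if func not in PRESET_FUNCTIONS and param in PRESET_PARAMETERS:
            corrected = SAME_TYPE.get(func)
            if corrected in PRESET_FUNCTIONS:
                return {**result, "function": corrected}  # first semantic recognition result
        return result

    print(correct_function({"function": "cloud translation", "parameter": "hello"}))
    # {'function': 'local translation', 'parameter': 'hello'}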
With reference to the first aspect, in some implementations of the first aspect, the first semantic recognition result indicates a first intent and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes:

the first intent indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to the same parameter type.

Optionally, the first semantic recognition result indicates a first intent and indicates a first parameter, and that the first semantic recognition result does not satisfy the first preset condition further includes: the first intent indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to different parameter types.

The following examples illustrate the parameter types to which intents can correspond.

For example, the parameter types corresponding to a hardware enabling intent can be time, temperature, hardware identifier, and the like.

For another example, the parameter types corresponding to a route planning intent can be location, time, and the like.

For another example, the parameter types corresponding to an audio playing intent can be singer, song, playlist, time, audio playing mode, and the like.

For another example, the parameter types corresponding to a video playing intent can be movie, TV series, actor, time, video playing mode, and the like.

The following examples illustrate the parameter types to which parameters can correspond.

For example, the parameter type corresponding to air conditioner, camera, seat, window, and the like can be hardware identifier.

For another example, the parameter type corresponding to 5°C, 28°C, and the like can be temperature.

For another example, the parameter type corresponding to 1 hour, 1 minute, and the like can be time.

For another example, the parameter type corresponding to position A, position B, and the like can be location.

For another example, the parameter type corresponding to singer A, singer B, and the like can be singer.

For another example, the parameter type corresponding to song A, song B, and the like can be song.

For another example, the parameter type corresponding to playlist A, playlist B, and the like can be playlist.

For another example, the parameter type corresponding to standard playback, high-quality playback, lossless playback, and the like can be audio playing mode.

For another example, the parameter type corresponding to movie A, movie B, and the like can be movie.

For another example, the parameter type corresponding to TV series A, TV series B, and the like can be TV series.

For another example, the parameter type corresponding to actor A, actor B, and the like can be actor.

For another example, the parameter type corresponding to standard-definition playback, high-definition playback, ultra-high-definition playback, Blu-ray playback, and the like can be video playing mode.

If the first apparatus has obtained in advance that the first parameter corresponds to the first parameter type and that the first intent corresponds to the first parameter type, the accuracy of the first apparatus's analysis of the first voice information may be relatively high. If the first apparatus has obtained in advance that the first parameter corresponds to the first parameter type, but the first intent does not correspond to the first parameter type, the accuracy of the first apparatus's analysis of the first voice information may be relatively low. Establishing relationships between multiple parameters and multiple intents through parameter types helps the first apparatus judge relatively more accurately whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

In addition, if multiple intents all correspond to the same type of parameter, parameters of that type can be carried over among those intents. This helps reduce the complexity of establishing relationships between parameters and intents.

With reference to the first aspect, in some implementations of the first aspect, the determining a first semantic recognition result according to the first voice information includes:

determining a third semantic recognition result according to the first voice information, the third semantic recognition result indicating a second intent and indicating the first parameter;

when the multiple intents preset by the first apparatus do not include the second intent, and the multiple parameters preset by the first apparatus include the first parameter, correcting the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, the first intent and the second intent being two different intents of the same type.

For example, the first intent can be a local English translation intent, and the second intent can be a cloud English translation intent. Both the first intent and the second intent can belong to English-translation-type intents.

For another example, the first intent can be a local route planning intent, and the second intent can be a cloud route planning intent. Both can belong to route-planning-type intents.

For another example, the first intent can be a local audio playing intent, and the second intent can be a cloud audio playing intent. Both can belong to audio-playing-type intents.

For another example, the first intent can be a local video playing intent, and the second intent can be a cloud video playing intent. Both can belong to video-playing-type intents.

That the second intent does not belong to the multiple intents preset by the first apparatus means that the first apparatus may have relatively weak semantic recognition capability for the second intent. The first apparatus can, for example, learn voice instructions related to the second intent multiple times, thereby gradually improving its semantic recognition capability for the second intent. In other words, by modifying the second intent to the first intent, the first apparatus can apply learned skills in fields at which it is relatively poor, which helps increase the applicable scenarios of device-side decision-making and in turn helps improve the efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result includes a first indicator bit, and the first indicator bit indicates that the first semantic recognition result satisfies the first preset condition.

Optionally, that the first semantic recognition result does not satisfy the first preset condition includes: the first semantic recognition result includes a second indicator bit, and the second indicator bit indicates that the first semantic recognition result does not satisfy the first preset condition.

With reference to the first aspect, in some implementations of the first aspect,

the determining a first semantic recognition result according to the first voice information includes:

determining a fourth semantic recognition result according to the first voice information, the fourth semantic recognition result including a first function and a first parameter;

when the first function belongs to the multiple functions preset by the first apparatus, the first parameter belongs to the multiple parameters preset by the first apparatus, and the first function and the first parameter correspond to the same parameter type, determining the first semantic recognition result, the semantic recognition result including the first indicator bit.

The first apparatus may have relatively weak semantic recognition capability for the first function. The first apparatus can, for example, learn voice instructions related to the first function multiple times, thereby gradually improving its semantic recognition capability for the first function. In other words, by carrying the first indicator bit in the semantic recognition result, the first apparatus can apply learned skills in fields at which it is relatively poor, which helps increase the applicable scenarios of device-side decision-making and in turn helps improve the efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect,

the determining a first semantic recognition result according to the first voice information includes:

determining a fifth semantic recognition result according to the first voice information, the fifth semantic recognition result including a first intent and a first parameter;

when the first intent belongs to the multiple intents preset by the first apparatus, the first parameter belongs to the multiple parameters preset by the first apparatus, and the first intent and the first parameter correspond to the same parameter type, determining the first semantic recognition result, the semantic recognition result including the first indicator bit.

The first apparatus may have relatively weak semantic recognition capability for the first intent. The first apparatus can, for example, learn voice instructions related to the first intent multiple times, thereby gradually improving its semantic recognition capability for the first intent. In other words, by carrying the first indicator bit in the semantic recognition result, the first apparatus can apply learned skills in fields at which it is relatively poor, which helps increase the applicable scenarios of device-side decision-making and in turn helps improve the efficiency of speech recognition. A sketch of how such an indicator bit could be attached follows.
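The following Python sketch shows one way the indicator bit could be attached to a fifth semantic recognition result; the field names and preset tables are assumptions made for illustration.

    # Sketch: attach the first indicator bit when intent and parameter are both
    # preset and type-compatible.
    PRESET_INTENTS = {"play audio"}
    PRESET_PARAMETERS = {"song A"}
    INTENT_PARAM_TYPES = {"play audio": {"song"}}
    PARAM_TYPE = {"song A": "song"}

    def with_indicator_bit(fifth_result: dict) -> dict:
        intent, param = fifth_result["intent"], fifth_result["parameter"]
        ok = (intent in PRESET_INTENTS and param in PRESET_PARAMETERS
              and PARAM_TYPE.get(param) in INTENT_PARAM_TYPES.get(intent, set()))
        return {**fifth_result, "indicator_bit": 1 if ok else 0}

    print(with_indicator_bit({"intent": "play audio", "parameter": "song A"}))
    # {'intent': 'play audio', 'parameter': 'song A', 'indicator_bit': 1}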
With reference to the first aspect, in some implementations of the first aspect, the method further includes:

sending the first voice information to the second apparatus;

discarding a sixth semantic recognition result from the second apparatus.

With reference to the first aspect, in some implementations of the first aspect, the determining, according to the first semantic recognition result and the first preset condition, to perform the second operation instructed by the second apparatus includes:

when the first semantic recognition result does not satisfy the first preset condition, acquiring a sixth semantic recognition result from the second apparatus, the sixth semantic recognition result being used to indicate the second operation;

determining, according to the sixth semantic recognition result, to perform the second operation.

The first preset condition helps the first apparatus judge whether it can recognize the current voice information relatively accurately, which in turn helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, the method further includes:

determining a second parameter and a second parameter type according to the sixth semantic recognition result;

saving the association between the second parameter and the second parameter type.

The first apparatus can thus learn new parameters. This helps increase the applicable scenarios of device-side decision-making, and in turn helps improve the efficiency of speech recognition. A sketch of this learning step follows.
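The learning step can be pictured as follows; this is a minimal sketch assuming a plain dictionary stands in for the storage unit, and the field names are ours.

    # Sketch: the device learns a new parameter/parameter-type association from
    # the cloud's sixth semantic recognition result.
    learned_param_types = {}  # stands in for the storage unit

    def learn_from_cloud(sixth_result: dict) -> None:
        param = sixth_result.get("parameter")
        param_type = sixth_result.get("parameter_type")
        if param and param_type:
            learned_param_types[param] = param_type  # save the association

    learn_from_cloud({"function": "audio", "parameter": "song B", "parameter_type": "song"})
    print(learned_param_types)  # {'song B': 'song'} -- usable in later device-side rounds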
With reference to the first aspect, in some implementations of the first aspect, the method further includes:

acquiring second voice information from the voice sensor;

determining a seventh semantic recognition result according to the second voice information;

determining, according to the seventh semantic recognition result and a second preset condition, to perform the operation indicated by the first voice information, or determining to perform a third operation determined by the first apparatus according to the seventh semantic recognition result, or determining to perform a fourth operation instructed by the second apparatus.

In a multi-round voice interaction scenario, the user and the first apparatus can hold a voice dialogue about a particular scenario or a particular field. In one possible scenario, the user may be unable to fully achieve the purpose of voice control with a single voice instruction. Two adjacent rounds in a multi-round voice interaction are usually related. However, the user's reply may have a certain randomness, and may be unrelated to the voice information the first apparatus is asking about or waiting for. If the first apparatus completely follows the user's reply, the content of the previous voice interaction may be invalidated, which may increase the number of times the user has to operate the first apparatus. If the first apparatus completely ignores the user's reply, the first apparatus may be unable to respond to the user's instruction in certain special scenarios, so that the user's voice instruction loses effect. The second preset condition is used to indicate whether the first apparatus ends the multi-round voice interaction, which helps the first apparatus choose relatively appropriately whether to jump out of the multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, the determining, according to the seventh semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or determining to perform the third operation determined by the first apparatus according to the seventh semantic recognition result, or determining to perform the fourth operation instructed by the second apparatus includes:

when the seventh semantic recognition result satisfies the first preset condition and satisfies the second preset condition, determining to perform the third operation;

when the seventh semantic recognition result does not satisfy the first preset condition and satisfies the second preset condition, determining to perform the fourth operation;

when the seventh semantic recognition result does not satisfy the second preset condition, determining to perform the operation corresponding to the first semantic recognition result.

In one example, if the first operation has been performed and the first apparatus determines to perform the third operation, the new device-side voice interaction can end the previous round of device-side voice interaction.

In one example, if the second operation has been performed and the first apparatus determines to perform the third operation, the new device-side voice interaction can end the previous round of cloud-side voice interaction.

In one example, if the first operation has been performed and the first apparatus determines to perform the fourth operation, the new cloud-side voice interaction can end the previous round of device-side voice interaction.

In one example, if the second operation has been performed and the first apparatus determines to perform the fourth operation, the new cloud-side voice interaction can end the previous round of cloud-side voice interaction.

The first apparatus can judge the first preset condition and the second preset condition together, which helps the first apparatus choose relatively appropriately whether to jump out of the multi-round voice interaction, and helps balance the accuracy and efficiency of speech recognition.

With reference to the first aspect, in some implementations of the first aspect, that the seventh semantic recognition result satisfies the second preset condition includes:

the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result.

The user can end the current multi-round voice interaction through a high-priority voice instruction.

With reference to the first aspect, in some implementations of the first aspect, that the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result includes one or more of the following:

the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result;

the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result;

the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.

Functions, intents, and parameters can better reflect the current voice interaction scenario. The priorities of functions, intents, and parameters help the apparatus judge relatively accurately whether to jump out of the current multi-round voice interaction. A sketch of such a priority comparison follows.
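The field-wise priority comparison above could look like the following Python sketch; the priority tables are invented, and treating any single higher-priority field as sufficient is our reading of "one or more of the following".

    # Sketch of the "higher priority" comparison over function, intent, and parameter.
    FUNC_P = {"safety control": 2, "navigation": 1}
    INTENT_P = {"enable accident mode": 2, "choose waypoint": 1}
    PARAM_P = {}

    def higher_priority(a: dict, b: dict) -> bool:
        """True if result a outranks result b on at least one of the three fields."""
        return any(
            table.get(a.get(key), 0) > table.get(b.get(key), 0)
            for key, table in (("function", FUNC_P), ("intent", INTENT_P), ("parameter", PARAM_P))
        )

    print(higher_priority({"function": "safety control", "intent": "enable accident mode"},
                          {"function": "navigation", "intent": "choose waypoint"}))  # True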
With reference to the first aspect, in some implementations of the first aspect, the determining, according to the first semantic recognition result and the first preset condition, to perform the second operation instructed by the second apparatus includes:

when the first semantic recognition result does not satisfy the first preset condition, acquiring a sixth semantic recognition result from the second apparatus, the sixth semantic recognition result being used to indicate the second operation;

determining, according to the sixth semantic recognition result, to perform the second operation;

and that the seventh semantic recognition result satisfies the second preset condition includes:

the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result.

The sixth semantic recognition result is indicated by the second apparatus, and can have relatively higher accuracy. Comparing the priority of the seventh semantic recognition result with the priority of the sixth semantic recognition result helps the first apparatus choose relatively appropriately whether to jump out of the multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, that the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result includes one or more of the following:

the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result;

the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result;

the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.

Functions, intents, and parameters can better reflect the current voice interaction scenario. The priorities of functions, intents, and parameters help the apparatus judge relatively accurately whether to jump out of the current multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, the method further includes:

sending the second voice information to the second apparatus;

and the determining, according to the seventh semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or determining to perform the fourth operation instructed by the second apparatus includes:

when the seventh semantic recognition result does not satisfy the first preset condition, or when the seventh semantic recognition result does not satisfy the second preset condition and the first semantic recognition result does not satisfy the first preset condition, acquiring an eighth semantic recognition result from the second apparatus;

determining, according to the eighth semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or determining to perform the fourth operation.

When the seventh semantic recognition result does not satisfy the first preset condition, the seventh semantic recognition result obtained by the first apparatus may be relatively inaccurate, while the eighth semantic recognition result obtained by the second apparatus may be relatively accurate. The first apparatus can judge whether the eighth semantic recognition result satisfies the second preset condition, which helps determine relatively accurately whether to end the current multi-round voice interaction.

If the seventh semantic recognition result does not satisfy the second preset condition, the first apparatus need not determine the operation to be performed according to the seventh semantic recognition result. That the first semantic recognition result does not satisfy the first preset condition can mean that the current multi-round voice interaction is a cloud-side voice interaction. In this case, the first apparatus can acquire the eighth semantic recognition result from the second apparatus to continue the current cloud-side multi-round voice interaction. This helps maintain the cloud-side multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, the determining, according to the eighth semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or determining to perform the fourth operation includes:

when the eighth semantic recognition result satisfies the second preset condition, determining to perform the fourth operation;

when the eighth semantic recognition result does not satisfy the second preset condition, determining to perform the operation indicated by the first voice information.

The eighth semantic recognition result is indicated by the second apparatus, and can have relatively higher accuracy. The priority of the eighth semantic recognition result helps the first apparatus choose relatively appropriately whether to jump out of the multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, that the eighth semantic recognition result satisfies the second preset condition includes:

the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result.

The user can end the current multi-round voice interaction through a high-priority voice instruction.

With reference to the first aspect, in some implementations of the first aspect, that the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result includes one or more of the following:

the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result;

the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result;

the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.

Functions, intents, and parameters can better reflect the current voice interaction scenario. The priorities of functions, intents, and parameters help the apparatus judge relatively accurately whether to jump out of the current multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, the determining, according to the first semantic recognition result and the first preset condition, to perform the second operation instructed by the second apparatus includes:

when the first semantic recognition result does not satisfy the first preset condition, acquiring a sixth semantic recognition result from the second apparatus, the sixth semantic recognition result being used to indicate the second operation;

determining, according to the sixth semantic recognition result, to perform the second operation;

and that the eighth semantic recognition result satisfies the second preset condition includes:

the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result.

The sixth semantic recognition result and the eighth semantic recognition result are both indicated by the second apparatus, and both can have relatively higher accuracy. Comparing the priority of the eighth semantic recognition result with the priority of the sixth semantic recognition result helps the first apparatus choose relatively appropriately whether to jump out of the multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, that the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result includes one or more of the following:

the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result;

the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result;

the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.

Functions, intents, and parameters can better reflect the current voice interaction scenario. The priorities of functions, intents, and parameters help the apparatus judge relatively accurately whether to jump out of the current multi-round voice interaction.

With reference to the first aspect, in some implementations of the first aspect, the second voice information is irrelevant to the operation indicated by the first voice information.

For example, the relevance between the second voice information and the operation indicated by the first voice information is lower than a second preset threshold. For another example, the first voice information is irrelevant to the second voice information. For another example, the degree of association between the first voice information and the second voice information is lower than a second preset threshold. For another example, the first voice information and the second voice information differ in one or more of the indicated function, intent, and parameter.

With reference to the first aspect, in some implementations of the first aspect, the method further includes:

performing a voice wake-up operation in response to a user's input operation.
According to a second aspect, a voice interaction apparatus is provided, including:

an acquiring unit, configured to acquire first voice information from a voice sensor;

a processing unit, configured to determine a first semantic recognition result according to the first voice information;

the processing unit being further configured to determine, according to the first semantic recognition result and a first preset condition, to perform a first operation determined by the first apparatus according to the first semantic recognition result, or to determine to perform a second operation instructed by a second apparatus.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

determine to perform the first operation when the first semantic recognition result satisfies the first preset condition.

With reference to the second aspect, in some implementations of the second aspect, the first apparatus is preset with multiple functions, and that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result indicates a first function, and the first function belongs to the multiple functions.

With reference to the second aspect, in some implementations of the second aspect, the first apparatus is preset with multiple intents, and that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result indicates a first intent, and the first intent belongs to the multiple intents.

With reference to the second aspect, in some implementations of the second aspect, the first apparatus is preset with multiple parameters, and that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result indicates a first parameter, and the first parameter belongs to the multiple parameters.

With reference to the second aspect, in some implementations of the second aspect, the first semantic recognition result indicates a first function and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes:

the first function indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to the same parameter type.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

determine a second semantic recognition result according to the first voice information, the second semantic recognition result indicating a second function and indicating the first parameter;

when the multiple functions preset by the first apparatus do not include the second function, and the multiple parameters preset by the first apparatus include the first parameter, correct the second function in the second semantic recognition result to the first function to obtain the first semantic recognition result, the first function and the second function being two different functions of the same type.

With reference to the second aspect, in some implementations of the second aspect, the first semantic recognition result indicates a first intent and indicates a first parameter, and that the first semantic recognition result satisfies the first preset condition further includes:

the first intent indicated by the first semantic recognition result and the first parameter indicated by the first semantic recognition result correspond to the same parameter type.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

determine a third semantic recognition result according to the first voice information, the third semantic recognition result indicating a second intent and indicating a first parameter;

when the multiple intents preset by the first apparatus do not include the second intent, and the multiple parameters preset by the first apparatus include the first parameter, correct the second intent in the third semantic recognition result to the first intent to obtain the first semantic recognition result, the first intent and the second intent being two different intents of the same type.

With reference to the second aspect, in some implementations of the second aspect, that the first semantic recognition result satisfies the first preset condition includes:

the first semantic recognition result includes a first indicator bit, and the first indicator bit indicates that the first semantic recognition result satisfies the first preset condition.

With reference to the second aspect, in some implementations of the second aspect,

the processing unit is specifically configured to:

determine a fourth semantic recognition result according to the first voice information, the fourth semantic recognition result including a first function and a first parameter;

when the first function belongs to the multiple functions preset by the first apparatus, the first parameter belongs to the multiple parameters preset by the first apparatus, and the first function and the first parameter correspond to the same parameter type, determine the first semantic recognition result, the semantic recognition result including the first indicator bit.

With reference to the second aspect, in some implementations of the second aspect,

the processing unit is specifically configured to:

determine a fifth semantic recognition result according to the first voice information, the fifth semantic recognition result including a first intent and a first parameter;

when the first intent belongs to the multiple intents preset by the first apparatus, the first parameter belongs to the multiple parameters preset by the first apparatus, and the first intent and the first parameter correspond to the same parameter type, determine the first semantic recognition result, the semantic recognition result including the first indicator bit.

With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes:

a sending unit, configured to send the first voice information to the second apparatus;

the processing unit being further configured to discard a sixth semantic recognition result from the second apparatus.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

when the first semantic recognition result does not satisfy the first preset condition, acquire a sixth semantic recognition result from the second apparatus, the sixth semantic recognition result being used to indicate the second operation;

determine, according to the sixth semantic recognition result, to perform the second operation.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to:

determine a second parameter and a second parameter type according to the sixth semantic recognition result;

and the apparatus further includes a storage unit, configured to save the association between the second parameter and the second parameter type.

With reference to the second aspect, in some implementations of the second aspect,

the acquiring unit is further configured to acquire second voice information from the voice sensor;

the processing unit is further configured to determine a seventh semantic recognition result according to the second voice information;

the processing unit is further configured to determine, according to the seventh semantic recognition result and a second preset condition, to perform the operation indicated by the first voice information, or to determine to perform a third operation determined by the first apparatus according to the seventh semantic recognition result, or to determine to perform a fourth operation instructed by the second apparatus.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

when the seventh semantic recognition result satisfies the first preset condition and satisfies the second preset condition, determine to perform the third operation;

when the seventh semantic recognition result does not satisfy the first preset condition and satisfies the second preset condition, determine to perform the fourth operation;

when the seventh semantic recognition result does not satisfy the second preset condition, determine to perform the operation corresponding to the first semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, that the seventh semantic recognition result satisfies the second preset condition includes:

the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, that the priority of the seventh semantic recognition result is higher than the priority of the first semantic recognition result includes one or more of the following:

the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result;

the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result;

the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

when the first semantic recognition result does not satisfy the first preset condition, acquire a sixth semantic recognition result from the second apparatus, the sixth semantic recognition result being used to indicate the second operation;

determine, according to the sixth semantic recognition result, to perform the second operation;

and that the seventh semantic recognition result satisfies the second preset condition includes:

the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, that the priority of the seventh semantic recognition result is higher than the priority of the sixth semantic recognition result includes one or more of the following:

the priority of the function indicated by the seventh semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result;

the priority of the intent indicated by the seventh semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result;

the priority of the parameter indicated by the seventh semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes:

a sending unit, configured to send the second voice information to the second apparatus;

the processing unit being specifically configured to:

when the seventh semantic recognition result does not satisfy the first preset condition, or when the seventh semantic recognition result does not satisfy the second preset condition and the first semantic recognition result does not satisfy the first preset condition, acquire an eighth semantic recognition result from the second apparatus;

determine, according to the eighth semantic recognition result and the second preset condition, to perform the operation indicated by the first voice information, or determine to perform the fourth operation.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

when the eighth semantic recognition result satisfies the second preset condition, determine to perform the fourth operation;

when the eighth semantic recognition result does not satisfy the second preset condition, determine to perform the operation indicated by the first voice information.

With reference to the second aspect, in some implementations of the second aspect, that the eighth semantic recognition result satisfies the second preset condition includes:

the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, that the priority of the eighth semantic recognition result is higher than the priority of the first semantic recognition result includes one or more of the following:

the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the first semantic recognition result;

the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the first semantic recognition result;

the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the first semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:

when the first semantic recognition result does not satisfy the first preset condition, acquire a sixth semantic recognition result from the second apparatus, the sixth semantic recognition result being used to indicate the second operation;

determine, according to the sixth semantic recognition result, to perform the second operation;

and that the eighth semantic recognition result satisfies the second preset condition includes:

the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, that the priority of the eighth semantic recognition result is higher than the priority of the sixth semantic recognition result includes one or more of the following:

the priority of the function indicated by the eighth semantic recognition result is higher than the priority of the function indicated by the sixth semantic recognition result;

the priority of the intent indicated by the eighth semantic recognition result is higher than the priority of the intent indicated by the sixth semantic recognition result;

the priority of the parameter indicated by the eighth semantic recognition result is higher than the priority of the parameter indicated by the sixth semantic recognition result.

With reference to the second aspect, in some implementations of the second aspect, the second voice information is irrelevant to the operation indicated by the first voice information.

With reference to the second aspect, in some implementations of the second aspect, the apparatus further includes:

a wake-up module, configured to perform a voice wake-up operation in response to a user's input operation.
According to a third aspect, a voice interaction apparatus is provided, the apparatus including: a processor and a memory, the processor being coupled to the memory, the memory being configured to store a computer program, and the processor being configured to execute the computer program stored in the memory, so that the apparatus performs the method described in any one of the possible implementations of the first aspect.

According to a fourth aspect, a computer-readable medium is provided, the computer-readable medium storing program code for execution by a device, the program code including instructions for performing the method described in any one of the implementations of the first aspect.

According to a fifth aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to perform the method described in any one of the implementations of the first aspect.

According to a sixth aspect, a chip is provided, the chip including a processor and a data interface, the processor reading, through the data interface, instructions stored in a memory, to perform the method described in any one of the implementations of the first aspect.

Optionally, as an implementation, the chip may further include a memory, the memory storing instructions, the processor being configured to execute the instructions stored in the memory, and, when the instructions are executed, the processor being configured to perform the method described in any one of the implementations of the first aspect.

According to a seventh aspect, a voice interaction system is provided, the voice interaction system including the first apparatus described in any one of the possible implementations of the first aspect and the second apparatus described in any one of the possible implementations of the first aspect, the first apparatus being configured to perform the method described in any one of the possible implementations of the first aspect.

The solution provided by the embodiments of the present application can make it easy for the first apparatus to judge whether it is capable of independently recognizing the user's voice instruction. In voice interaction scenarios at which the first apparatus is relatively good, the first apparatus can independently determine the operation corresponding to the user's voice instruction, which helps reduce the response delay of executing the user's voice instruction and improve response efficiency. In voice interaction scenarios at which the first apparatus is relatively poor, the first apparatus can choose to perform the operation instructed by another apparatus, which helps improve the accuracy with which the first apparatus responds to the user's voice instruction. In addition, having a voice instruction collected by the sensor processed both by the local processor and by the cloud, and adaptively choosing to perform the operation fed back by the local processor or by the cloud, can balance response efficiency and response accuracy. The solution provided by the embodiments of the present application enables the first apparatus to keep learning new voice instructions, broadening the voice interaction scenarios at which the first apparatus is relatively good. The solution provided by the present application helps the first apparatus appropriately choose whether to jump out of a multi-round voice interaction, which in turn helps improve the voice interaction effect.
Brief description of drawings

FIG. 1 is a schematic diagram of a voice interaction system.

FIG. 2 is a schematic diagram of a system architecture.

FIG. 3 is a schematic diagram of a voice interaction system.

FIG. 4 is a schematic diagram of a voice interaction system.

FIG. 5 is a voice interaction system provided by an embodiment of the present application.

FIG. 6 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.

FIG. 7 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.

FIG. 8 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.

FIG. 9 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.

FIG. 10 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.

FIG. 11 is a schematic structural diagram of a voice interaction apparatus provided by an embodiment of the present application.

FIG. 12 is a schematic structural diagram of a voice interaction apparatus provided by an embodiment of the present application.
Detailed description of embodiments

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

Several possible application scenarios of speech recognition are introduced below.

Application scenario 1: intelligent driving

In an intelligent driving application scenario, a user can control an intelligent driving device by voice. For example, the user can issue voice instructions to an in-vehicle voice assistant to control the intelligent driving device. In some possible examples, by voice, the user can adjust the inclination of a seat backrest, adjust the temperature of the in-vehicle air conditioner, turn the seat heater on or off, turn the vehicle lights on or off, open or close the windows, open or close the trunk, plan a navigation route, play a personalized playlist, and so on. In the intelligent driving application scenario, voice interaction helps provide the user with a convenient driving environment.

Application scenario 2: smart home

In a smart home application scenario, a user can control smart home devices by voice. For example, the user can issue voice instructions to an Internet-of-things device (for example, a smart home device) or an Internet-of-things control device (such as a mobile phone) to control the Internet-of-things device. In some possible examples, by voice, the user can control the temperature of a smart air conditioner, make a smart TV play a TV series specified by the user, make a smart cooking device start at a time specified by the user, open or close smart curtains, adjust the color temperature of smart lighting, and so on. In the smart home application scenario, voice interaction helps provide the user with a comfortable home environment.

FIG. 1 is a schematic diagram of a voice interaction system 100.

The execution device 110 may be a device with speech recognition capability, natural language understanding capability, and the like. The execution device 110 may be, for example, a server. Optionally, the execution device 110 may also cooperate with other computing devices, such as data storage, routers, load balancers, and other devices. The execution device 110 may be arranged on one physical site or distributed across multiple physical sites. The execution device 110 may use the data in the data storage system 150, or call the program code in the data storage system 150, to implement at least one of the functions of speech recognition, machine learning, deep learning, model training, and so on. The data storage system 150 in FIG. 1 may be integrated on the execution device 110, or may be arranged on the cloud or on other network servers.

Users may operate their respective local devices (for example, local device 101 and local device 102) to interact with the execution device 110. The local devices shown in FIG. 1 may represent, for example, various types of voice interaction terminals.

A user's local device may interact with the execution device 110 through a wired or wireless communication network; the standard of the communication network is not limited, and may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.

In one implementation, the local device 101 may provide local data to, or feed back computation results to, the execution device 110.

In another implementation, all or part of the functions of the execution device 110 may be implemented by a local device. For example, the local device 101 implements the functions of the execution device 110 and provides services to its own user, or provides services to the user of the local device 102.

FIG. 2 is a schematic diagram of a system architecture 200.

The data collection device 260 may be used to collect training data. The data collection device 260 may also be used to store the training data in the database 230. The training device 220 may obtain the target model/rule 201 by training based on the training data maintained in the database 230. The target model/rule 201 obtained by training here may be used to perform the voice interaction method of the embodiments of the present application. The training device 220 does not necessarily train the target model/rule 201 entirely based on the training data maintained in the database 230; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.

The training data maintained in the database 230 does not necessarily all come from the collection of the data collection device 260; it may also be received from other devices. In one example, the training data in the database 230 may be obtained through the client device 240, or may be obtained through the execution device 210. The client device 240 may include, for example, various types of voice interaction terminals. The execution device 210 may be a device with speech recognition capability, natural language understanding capability, and the like. For example, voice information may be obtained through the data collection device 260 and processed accordingly, so as to obtain training data such as text features of input text and phonetic features of target speech; text features of input text and phonetic features of target speech may also be obtained through the data collection device 260. As another example, voice information may serve directly as training data. In another example, the same account may be logged in on multiple client devices 240, and the data collected by those client devices 240 may all be maintained in the database 230.

Optionally, the above training data may include, for example, one or more of speech, corpora, hot words, and other data. Speech may refer to sound loaded with a certain linguistic meaning. A corpus, i.e., language material, may describe, with text and the contextual relationships of text, the language of the real world and the contextual relationships of that language. Hot words are popular words. A hot word may be a lexical phenomenon; hot words may reflect the issues, topics, things, and so on that some people pay relatively more attention to in a period of time.

In one possible example, the above training data may include, for example, input speech (the input speech may, for example, come from a user, or may be speech acquired by other devices).

In another possible example, the above training data may include, for example, a feature vector of the input speech (such as a phonetic feature, where the phonetic feature may, for example, reflect the phonetic transcription of the input speech). The feature vector of the input speech may be obtained by performing feature extraction on the input speech.

In another possible example, the above training data may include, for example, the target text corresponding to the input speech.

In yet another possible example, the above training data may include, for example, text features of the target text corresponding to the input speech. The target text may be obtained after feature preprocessing of the input speech. The text features of the target text may be obtained by performing feature extraction on the target text.

For example, suppose the pronunciation of the input speech is "nǐhǎo"; the target text corresponding to "nǐhǎo" may be "你好". Feature extraction on "nǐhǎo" may yield the phonetic features of the input speech. Feature preprocessing and feature extraction on "nǐhǎo" may yield the text features of the target text "你好".

It should be understood that the input speech may be sent by the client device 240 to the data collection device 260, may be read by the data collection device 260 from a storage apparatus, or may be obtained by real-time collection.

Optionally, the data collection device 260 may determine training data from the above phonetic features and/or text features.

The above feature preprocessing of the input speech may include normalization, phoneme-to-character conversion, prosodic pause prediction, and other processing. Normalization may refer to converting numbers, symbols, and other non-Chinese characters in text into Chinese characters according to their semantics. Phoneme-to-character conversion may refer to predicting the corresponding pinyin for each speech segment and then generating the Chinese character text sequence of each speech segment. Prosodic pause prediction may refer to predicting stress marks, prosodic phrase marks, intonation phrase marks, and the like.

Feature preprocessing may be performed by the data collection device 260, or by the client device 240 or other devices. When the data collection device 260 obtains the input speech, the data collection device 260 may perform feature preprocessing and feature extraction on the input speech to obtain the text features of the target text. Alternatively, when the client device 240 obtains the input speech, the client device 240 may perform feature preprocessing on the input speech to obtain the target text, and the data collection device 260 may perform feature extraction on the target text.

The feature extraction of input speech is introduced below, taking Chinese as an example.

For example, if the pronunciation of the input speech is "nǐmenhǎo", the following phonetic feature can be generated:

S_n_i_3_SP0_m_en_0_SP1_h_ao_3_E

In the phonetic feature of "nǐmenhǎo", "S" may be a sentence-start mark, which may also be understood as a begin mark; "E" may be a sentence-end mark, which may also be understood as an end mark; the digits "0", "1", "2", "3", "4" may be tone marks; "SP0" and "SP1" may be marks of different pause levels; the initials and finals of Chinese pinyin may serve as phonemes; different phonemes/marks may be separated by "_". In this example, there may be 13 phonetic feature elements in the phonetic feature.

As another example, if the target text corresponding to another piece of input speech is "大家好", the following text feature can be generated:

S_d_a_4_SP0_j_ia_1_SP1_h_ao_3_E

In the text feature of "大家好", "S" may be a sentence-start mark, which may also be understood as a begin mark; "E" may be a sentence-end mark, which may also be understood as an end mark; the digits "0", "1", "3", "4" may be tone marks; "SP0" and "SP1" may be marks of different pause levels; the initials and finals of Chinese pinyin may serve as phonemes; different phonemes/marks may be separated by "_". In this example, there may be 13 text feature elements in the text feature. A small parsing sketch follows.
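Since the feature strings above are underscore-separated sequences of markers, phonemes, tone digits, and pause levels, a tiny parser recovers the 13 elements mentioned in the text; this parser is our own illustration, not part of the described system.

    # Parse an underscore-separated feature string into its elements.
    def parse_feature_string(s: str) -> list:
        return s.split("_")

    elements = parse_feature_string("S_d_a_4_SP0_j_ia_1_SP1_h_ao_3_E")
    print(len(elements))  # 13
    print(elements)       # ['S', 'd', 'a', '4', 'SP0', 'j', 'ia', '1', 'SP1', 'h', 'ao', '3', 'E']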
It should be noted that the embodiments of the present application do not limit the language type; besides the above Chinese examples, other languages such as English, German, and Japanese are also possible. The embodiments of the present application are mainly described using Chinese as an example.

The process by which the training device 220 obtains the target model/rule 201 by training based on language training data is introduced below.

The training device 220 may input the obtained training data into the target model/rule 201. For example, the phonetic feature result output by the target model/rule 201 may be compared with the phonetic feature corresponding to the current input speech, or the text feature result output by the target model/rule 201 may be compared with the text feature corresponding to the current input speech, so as to complete the training of the target model/rule 201.

The target model/rule 201 obtained by training with the training device 220 may be a model built on a neural network, where the neural network may be a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a bidirectional long short-term memory network (BLSTM), a deep convolutional neural network (DCNN), and so on. Further, the target model/rule 201 may be implemented based on a self-attention neural network. The type of the target model/rule 201 may belong, for example, to automatic speech recognition (ASR) models, natural language processing (NLP) models, and the like.

The target model/rule 201 obtained by the above training device 220 may be applied to different systems or devices. In the system architecture 200 shown in FIG. 2, the execution device 210 may be configured with an input/output (I/O) interface 212. Through the I/O interface 212, the execution device 210 can exchange data with devices external to the execution device 210. As shown in FIG. 2, a "user" may input data to the I/O interface 212 through the client device 240. For example, the user may input an intermediate prediction result to the I/O interface 212 through the client device 240, and the client device 240 may send the intermediate prediction result obtained after certain processing to the execution device 210 through the I/O interface 212. The intermediate prediction result may be, for example, the target text corresponding to the input speech.

Optionally, the training device 220 may, for different targets or different tasks, generate corresponding target models/rules 201 based on different training data; the corresponding target models/rules 201 may be used to achieve the above targets or complete the above tasks, thereby providing the user with the desired results.

The execution device 210 may call data, code, and the like in the data storage system 250, and may also store data, instructions, and the like in the data storage system 250.

Optionally, the execution device 210 may also split the target model/rule 201 obtained by the training device 220 into sub-models/sub-rules, and deploy the obtained sub-models/sub-rules on the client device 240 and the execution device 210 respectively. In one example, the execution device 210 may send the personalized sub-model of the target model/rule 201 to the client device 240, which deploys it within the device. Optionally, the generic sub-model of the target model/rule 201 has no parameters updated during training and is therefore left unchanged.

For example, the training device 220 may obtain training data through the database 230. The training device 220 may train on the training data to obtain a speech model. The training device 220 may send the trained speech model to the execution device 210, which divides the speech model to obtain a personalized speech sub-model and a generic speech sub-model. Alternatively, the training device 220 may first divide the trained speech model to obtain a personalized speech sub-model and a generic speech sub-model, and send the personalized speech sub-model and the generic speech sub-model to the execution device 210.

Optionally, the target model/rule 201 may be obtained by training on the basis of a base speech model. During training, one part of the target model/rule 201 may be updated while another part is not. The updated part of the target model/rule 201 may correspond to the personalized speech sub-model; the non-updated part may correspond to the generic speech sub-model. The base speech model may be pre-trained by the training device 220 using the speech, corpora, and the like of many people, or may be an existing speech model.

The client device 240 and the computing module 211 may work together. The client device 240 and the computing module 211 may process, according to the above personalized speech sub-model and generic speech sub-model, the data input to the client device 240 and/or the data input to the execution device 210 (for example, the intermediate prediction result from the client device 240). In one example, the client device 240 may process the input user speech to obtain the phonetic features or text features corresponding to that user speech; then, the client device 240 may input the phonetic features or text features to the computing module 211. In other examples, the preprocessing module 213 of the execution device 210 may receive the input speech through the I/O interface 212, and perform feature preprocessing and feature extraction on the input speech to obtain the text features of the target text. The preprocessing module 213 may input the text features of the target text to the computing module 211. The computing module 211 may input the phonetic features or text features into the target model/rule 201, so as to obtain the output result of speech recognition (for example, a semantic recognition result, an operation corresponding to the voice instruction, and the like). The computing module 211 may input the output result to the client device 240, so that the client device 240 can perform a corresponding operation in response to the user's voice instruction.

The I/O interface 212 may send input data to the corresponding module of the execution device 210, and may also return the output result to the client device 240 to provide it to the user. For example, the I/O interface 212 may send the intermediate prediction result corresponding to the input speech to the computing module 211, and may also return the result obtained after recognizing the speech to the client device 240.

In the system architecture 200 shown in FIG. 2, the user may input speech, corpora, and other data into the client device 240, and may view the results output by the execution device 210 on the client device 240; the specific presentation form may be sound, or a combination of sound and display, and so on. The client device 240 may also serve as a data collection end, storing the collected speech, corpora, and other data into the database 230. Of course, collection need not go through the client device 240; other devices may store the user's speech, corpora, and other data, together with the output results of the I/O interface 212, into the database 230 as new sample data.

In the system architecture 200 shown in FIG. 2, depending on the data processing capability of the client device 240, the execution device 210 and the data storage system 250 may be integrated in different devices. For example, when the data processing capability of the client device 240 is strong, the execution device 210 and the data storage system 250 may be integrated in the client device 240; when the data processing capability of the client device 240 is not very strong, the execution device 210 and the data storage system 250 may be integrated in a dedicated data processing device. The database 230, the training device 220, and the data collection device 260 in FIG. 2 may be integrated in a dedicated data processing device, may be arranged on the cloud or on other servers on the network, or may be arranged separately in the client device 240 and the data processing device.

It is worth noting that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application; the positional relationships among the devices, components, modules, and so on shown in FIG. 2 do not constitute any limitation. For example, in FIG. 2, the data storage system 250 is external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210. As another example, in some possible examples, the execution device 210 may be placed in the client device 240. The generic speech sub-model of the target model/rule 201 may be the factory speech model of the client device 240. After the client device 240 leaves the factory, the personalized speech sub-model of the target model/rule 201 may be updated according to the data collected by the client device 240.
To better understand the solutions of the embodiments of the present application, some voice interaction systems are first introduced below with reference to FIG. 3 and FIG. 4. FIG. 3 and FIG. 4 may be schematic diagrams of two voice interaction systems.

The voice interaction system 300 shown in FIG. 3 may include at least one first apparatus. The first apparatus may be any of various voice interaction terminals.

In the voice interaction system 300 shown in FIG. 3, the first apparatus may obtain the user's voice instruction through an interaction interface. The first apparatus may recognize the voice instruction and obtain the recognition result. The first apparatus may perform a corresponding operation according to the recognition result, in response to the user's voice instruction. Optionally, the first apparatus may also perform machine learning, deep learning, model training, and other related processing through a memory storing data and a processor for data processing.

The voice interaction system 300 shown in FIG. 3 may have a certain correspondence with the voice interaction system 100 shown in FIG. 1. In the voice interaction system 300 shown in FIG. 3, the first apparatus may correspond, for example, to the local device 101 or the local device 102 shown in FIG. 1.

The voice interaction system 300 shown in FIG. 3 may have a certain correspondence with the system architecture 200 shown in FIG. 2. In the voice interaction system 300 shown in FIG. 3, the first apparatus may correspond, for example, to the client device 240 shown in FIG. 2. Optionally, the first apparatus shown in FIG. 3 may have some or all of the functions of the execution device 210 shown in FIG. 2.

In the voice interaction system 300 shown in FIG. 3, speech recognition may mainly depend on the first apparatus. When the speech recognition capability, natural language understanding capability, and the like of the first apparatus are strong, the first apparatus can usually respond to the user's voice instruction quickly. This means that the voice interaction system 300 shown in FIG. 3 places relatively high requirements on the processing capability of the first apparatus. If the speech recognition capability, natural language understanding capability, and the like of the first apparatus are relatively weak, the first apparatus cannot respond to the user's voice instruction accurately, which may reduce the user's sense of voice interaction experience.

The voice interaction system 400 shown in FIG. 4 may include at least one first apparatus and at least one second apparatus. The first apparatus may be any of various voice interaction terminals. The second apparatus may be a cloud device with speech recognition capability, natural language understanding capability, and the like, such as a server.

In the voice interaction system 400 shown in FIG. 4, the first apparatus may receive or obtain the user's speech, which may contain the user's voice instruction. The first apparatus may forward the user's speech to the second apparatus. The first apparatus may obtain from the second apparatus the result obtained after recognizing the speech. The first apparatus may perform an operation according to the result obtained after recognizing the speech, in response to the user's voice instruction.

In the voice interaction system 400 shown in FIG. 4, the second apparatus may obtain the speech from the first apparatus through an interaction interface and perform speech recognition on the speech. The second apparatus may also forward the result obtained after recognizing the speech to the first apparatus. The memory shown in FIG. 4 may be a general term that includes local storage and databases storing historical data. The database in FIG. 4 may be on the second apparatus or on other apparatuses.

Optionally, both the first apparatus and the second apparatus may perform machine learning, deep learning, model training, speech recognition, and other related processing through a memory storing data and a processor for data processing.

The voice interaction system 400 shown in FIG. 4 may have a certain correspondence with the voice interaction system 100 shown in FIG. 1. In the voice interaction system 400 shown in FIG. 4, the first apparatus may correspond, for example, to the local device 101 or the local device 102 shown in FIG. 1; the second apparatus may correspond, for example, to the execution device 110 shown in FIG. 1.

The voice interaction system 400 shown in FIG. 4 may have a certain correspondence with the system architecture 200 shown in FIG. 2. In the voice interaction system 400 shown in FIG. 4, the first apparatus may correspond, for example, to the client device 240 shown in FIG. 2; the second apparatus may correspond, for example, to the execution device 210 shown in FIG. 2.

In the voice interaction system 400 shown in FIG. 4, speech recognition may mainly depend on the second apparatus. The first apparatus may leave the speech unprocessed, or perform only simple preprocessing. The voice interaction system 400 shown in FIG. 4 helps lower the requirements on the processing capability of the first apparatus. However, the interaction between the second apparatus and the first apparatus may incur delay. This may be unfavorable for the first apparatus to respond quickly to the user's voice instruction, which in turn may reduce the user's sense of voice interaction experience.
FIG. 5 is a voice interaction system 500 provided by an embodiment of the present application. The voice interaction system 500 may include a first apparatus and a second apparatus.

The first apparatus may be any of various voice interaction devices, such as a vehicle, a head unit, an on-board computer (on-board PC), a chip (such as an on-board chip or a voice processing chip), a processor, a mobile phone, a personal computer, a smart bracelet, a tablet, a smart camera, a set-top box, a game console, a voice-enabled in-vehicle device, a smart car, a media consumption device, a smart home device, an intelligent voice assistant on a wearable device capable of producing sound, a smart speaker, or any of various machines or devices that can hold a dialogue with a person. The first apparatus may be, for example, the local device 101 or the local device 102 shown in FIG. 1, or a unit or module of the local device 101 or the local device 102. The first apparatus may also be, for example, the client device 240 shown in FIG. 2, or a unit or module of the client device 240.

Optionally, the first apparatus may include, for example, a semantic recognition module, an operation decision module, and a transceiver module.

The semantic recognition module may be used to recognize voice information from the voice sensor. The voice information may be, for example, audio directly acquired by the voice sensor, or a processed signal carrying the content of that audio. The semantic recognition module may output the semantic recognition result obtained after semantic recognition. The semantic recognition result may be structured information that can reflect the content of the user's speech. The semantic recognition module may store, for example, voice interaction models such as an ASR model and an NLP model. The semantic recognition module may perform semantic recognition on the voice information through the voice interaction models.

The first transceiver module may be used to forward the voice information from the voice sensor to the second apparatus, and to obtain from the second apparatus a voice analysis result indicated by the second apparatus; the voice analysis result may include, for example, a semantic recognition result and/or operation information.

The operation decision module may be used to determine the operation that responds to the user. For example, the operation decision module may obtain the semantic recognition result from the semantic recognition module; the operation decision module may determine, according to the semantic recognition result, to perform the corresponding operation. As another example, the operation decision module may obtain from the transceiver module the voice analysis result indicated by the second apparatus, and then determine to perform the operation indicated by the second apparatus. As another example, the operation decision module may determine whether the operation responding to the voice information is indicated by the semantic recognition module or by the second apparatus.

The second apparatus may be independent of the first apparatus. The second apparatus may be, for example, a remote service platform or a server side. The second apparatus may be implemented by one or more servers. The servers may include, for example, one or more of a network server, an application server, a management server, a cloud server, an edge server, a virtual server (a server virtualized from multiple physical resources), and other servers. The second apparatus may also be another device with speech recognition capability, natural language understanding capability, and the like, such as a head unit, an on-board computer, a chip (such as an on-board chip or a voice processing chip), a processor, a mobile phone, a personal computer, a tablet, or another device with data processing functions. The second apparatus may be, for example, the execution device 110 shown in FIG. 1, or a unit or module of the execution device 110. The second apparatus may also be, for example, the execution device 210 shown in FIG. 2, or a unit or module of the execution device 210.

Optionally, the second apparatus may include, for example, a second transceiver module and a voice analysis module.

The second transceiver module may be used to obtain the voice information from the first apparatus. The second transceiver module may also be used to send the voice analysis result output by the voice analysis module to the first apparatus.

The voice analysis module may be used to obtain, from the second transceiver module, the voice information from the first apparatus, and may perform voice analysis on that voice information. The voice analysis module may store, for example, voice interaction models such as ASR and NLP models. The voice analysis module may perform voice processing on the voice information through the voice interaction models.

In one example, the voice analysis module may output the semantic recognition result obtained after semantic recognition.

In another example, the voice analysis module may analyze the semantic recognition result to obtain operation information corresponding to the voice information. The operation information may be used to indicate the operation to be performed by the first apparatus. The operation information may take the form of, for example, an instruction.
FIG. 6 is a schematic flowchart of a voice interaction method 600 provided by an embodiment of the present application. The method 600 shown in FIG. 6 may be applied, for example, to the voice interaction system 500 shown in FIG. 5.

601: the first apparatus performs a voice wake-up operation in response to a user's input operation.

For example, the user may speak a wake-up word to the first apparatus. The first apparatus may detect the wake-up word input by the user, and then wake up the voice interaction function of the first apparatus.

As another example, the user may press a wake-up button to input an operation to the first apparatus. The first apparatus may detect the user's input operation on the wake-up button, and then wake up the voice interaction function of the first apparatus.

601 may be an optional step. Optionally, the first apparatus may include a wake-up module, and the wake-up module may be used to detect the user's input operation.

602: the first apparatus acquires first voice information from a voice sensor.

The voice sensor may be used to record the user's speech. The voice sensor may be an apparatus with a recording function. The voice sensor may include, for example, a microphone.

Optionally, the voice sensor may also be an apparatus with a recording function and data processing capability. The voice sensor may, for example, also perform related voice processing on the user's speech, such as noise reduction, amplification, and coding/modulation.

In other examples, the above voice processing may also be performed by other modules.

The first voice information may be, for example, audio directly acquired by the voice sensor, or a processed signal carrying the content of that audio.

In one example, the user speaks a first voice instruction, and the first voice instruction is used to instruct the first apparatus to perform a target operation. Through the voice sensor, the first voice instruction may be converted into the first voice information, so that the first voice information may be used to indicate the target operation.

603a: the first apparatus determines, according to the first voice information, to perform the target operation indicated by the first voice information.

By analyzing the first voice information, the first apparatus can perceive or learn the specific meaning of the first voice instruction, and can then determine the target operation indicated by the first voice information. Optionally, the first apparatus may perform the target operation in response to the user's first voice instruction.

Optionally, the determining, by the first apparatus according to the first voice information, to perform the target operation indicated by the first voice information includes: determining, by the first apparatus, a first semantic recognition result according to the first voice information; and determining, by the first apparatus according to the first semantic recognition result, to perform a first operation. The first operation may be an operation determined by the first apparatus according to the first voice information. The first operation may correspond to the above target operation.

The process of converting the first voice information into the first semantic recognition result may include, for example: converting the first voice information containing the audio content into first text information; performing semantic extraction on the first text information to obtain the structured first semantic recognition result. The first semantic recognition result may include one or more of the following structured information: function, intent, parameter. The function of the first semantic recognition result may be the specific value of the function result. The function of the first semantic recognition result may represent a class of strategies. The function of the first semantic recognition result may also be called a domain. The intent may be the specific value of the intent result. The parameter may be the specific value of the slot result.

The following example sets out one possible method by which the first apparatus determines the first operation according to the first semantic information.

The user may speak a first voice instruction to the first apparatus. The first voice instruction may be converted into first text information, and the first text information may be, for example, "open device A". Device A may be, for example, an air conditioner. The natural meaning of the first text information may indicate the first operation. However, from the first text information alone, the first apparatus usually cannot directly determine the natural meaning of the first voice instruction, nor the operation indicated by the first voice instruction.

The first text information may be converted into the first semantic recognition result, and the first semantic recognition result may indicate the first operation.

In one example, the first semantic recognition result may include a first function, a first intent, and a first parameter; the first function may be, for example, "vehicle control function"; the first intent may be "open"; the first parameter may be "device A". According to the first semantic recognition result, the first apparatus may determine that: the first operation may be an operation within the vehicle control function; the intent of the first operation may be to open or turn on; the object to be opened or turned on may be device A.

In one example, the first semantic recognition result may include a first function, a first intent, and a first parameter; the first function may be, for example, "vehicle control function"; the first intent may be "open device A"; the first parameter may be "empty". According to the first semantic recognition result, the first apparatus may determine that: the first operation may be an operation within the vehicle control function; the intent of the first operation may be to open device A, or to put device A in the on state; the first semantic recognition result may contain no specific parameters of the on state of device A (such as a temperature parameter, a timing parameter, or a mode parameter).

In one example, the first semantic recognition result may include a first function and a first parameter; the first function may be, for example, "vehicle control function"; the first parameter may be "device A". According to the first semantic recognition result, the first apparatus may determine that: the first operation may be an operation within the vehicle control function; the object of the first operation may be device A.

In one example, the first semantic recognition result may include a first intent and a first parameter; the first intent may be "open"; the first parameter may be "device A". According to the first semantic recognition result, the first apparatus may determine that: the intent of the first operation may be to open or turn on; the object to be opened or turned on may be device A.

In one example, the first semantic recognition result may include a first intent; the first intent may be "open device A". According to the first semantic recognition result, the first apparatus may determine that: the intent of the first operation may be to open device A, or to put device A in the on state; the first semantic recognition result may contain no specific parameters of the on state of device A (such as a temperature parameter, a timing parameter, or a mode parameter).

In one example, the first semantic recognition result may include a first parameter; the first parameter may be "device A". According to the first semantic recognition result, the first apparatus may determine that: the object of the first operation may be device A. A small sketch of this structured result follows.
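The structured semantic recognition result discussed above, in which a result may carry any subset of function / intent / parameter, can be modeled as follows; this is a hedged Python sketch and the field names are ours.

    # Sketch of the structured semantic recognition result.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SemanticResult:
        function: Optional[str] = None   # e.g. "vehicle control"
        intent: Optional[str] = None     # e.g. "open"
        parameter: Optional[str] = None  # e.g. "device A"

    # The variants walked through above for "open device A":
    full = SemanticResult("vehicle control", "open", "device A")
    no_param = SemanticResult("vehicle control", "open device A")
    intent_only = SemanticResult(intent="open device A")
    print(full, no_param, intent_only, sep="\n")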
可选的,所述第一装置根据所述第一语音信息,确定执行所述第一语音信息指示的目标操作,包括:所述第一装置根据所述第一语音信息,确定第一语义识别结果;所述第一装置根据所述第一语义识别结果和第一预设条件,确定执行第一操作由所述第一装置根据所述第一语义识别结果确定的第一操作。
也就是说,第一装置可以判断第一预设条件是否满足,进而根据判断结果,确定是否执行由所述第一装置根据所述第一语义识别结果确定的第一操作。例如,在第一语义识别结果满足第一预设条件的情况下,第一装置可以执行603a。
603b,所述第一装置向第二装置发送所述第一语音信息。
例如,第一装置为手机,第二装置为服务器。也就是说,手机可以向服务器发送第一语音信息。
又如,第一装置为手机,第二装置为手机。也就是说,手机可以向手机发送第一语音信息。
一种可能的方式是,第一装置通过其他装置向第二装置发送第一语音信息。其他装置例如可以是手机等。
在一个示例中,603b与603a可以被同步执行或在一定时间内先后执行。本申请实施例可以不限定603b与603a的执行顺序。可选的,在上述第一语义识别结果满足上述第一预设条件的情况下,第一装置还可以执行603b。
第一装置与第二装置均可以对第一语音信息进行语义识别。
可选的,第二装置可以根据第一语音信息,向第一装置发送第六语义识别结果。
也就是说,第二装置可以针对第一语音信息,向第一装置反馈识别结果。
可选的,基于第一语音信息,第一装置与第二装置可以得到相同或不同的语义识别结果。例如,第一装置可以根据第一语音信息,确定第一语义识别结果,第一语义识别结果可以用于指示第一操作;第二装置可以根据第一语音信息,确定第六语义识别结果,第六语义识别结果可以用于指示第二操作;第一操作、第二操作中的至少一个可以对应上述目标操作。
有关603a的内容阐述了第一装置根据第一语音信息确定第一语义识别结果的一些可能的示例,第二装置根据第一语音信息确定第六语义识别结果的方式可以参照603a,在此就不再详细赘述。
可选的,除了通过语义识别结果,第二装置还可以通过目标操作信息向第一装置指示目标操作。目标操作信息例如可以表现为操作信令。
在一种可能的情况下,第一装置可以根据第一装置自身分析出的第一语义识别结果,确定响应第一语音指令的第一操作。可选的,第一装置可以从第二装置获取第六语义识别结果和/或目标操作信息。第六语义识别结果和/或目标操作信息可以由第二装置根据第一语音信息确定。第六语义识别结果可以用于反映第一语音信息的含义,并用于间接指示第一装置执行第二操作;目标操作信息可以用于直接指示第一装置执行第二操作。由于第二装置反馈的第六语义识别结果和/或目标操作信息可能存在时延,第一装置可以在执行第一操作后,获取第六语义识别结果和/或目标操作信息。第一装置可以根据第六语义识别结果和/或目标操作信息,调整自身的语义识别模型、语音控制模型等,以有利于提高第一装置输出语义识别结果的准确率、优化响应用户语音指令的操作的适用性等。
可选的,第一装置可以丢弃来自所述第二装置的第六语义识别结果。
第一装置可以接收第六语义识别结果后丢弃第六语义识别结果。或者,第一装置可以不接收第六语义识别结果。也就是说,如果第一装置可以根据第一语音信息确定要执行的操作,第一装置可以忽略第二装置针对第一语音信息的反馈。
在另一种可能的情况下,在第一装置根据第一语音信息确定要执行的目标操作之前,第一装置可能不确定自身是否具有识别语音信息的能力。换句话说,第一装置可能无法根据第一语音信息,确定第一语音信息指示的目标操作。第一装置可以预先将第一语音信息发送给第二装置。第二装置可以对第一语音信息进行识别和处理,得到第六语义识别结果,和/或目标操作信息;第六语义识别结果可以用于反映第一语音信息的含义,并用于间接指示第一装置执行目标操作;目标操作信息可以用于直接指示第一装置执行目标操作。如果第一装置可以根据第一语音信息,确定目标操作,第一装置可以跳过或丢弃第二装置根据第一语音信息反馈的结果。如果第一装置无法根据第一语音信息,确定目标操作,由于第一装置预先向第二装置发送了第一语音信息,第一装置可以相对更快地从第二装置获取第六语义识别结果和/或目标操作信息。因此有利于缩短第一装置响应用户指令的时间。
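这一“预先发送、择优采用”的流程可以用下面的示意性代码概括;其中local_recognize、remote_recognize等函数名均为假设,仅用于说明时序关系,并非限定性的实现。

```python
import concurrent.futures

def handle_voice(voice_info, local_recognize, remote_recognize):
    """示意:603b预先向第二装置发送语音信息,603a同时在本地识别。"""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # 603b:预先把第一语音信息发给第二装置(云侧)
        remote_future = pool.submit(remote_recognize, voice_info)
        # 603a:本地(端侧)识别第一语音信息
        local_result = local_recognize(voice_info)
        if local_result is not None:
            # 本地可确定目标操作:跳过/丢弃第二装置反馈的结果
            remote_future.cancel()  # 尝试取消;若云侧已在执行,其结果将被忽略
            return local_result
        # 本地无法确定目标操作:由于已预先发送,可更快取得云侧结果
        return remote_future.result()
```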
在另一个示例中,603b与603a中有且仅有一个被第一装置执行。
也就是说,第一装置可以确定下面有且仅有一项被执行:第一装置根据第一语音信息,确定待执行的第一操作;第一装置根据第二装置反馈的结果,确定待执行的第二操作。
例如,在第一装置相对擅长的语音交互场景中,第一装置可以在不借助第二装置提供的信息的情况下,确定用户语音指令对应的操作。由此有利于提高第一装置响应用户的语音指令的效率。另外,第一装置可以选择不向第二装置发送第一语音信息。由此有利于减少第一装置的信令传输数量。在第一装置相对不擅长的语音交互场景中,第一装置可以根据第二装置提供的信息,确定用户语音指令对应的操作。由此有利于提高第一装置响应用户的语音指令的准确性。
可选的,所述第一装置根据所述第一语音信息,确定第一语义识别结果;所述第一装置根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作。
第一装置可以判断第一预设条件是否满足,进而根据判断结果,确定是否执行由第二装置指示的第二操作。例如,在第一语义识别结果不满足第一预设条件的情况下,第一装置可以执行603b,不执行603a;在第一语义识别结果满足第一预设条件的情况下,第一装置可以执行603a、不执行603b。
下面通过一些示例,阐述第一装置如何根据所述第一语义识别结果和第一预设条件,确定是执行由第一装置根据第一语义识别结果确定的第一操作,还是执行由第二装置指示的第二操作。
可选的,在所述第一语义识别结果满足所述第一预设条件的情况下,第一装置可以确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作。
在第一语义识别结果满足第一预设条件的情况下,第一装置可以根据第一语义识别结果,确定执行相应的操作,使得第一装置可以响应用户的语音指令。
可选的,在所述第一语义识别结果不满足所述第一预设条件的情况下,确定执行由第二装置指示的第二操作。
在第一语义识别结果不满足第一预设条件的情况下,第一装置可以根据第二装置的指示确定要执行的操作,使得第一装置可以响应用户的语音指令。在一个可能的示例中,第二装置可以通过语义识别结果和/或操作信息向第一装置指示第二操作。
可选的,所述根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作,包括:在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;根据所述第六语义识别结果,确定执行所述第二操作。
第二装置可以对第一语音信息进行识别,得到第六语义识别结果。第二装置可以通过第六语义识别结果,向第一装置间接指示第一装置要执行的第二操作,以使得第一装置可以响应用户的语音指令。也就是说,第一装置可以根据第六语义识别结果,确定第一装置要执行的操作为第二操作。
可选的,第一语义识别结果满足第一预设条件,是指第一语义识别结果所包括(所对应或所指示)的功能是第一装置预设具有优先处理级别的功能。第一装置可以预设多个功能,当第一语义识别结果所包括(所对应或所指示)的第一功能属于该多个功能时,则第一语义识别结果满足第一预设条件。以预设的多个功能为第一语义识别列表为例进行描述,其它预设多个功能的形式与之类似,所述方法还包括:第一装置获取第一语义识别列表,所述第一语义识别列表包括多个功能;所述第一语义识别结果满足第一预设条件,包括:所述第一语义识别结果包括第一功能,所述第一功能属于所述多个功能。
可选的,第一语义识别结果不满足第一预设条件,是指第一语义识别结果所包括(所对应或所指示)的功能不是第一装置预设具有优先处理级别的功能。第一装置可以预设多个功能,当第一语义识别结果所包括(所对应或所指示)的第一功能不属于该多个功能时,则第一语义识别结果不满足第一预设条件。以预设的多个功能为第一语义识别列表为例进行描述,其它预设多个功能的形式与之类似,所述方法还包括:第一装置获取第一语义识别列表,所述第一语义识别列表包括多个功能;所述第一语义识别结果不满足第一预设条件,包括:所述第一语义识别结果包括第一功能,所述第一功能不属于所述多个功能。
也就是说,第一装置可以对第一语音信息进行语义识别,得到第一语义识别结果,第一语义识别结果可以包括第一功能;第一装置可以在第一语义识别列表中查找第一功能。在一个示例中,如果第一语义识别列表包括第一功能,第一装置可以判断第一语义识别结果满足第一预设条件,进而第一装置可以确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作。在另一个示例中,如果第一语义识别列表不包括第一功能,第一装置可以判断第一语义识别结果不满足第一预设条件,进而第一装置可以确定执行由所述第二装置指示的第二操作。
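上述查找逻辑可以用如下示意性代码表达;列表内容与返回值均为说明用途的假设。

```python
SEMANTIC_LIST_1 = {"车控功能", "本地翻译功能"}  # 第一语义识别列表(内容为假设)

def decide_by_function(function: str) -> str:
    """根据第一功能是否属于第一语义识别列表,选择执行主体。"""
    if function in SEMANTIC_LIST_1:
        return "第一操作(由第一装置根据第一语义识别结果确定)"
    return "第二操作(由第二装置指示)"

print(decide_by_function("车控功能"))    # 满足第一预设条件
print(decide_by_function("云翻译功能"))  # 不满足第一预设条件
```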
第一语义识别列表例如可以是第一装置预先存储的列表。第一语义识别列表例如可以包括第一装置支持的多个功能。第一装置可以对第一语义识别列表中的多个功能具有相对更高的语义识别能力。第一装置可以对第一语义识别列表以外的其他功能具有相对更低的语义识别能力。如果第一功能包含于第一语义识别列表,可能意味着第一语义识别结果的准确性相对较高。如果第一功能未包含于第一语义识别列表,可能意味着第一语义识别结果的准确性相对较低。
例如,第一语义识别列表中的多个功能可以包括“车控功能”。也就是说,针对“车控功能”,第一装置的语义识别能力可以相对较好。第一文字信息例如可以是“打开设备A”。第一文字信息可以指示“车控功能”。第一语义识别结果可以包括第一功能,第一功能可以为“车控功能”。第一装置可以根据第一语义识别列表、第一语义识别结果,判断第一语义识别结果满足第一预设条件,并确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第一语义识别列表中的多个功能可以包括“本地翻译功能”,但不包括“云翻译功能”。也就是说,针对“本地翻译功能”,第一装置的语义识别能力可以相对较好;针对“云翻译功能”,第一装置的语义识别能力可以相对较差(例如第一装置可能无法及时获知当前的热词、无法实现全部外语翻译等)。第一文字信息例如可以是“用外语A翻译以下内容”,其中“外语A”可以不属于第一装置可翻译的外语。第一语义识别结果可以包括第一功能,第一功能可以为“云翻译功能”。第一装置可以根据第一语义识别列表、第一语义识别结果,判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
可选的,第一语义识别结果满足第一预设条件,是指第一语义识别结果所包括(所对应或所指示)的意图是第一装置预设具有优先处理级别的意图。第一装置可以预设多个意图,当第一语义识别结果所包括(所对应或所指示)的第一意图属于该多个意图时,则第一语义识别结果满足第一预设条件。以预设的多个意图为第二语义识别列表为例进行描述,其它预设多个意图的形式与之类似,所述方法还包括:第一装置获取第二语义识别列表,所述第二语义识别列表包括多个意图;所述第一语义识别结果满足第一预设条件,包括:所述第一语义识别结果包括第一意图,所述第一意图属于所述多个意图。
可选的,第一语义识别结果不满足第一预设条件,是指第一语义识别结果所包括(所对应或所指示)的意图不是第一装置预设具有优先处理级别的意图。第一装置可以预设多个意图,当第一语义识别结果所包括(所对应或所指示)的第一意图不属于该多个意图时,则第一语义识别结果不满足第一预设条件。以预设的多个意图为第二语义识别列表为例进行描述,其它预设多个意图的形式与之类似,所述方法还包括:第一装置获取第二语义识别列表,所述第二语义识别列表包括多个意图;所述第一语义识别结果不满足第一预设条件,包括:所述第一语义识别结果包括第一意图,所述第一意图不属于所述多个意图。
第一装置可以对第一语音信息进行语义识别,得到第一语义识别结果,第一语义识别结果可以包括第一意图;第一装置可以在第二语义识别列表中查找第一意图。在一个示例中,如果第二语义识别列表包括第一意图,则第一装置可以判断第一语义识别结果满足第一预设条件,进而第一装置可以确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作。在另一个示例中,如果第二语义识别列表不包括第一意图,第一装置可以判断第一语义识别结果不满足第一预设条件,进而第一装置可以确定执行由所述第二装置指示的第二操作。
第二语义识别列表例如可以是第一装置预先存储的列表。第二语义识别列表例如可以包括第一装置支持的多个意图。第一装置可以对第二语义识别列表中的多个意图具有相对更高的语义识别能力。第一装置可以对第二语义识别列表以外的其他意图具有相对更低的语义识别能力。如果第一意图包含于第二语义识别列表,可以意味着第一语义识别结果的准确性相对较高。如果第一意图未包含于第二语义识别列表,可以意味着第一语义识别结果的准确性相对较低。
例如,第二语义识别列表中的多个意图可以包括“打开设备A”。也就是说,针对“打开设备A”这一意图,第一装置的语义识别能力可以相对较好。第一文字信息例如可以是“打开设备A”。第一文字信息可以指示“打开设备A”。第一语义识别结果可以包括第一意图,第一意图可以为“打开设备A”。第一装置可以根据第二语义识别列表、第一语义识别结果,判断第一语义识别结果满足第一预设条件,并确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第二语义识别列表中的多个意图可以包括“本地音频播放意图”,但不包括“云端音频播放意图”。也就是说,针对“本地音频播放意图”,第一装置的语义识别能力可以相对较好(例如第一装置可以根据本地存储的音频数据,识别出语音指令中指示的音频资源);针对“云端音频播放意图”,第一装置的语义识别能力可以相对较差(例如第一装置可能不支持对云端音频数据的识别能力等)。第一文字信息例如可以是“从歌曲A的1分钟开始播放”,其中“歌曲A”可以属于云端音频数据。第一语义识别结果可以包括第一意图,第一意图可以为“播放歌曲A”。第一装置可以根据第二语义识别列表、第一语义识别结果,判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
可选的,第一语义识别结果满足第一预设条件,是指第一语义识别结果所包括(所对应或所指示)的参数指示第一装置预设具有优先处理级别的功能,意图,场景,设备或位置等信息。第一装置可以预设多个参数,当第一语义识别结果所包括的第一参数属于该多个参数时,则第一语义识别结果满足第一预设条件。以预设的多个参数为第三语义识别列表为例进行描述,其它预设多个参数的形式与之类似,所述方法还包括:第一装置获取第三语义识别列表,所述第三语义识别列表包括多个参数;所述第一语义识别结果满足第一预设条件,包括:所述第一语义识别结果包括第一参数,所述第一参数属于所述多个参数。
可选的,第一语义识别结果不满足第一预设条件,是指第一语义识别结果所包括(所对应或所指示)的参数指示第一装置预设不具有优先处理级别的功能,意图,场景,设备或位置等信息。第一装置可以预设多个参数,当第一语义识别结果所包括的第一参数不属于该多个参数时,则第一语义识别结果不满足第一预设条件。以预设的多个参数为第三语义识别列表为例进行描述,其它预设多个参数的形式与之类似,所述方法还包括:第一装置获取第三语义识别列表,所述第三语义识别列表包括多个参数;所述第一语义识别结果不满足第一预设条件,包括:所述第一语义识别结果包括第一参数,所述第一参数不属于所述多个参数。
第一装置可以对第一语音信息进行语义识别,得到第一语义识别结果,第一语义识别结果可以包括第一参数;第一装置可以在第三语义识别列表中查找第一参数。在一个示例中,如果第三语义识别列表包括第一参数,则第一装置可以判断第一语义识别结果满足第一预设条件,进而第一装置可以确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作。在另一个示例中,如果第三语义识别列表不包括第一参数,则第一装置可以判断第一语义识别结果不满足第一预设条件,进而第一装置可以确定执行由所述第二装置指示的第二操作。
第三语义识别列表例如可以是第一装置预先存储的列表。第三语义识别列表例如可以包括第一装置支持的多个参数。第一装置可以对第三语义识别列表中的多个参数具有相对更高的语义识别能力。第一装置可以对第三语义识别列表以外的其他参数具有相对更低的语义识别能力。如果第一参数包含于第三语义识别列表,可以意味着第一语义识别结果的准确性相对较高。如果第一参数未包含于第三语义识别列表,可以意味着第一语义识别结果的准确性相对较低。
例如,第三语义识别列表中的多个参数可以包括“设备A”。也就是说,针对“设备A”这一参数,第一装置的语义识别能力可以相对较好。第一文字信息例如可以是“打开设备A”。第一文字信息可以指示“设备A”。第一语义识别结果可以包括第一参数,第一参数可以为“设备A”。第一装置可以根据第三语义识别列表、第一语义识别结果,判断第一语义识别结果满足第一预设条件,并确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第三语义识别列表中的多个参数可以不包括“位置A”。也就是说,针对“位置A”这一参数,第一装置的语义识别能力可以相对较差。第一文字信息例如可以是“导航去位置B,途径位置A”。第一语义识别结果可以包括第一参数,第一参数可以为“位置A”。第一装置可以根据第三语义识别列表、第一语义识别结果,判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
以上对第一语义识别结果是否满足第一预设条件的多种实现方式可以独立作为实现手段,也可以结合作为实现手段。例如,可选的,第一语义识别结果满足第一预设条件可以包括以下至少两项:所述第一语义识别结果包括第一功能,所述第一功能属于所述第一语义识别列表;所述第一语义识别结果包括第一意图,所述第一意图属于所述第二语义识别列表;所述第一语义识别结果包括第一参数,所述第一参数属于所述第三语义识别列表。
例如,第一语义识别结果的第一功能为“车控功能”,第一语义识别结果的第一意图为“打开”,第一语义识别结果的第一参数为“设备A”。在“车控功能”属于所述第一语义识别列表,“打开”属于所述第二语义识别列表,且“设备A”属于所述第三语义识别列表的情况下,第一语义识别结果可以满足第一预设条件。
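结合功能、意图、参数三个维度时,判断逻辑可以示意如下;三个列表的内容均为说明用途的假设。

```python
SEMANTIC_LIST_1 = {"车控功能"}  # 多个功能
SEMANTIC_LIST_2 = {"打开"}      # 多个意图
SEMANTIC_LIST_3 = {"设备A"}     # 多个参数

def meets_combined_condition(function: str, intent: str, parameter: str) -> bool:
    """功能、意图、参数均属于各自的预设列表时,满足第一预设条件。"""
    return (function in SEMANTIC_LIST_1
            and intent in SEMANTIC_LIST_2
            and parameter in SEMANTIC_LIST_3)

assert meets_combined_condition("车控功能", "打开", "设备A")
```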
在一个可能的示例中,第一语义识别列表、第二语义识别列表、第三语义识别列表例如可以是三个相互独立的列表。
在另一个可能的示例中,第一语义识别列表、第二语义识别列表、第三语义识别列表中的至少两个可以来自同一列表。例如,第一语义识别列表、第二语义识别列表、第三语义识别列表可以属于一个总表。该总表例如可以包括多个子列表,第一语义识别列表、第二语义识别列表、第三语义识别列表例如可以是该总表的3个子列表。
可选的,所述第一语义识别列表还包括与所述第一功能对应的多个意图,所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果还包括第一意图,所述第一意图属于所述多个意图。
可选的,所述第一语义识别列表还包括与所述第一功能对应的多个意图,所述第一语义识别结果不满足第一预设条件,还包括:所述第一语义识别结果还包括第一意图,所述第一意图不属于所述多个意图。
第一功能对应的意图通常不是无限制的。例如,导航功能例如可以对应路径规划意图、语音包意图等意图;导航功能通常不会对应硬件开启意图或硬件关闭意图等。又如,音频功能例如可以对应播放音频意图、歌词意图等;音频功能通常不会对应路径规划等意图。第一装置可以将多个功能和多个意图之间的对应关系预先记录在第一语义识别列表中。
如果第一语义识别列表指示第一语义识别结果的第一功能和第一意图对应,则第一语义识别结果可以满足第一预设条件。
如果第一语义识别结果的第一功能、第一意图均属于第一语义识别列表,但是在第一语义识别列表中第一意图与第一功能不具有对应关系,则可以判断第一语义识别结果不满足第一预设条件。
如果第一语义识别结果的第一功能、第一意图中的一个或多个不属于第一语义识别列表,则可以判断第一语义识别结果不满足第一预设条件。
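功能与意图的对应关系及上述三条判断规则,可以用如下示意性代码概括;对应关系的具体内容为假设。

```python
# 第一语义识别列表中记录的“功能→意图”对应关系(内容为假设)
FUNCTION_TO_INTENTS = {
    "车控功能": {"打开设备A", "关闭设备A"},
    "音频功能": {"播放歌曲A"},
}

def function_intent_match(function: str, intent: str) -> bool:
    """功能、意图均在列表中且二者对应时,才满足第一预设条件。"""
    intents = FUNCTION_TO_INTENTS.get(function)
    return intents is not None and intent in intents

assert function_intent_match("车控功能", "打开设备A")      # 对应,满足
assert not function_intent_match("导航功能", "播放歌曲A")  # 功能不在列表,不满足
assert not function_intent_match("音频功能", "打开设备A")  # 在列表但不对应,不满足
```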
例如,第一语义识别结果的第一功能为“车控功能”,第一语义识别结果的第一意图为“打开设备A”;在“车控功能”、“打开设备A”均属于所述第一语义识别列表,且在第一语义识别列表中“打开设备A”与“车控功能”对应的情况下,第一语义识别结果可以满足第一预设条件。第一装置可以确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第一语义识别结果的第一功能为“导航功能”,第一语义识别结果的第一意图为“播放歌曲A”。其中,“歌曲A”的标题与“位置A”的地名可以在文字上相同。也就是说,相同的文字可以代表不同的含义。第一语义识别列表的多个功能可以包括“导航功能”;第一语义识别列表的多个意图可以包括“播放歌曲A”。然而,在第一语义识别列表中,“导航功能”与“播放歌曲A”不对应。在第一语义识别列表中,“导航功能”可以对应“途经位置A”、“目的地位置A”、“始发地位置A”等;在第一语义识别列表中,“音频功能”可以对应“播放歌曲A”等。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
又如,第一语义识别结果的第一功能为“车控功能”,第一语义识别结果的第一意图为“打开装置B”。第一语义识别列表的多个功能可以包括“车控功能”;第一语义识别列表的多个意图可以不包括“打开装置B”(例如车内未设置装置B)。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
可选的,所述第一语义识别列表还包括与所述第一功能对应的多个参数,所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果还包括第一参数,所述第一参数属于所述多个参数。
可选的,所述第一语义识别列表还包括与所述第一功能对应的多个参数,所述第一语义识别结果不满足第一预设条件,还包括:所述第一语义识别结果还包括第一参数,所述第一参数不属于所述多个参数。
第一功能对应的参数通常不是无限制的。例如,导航功能例如可以对应位置等参数;导航功能通常不会对应音频播放模式等参数。又如,音频功能例如可以对应歌手、歌曲等参数;音频功能通常不会对应温度等参数。第一装置可以将多个功能和多个参数之间的对应关系预先记录在第一语义识别列表中。
如果第一语义识别列表指示第一语义识别结果的第一功能和第一参数对应,则第一语义识别结果可以满足第一预设条件。
如果第一语义识别结果的第一功能、第一参数均属于第一语义识别列表,但是在第一语义识别列表中第一参数与第一功能不具有对应关系,则可以判断第一语义识别结果不满足第一预设条件。
如果第一语义识别结果的第一功能、第一参数中的一个或多个不属于第一语义识别列表,则可以判断第一语义识别结果不满足第一预设条件。
例如,第一语义识别结果的第一功能为“温控功能”,第一语义识别结果的第一参数为“28℃”;在“温控功能”、“28℃”均属于所述第一语义识别列表,且在第一语义识别列表中,“温控功能”与“28℃”对应的情况下,第一语义识别结果可以满足第一预设条件。第一装置可以确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第一语义识别结果的第一功能为“音频功能”,第一语义识别结果的第一参数为“高清播放”。第一语义识别列表的多个功能可以包括“音频功能”;第一语义识别列表的多个参数可以包括“高清播放”。然而,在第一语义识别列表中,“音频功能”与“高清播放”不对应。在第一语义识别列表中,“音频功能”可以对应“标准播放”、“高质量播放”、“无损播放”;在第一语义识别列表中,“视频功能”可以与“标清播放”、“高清播放”、“超清播放”、“蓝光播放”等对应。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
又如,第一语义识别结果的第一功能为“播放功能”,第一语义识别结果的第一参数为“歌手A”。第一语义识别列表的多个功能可以包括“播放功能”;第一语义识别列表的多个参数可以不包括“歌手A”(例如第一装置先前未播放过歌手A的歌曲)。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
可选的,所述第一语义识别结果满足第一预设条件,还包括:在所述第一语义识别列表中,所述第一参数与所述第一意图对应。
可选的,所述第一语义识别结果不满足第一预设条件,还包括:在所述第一语义识别列表中,所述第一参数与所述第一意图不对应。
除了第一功能与第一意图可以有对应关系,第一功能与第一参数可以有对应关系,第一意图与第一参数也可以有对应关系。
例如,第一语义识别结果的第一功能为“车控功能”,第一语义识别结果的第一意图为“打开空调”,第一语义识别结果的第一参数为“28℃”;在“车控功能”、“打开空调”、“28℃”均属于所述第一语义识别列表,且在第一语义识别列表中“车控功能”与“打开空调”对应,且“打开空调”与“28℃”对应的情况下,第一语义识别结果可以满足第一预设条件。第一装置可以确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第一语义识别结果的第一功能为“车控功能”,第一语义识别结果的第一意图为“打开空调”,第一语义识别结果的第一参数为“5℃”。第一语义识别列表的多个功能可以包括“车控功能”。第一语义识别列表的多个意图可以包括“打开空调”。第一语义识别列表的多个参数可以包括“5℃”。在第一语义识别列表中,“车控功能”可以与“打开空调”对应;“车控功能”可以与“5℃”对应。然而,在第一语义识别列表中,“打开空调”与“5℃”可以不对应。在第一语义识别列表中,“打开空调”例如可以与17~30℃的温度值对应,“5℃”可能超出了空调的可调温度范围;在第一语义识别列表中,“打开车载冰箱”例如可以与“5℃”对应。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
可选的,所述第二语义识别列表还包括与所述第一意图对应的多个参数,所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果还包括第一参数,所述第一参数属于所述多个参数。
可选的,所述第二语义识别列表还包括与所述第一意图对应的多个参数,所述第一语义识别结果不满足第一预设条件,还包括:所述第一语义识别结果还包括第一参数,所述第一参数不属于所述多个参数。
第一意图对应的参数通常不是无限制的。例如,开启硬件意图例如可以对应硬件标识等参数;开启硬件意图通常不会对应位置等参数。又如,路径规划意图例如可以对应位置等参数;路径规划意图通常不会对应歌曲等参数。第一装置可以将多个意图和多个参数之间的对应关系预先记录在第二语义识别列表中。
如果第二语义识别列表指示第一语义识别结果的第一意图和第一参数对应,则第一语义识别结果可以满足第一预设条件。
如果第一语义识别结果的第一意图、第一参数均属于第二语义识别列表,但是在第二语义识别列表中第一参数与第一意图不具有对应关系,则可以判断第一语义识别结果不满足第一预设条件。
如果第一语义识别结果的第一意图、第一参数中的一个或多个不属于第二语义识别列表,则可以判断第一语义识别结果不满足第一预设条件。
例如,第一语义识别结果的第一意图为“打开设备A”,第一语义识别结果的第一参数为“1小时”;在“打开设备A”、“1小时”均属于所述第二语义识别列表,且在第二语义识别列表中,“打开设备A”与“1小时”对应的情况下,第一语义识别结果可以满足第一预设条件。第一装置可以确定执行由第一装置根据第一语义识别结果确定的第一操作。
又如,第一语义识别结果的第一意图为“播放音频”,第一语义识别结果的第一参数为“照片A”。第二语义识别列表的多个意图可以包括“播放音频”。第二语义识别列表的多个参数可以包括“照片A”。然而,在第二语义识别列表中,“播放音频”与“照片A”可以不对应。在第二语义识别列表中,“播放音频”例如可以与“歌手”、“歌曲”、“歌单”等参数对应;在第二语义识别列表中,“照片A”例如可以与“分享”、“上传”等意图对应。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
又如,第一语义识别结果的第一意图为“播放视频”,第一语义识别结果的第一参数为“演员A”。第二语义识别列表的多个意图可以包括“播放视频”;第二语义识别列表的多个参数可以不包括“演员A”(例如第一装置先前未播放过演员A的影视作品)。由此可以判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
可选的,第一语义识别结果可以指示第一功能、第一意图和第一参数中的至少两种,例如,第一语义识别结果可以指示第一功能和第一参数,则在第一语义识别结果指示的第一功能与第一语义识别结果指示的第一参数对应相同的参数类型时,第一语义识别结果满足第一预设条件;在第一语义识别结果指示的第一功能与第一语义识别结果指示的第一参数对应不同的参数类型时,第一语义识别结果不满足第一预设条件。再如,第一语义识别结果可以指示第一意图和第一参数,则在第一语义识别结果指示的第一意图与第一语义识别结果指示的第一参数对应相同的参数类型时,第一语义识别结果满足第一预设条件;在第一语义识别结果指示的第一意图与第一语义识别结果指示的第一参数对应不同的参数类型时,第一语义识别结果不满足第一预设条件。
在第一语义识别结果指示的第一功能与第一语义识别结果指示的第一参数对应相同的参数类型时,第一语义识别结果满足第一预设条件。在一种实现中,所述第三语义识别列表还用于指示所述第一参数对应第一参数类型,所述方法还包括:获取第一语义识别列表,所述第一语义识别列表包括多个功能,以及与所述多个功能对应的多个参数类型;所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果还包括第一功能,所述第一功能属于所述多个功能,在所述第一语义识别列表中,所述第一功能与所述第一参数类型对应。
在第一语义识别结果指示的第一功能与第一语义识别结果指示的第一参数对应不同的参数类型时,第一语义识别结果不满足第一预设条件。在一种实现中,所述第三语义识别列表还用于指示所述第一参数对应第一参数类型,所述方法还包括:获取第一语义识别列表,所述第一语义识别列表包括多个功能,以及与所述多个功能对应的多个参数类型;所述第一语义识别结果不满足第一预设条件,还包括:所述第一语义识别结果还包括第一功能,所述第一功能属于所述多个功能,在所述第一语义识别列表中,所述第一功能与所述第一参数类型不对应。
第一装置可以事先记录功能和对应的一个或多个参数类型。多个功能和多个参数类型的对应关系(或关联关系)可以被存储在第一语义识别列表中。
例如,车控功能对应的参数类型可以是时间、温度、硬件标识等。
又如,温控功能对应的参数类型可以是温度等。
又如,导航功能对应的参数类型可以是位置、时间等。
又如,音频功能对应的参数类型可以是歌手、歌曲、歌单、时间、音频播放模式等。
又如,视频功能对应的参数类型可以是电影、电视剧、演员、时间、视频播放模式等。
第一装置可以事先存储参数和对应的一个或多个参数类型。多个参数和多个参数类型的对应关系或关联关系可以被存储在第三语义识别列表中。
例如,空调、摄像头、座椅、车窗等对应的参数类型可以是硬件标识。
又如,5℃、28℃等对应的参数类型可以是温度。
又如,1小时、1分钟等对应的参数类型可以是时间。
又如,位置A、位置B等对应的参数类型可以是位置。
又如,歌手A、歌手B等对应的参数类型可以是歌手。
又如,歌曲A、歌曲B等对应的参数类型可以是歌曲。
又如,歌单A、歌单B等对应的参数类型可以是歌单。
又如,标准播放、高质量播放、无损播放等对应的参数类型可以是音频播放模式。
又如,电影A、电影B等对应的参数类型可以是电影。
又如,电视剧A、电视剧B对应的参数类型可以是电视剧。
又如,演员A、演员B等对应的参数类型可以是演员。
又如,标清播放、高清播放、超清播放、蓝光播放等对应的参数类型可以是视频播放模式。
用户可以通过多种多样的语音信息,实现第一装置的某一类功能。对于该类功能而言,语音信息中的槽位通常填入有限类型的参数。一个参数类型对应的参数数量例如可以是任意多个。
例如,用户可以通过语音指令指示第一装置执行导航功能的相关操作。该语音指令可以被转换为与导航功能相关的语义识别结果。与该语义识别结果的槽位对应的参数类型例如可以是位置、时间等参数类型。与语义识别结果的槽位对应的参数类型可以不是温度、歌曲、歌单、音频播放模式、电影、电视剧、视频播放模式等。
又如,用户可以通过语音指令指示第一装置执行车控功能的相关操作。该语音指令可以被转换为与车控功能相关的语义识别结果。与该语义识别结果的槽位对应的参数类型例如可以是时间、温度、硬件标识等参数类型。与语义识别结果的槽位对应的参数类型可以不是位置、歌手、歌曲、歌单、音频播放模式、电影、电视剧、演员、视频播放模式等。
第一语义识别结果可以包括第一功能、第一参数。
如果第一装置事先获取到第一参数与第一参数类型对应,且第一功能与第一参数类型对应,则第一装置对第一语音信息的分析的准确率可能相对较高。在此情况下,第一装置可以根据自身得出的第一语义识别结果确定相应的操作。
例如,假设第一语义识别列表可以包括导航功能,且在第一语义识别列表中,导航功能对应的参数类型可以包括位置;假设第三语义识别列表可以包括位置A,且在第一语义识别列表中,位置A对应的参数类型可以是位置。在一种可能的场景下,用户可以通过语音指令“导航去位置A”,指示第一装置执行与导航功能相关的操作。该语音指令可以被转换为第一语义识别结果,第一语义识别结果的第一功能可以为导航功能;第一语义识别结果的第一参数可以为位置A。第一装置可以根据第一语义识别列表、第三语义识别列表,判断第一语义识别结果满足第一预设条件。第一装置可以根据第一语义识别结果确定第一装置要执行的第一操作。
如果第一装置事先获取到第一参数与第一参数类型对应,然而第一功能与第一参数类型不对应,则第一装置对第一语音信息的分析的准确率可能相对较低,例如第一参数的具体含义可能存在错误。在此情况下,第一装置如果仅根据自身得出的第一语义识别结果,确定相应的操作,可能会出现响应错误的情况。第一装置可以选择执行第二装置指示的操作,以响应用户的语音指令。
例如,假设第一语义识别列表可以包括导航功能,且在第一语义识别列表中,导航功能对应的参数类型可以包括位置;假设第三语义识别列表可以包括歌曲A,且在第一语义识别列表中,歌曲A对应的参数类型可以是歌曲。其中,歌曲A可以与位置A同名,但位置A未包含于第三语义识别列表中。在一种可能的场景下,用户可以通过语音指令“导航去位置A”,指示第一装置执行与导航功能相关的操作。然而由于位置A与歌曲A同名,第一装置可能将位置A识别为歌曲A。比如,第一语义识别结果的第一功能可以为导航功能;第一语义识别结果的第一参数可以为歌曲A。第一装置可以根据第一语义识别列表、第三语义识别列表,判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
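结合上述正反两例,“功能与参数对应相同参数类型”的判断可以示意如下;两个映射表的内容均为说明用途的假设。

```python
FUNCTION_TO_PARAM_TYPES = {"导航功能": {"位置", "时间"}}  # 功能→参数类型(假设)
PARAM_TO_TYPE = {"位置A": "位置", "歌曲A": "歌曲"}         # 参数→参数类型(假设)

def param_type_consistent(function: str, parameter: str) -> bool:
    """第一功能与第一参数对应相同的参数类型时,满足第一预设条件。"""
    param_type = PARAM_TO_TYPE.get(parameter)
    allowed_types = FUNCTION_TO_PARAM_TYPES.get(function, set())
    return param_type is not None and param_type in allowed_types

assert param_type_consistent("导航功能", "位置A")      # “位置”类型一致
assert not param_type_consistent("导航功能", "歌曲A")  # “歌曲”类型不一致
```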
可选的,第一装置可以修正语义识别结果,第一装置根据第一语音信息确定语义识别结果(为了区别,称为第二语义识别结果),第二语义识别结果指示第二功能以及指示第二参数(可以与以上第一参数是同一个参数),在预设的多个功能不包括第二功能且预设的多个参数包括第二参数的情况下,修正第二语义识别结果中的第二功能为第一功能,得到第一语义识别结果,第一功能与第二功能为类型相同的两个不同功能。
在一种实现中,所述根据所述第一语音信息,确定第一语义识别结果,包括:根据所述第一语音信息,确定第二语义识别结果,所述第二语义识别结果包括第二功能、所述第一参数;在所述第一语义识别列表不包括所述第二功能的情况下,根据所述第三语义识别列表,修正所述第二语义识别结果中的所述第二功能为所述第一功能,得到所述第一语义识别结果,所述第一功能与所述第二功能为类型相同的两个不同功能。
第一装置可以对第一语音信息进行识别,得到第二语义识别结果。第二语义识别结果例如可以是语义识别后得到的初始结果。第二语义识别结果的第二功能不属于第一语义识别列表,意味着第一装置可能对第二功能具有相对较弱的语义识别能力。
然而,第一装置例如可以具有学习、训练的能力。第一装置例如可以多次学习有关第二功能的语音指令,从而逐渐提高针对第二功能的语义识别能力。有关第二功能的部分语音指令可能最开始无法被第一装置识别,但伴随第一装置对有关第二功能的语义指令的学习和训练,有关第二功能的语音指令可以逐渐被第一装置准确识别,且第一装置准确识别的有关第二功能的语音指令的数量可以逐渐增多。
在一种可能的场景中,第二功能对应的参数类型可能相对多样化,第一装置可能无法完全储备第二功能对应的全部参数。也就是说,指示第二功能的第一语音信息可以被第一装置准确识别的概率可能相对较低。然而,如果第一参数属于第三语义识别列表,则意味着第一装置可能事先学习过第一参数。那么第一装置可以将第二语义识别结果中的第二功能修改为第一功能,得到第一语义识别结果。第一语义识别结果可以是经过修正的语义识别结果。
需要注意的是,第一功能和第二功能应当是相同类型的两个不同功能。
例如,第一功能可以是本地翻译功能,第二功能可以是云翻译功能。第一功能和第二 功能均可以属于翻译类型的功能。
又如,第一功能可以是本地导航功能,第二功能可以是云导航功能。第一功能和第二功能均可以属于导航类型的功能。
又如,第一功能可以是本地音频功能,第二功能可以是云音频功能。第一功能和第二功能均可以属于音频播放类型的功能。
又如,第一功能可以是本地视频功能,第二功能可以是云视频功能。第一功能和第二功能均可以属于视频播放类型的功能。
例如,假设第一语义识别列表可以包括本地导航功能,不包括云导航功能,且在第一语义识别列表中,本地导航功能对应的参数类型可以包括位置;假设第三语义识别列表可以包括位置A,且在第三语义识别列表中,位置A对应的参数类型可以是位置。在一种可能的场景下,用户可以通过语音指令“导航去位置A”,指示第一装置执行与导航功能相关的操作。该语音指令可以被转换为第二语义识别结果,第二语义识别结果的第二功能可以为云导航功能;第二语义识别结果的第一参数可以为位置A。由于第一参数属于第三语义识别列表(例如第一装置曾导航到位置A),因此第一装置可以将第二语义识别结果的第二功能修改为本地导航功能,得到第一语义识别结果,其中,第一语义识别结果的第一功能为本地导航功能。本地导航功能与云导航功能均属于导航类型,且本地导航功能与云导航功能可以是两个不同的功能。之后,第一装置可以根据第一语义识别列表、第三语义识别列表,判断第一语义识别结果是否满足第一预设条件。由于本地导航功能属于第一语义识别列表,在第一语义识别列表中本地导航功能对应的参数类型包括位置,且在第三语义识别列表中位置A对应的参数类型为位置。因此,第一装置可以判断第一语义识别结果满足第一预设条件。第一装置可以根据第一语义识别结果确定第一装置要执行的第一操作。
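上述“将第二功能修正为同类型的第一功能”的过程,可以用如下示意性代码表达;替换关系与列表内容均为假设。

```python
# 同类型功能之间的修正关系(内容为假设)
SAME_TYPE_FUNCTION = {"云导航功能": "本地导航功能"}
SEMANTIC_LIST_1 = {"本地导航功能"}  # 第一语义识别列表中的多个功能
SEMANTIC_LIST_3 = {"位置A"}         # 第三语义识别列表中的多个参数

def correct_function(function: str, parameter: str) -> str:
    """第二功能不在功能列表、而参数在参数列表时,修正为同类型的第一功能。"""
    if function not in SEMANTIC_LIST_1 and parameter in SEMANTIC_LIST_3:
        return SAME_TYPE_FUNCTION.get(function, function)
    return function

assert correct_function("云导航功能", "位置A") == "本地导航功能"
```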
在第一语义识别结果指示的第一意图与第一语义识别结果指示的第一参数对应相同的参数类型时,第一语义识别结果满足第一预设条件。在一种实现中,所述第三语义识别列表还用于指示所述第一参数对应第一参数类型,所述方法还包括:获取第二语义识别列表,所述第二语义识别列表包括多个意图,以及与所述多个意图对应的多个参数类型;所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果还包括第一意图,所述第一意图属于所述多个意图,在所述第二语义识别列表中,所述第一意图与所述第一参数类型对应。
在第一语义识别结果指示的第一意图与第一语义识别结果指示的第一参数对应不同的参数类型时,第一语义识别结果不满足第一预设条件。在一种实现中,所述第三语义识别列表还用于指示所述第一参数对应第一参数类型,所述方法还包括:获取第二语义识别列表,所述第二语义识别列表包括多个意图,以及与所述多个意图对应的多个参数类型;所述第一语义识别结果不满足第一预设条件,还包括:所述第一语义识别结果还包括第一意图,所述第一意图属于所述多个意图,在所述第二语义识别列表中,所述第一意图与所述第一参数类型不对应。
第一装置可以事先存储意图和对应的参数类型。多个意图和多个参数类型的对应关系(或关联关系)可以存储在第二语义识别列表中。
例如,硬件开启意图对应的参数类型可以是时间、温度、硬件标识等。
又如,路径规划意图对应的参数类型可以是位置、时间等。
又如,播放音频意图对应的参数类型可以是歌手、歌曲、歌单、时间、音频播放模式等。
又如,播放视频意图对应的参数类型可以是电影、电视剧、演员、时间、视频播放模式等。
第一装置可以事先记录参数和对应的参数类型。多个参数和多个参数类型的对应关系或关联关系可以存储在第三语义识别列表中。上文已经阐述了多个参数和多个参数类型的对应关系,在此就不再详细赘述。
用户可以通过多种多样的语音信息,实现第一装置的某一类意图。对于该类意图而言,语音信息中的槽位通常填入有限类型的参数。一个参数类型对应的参数数量例如可以是任意多个。
例如,用户通过语音指令指示第一装置执行路径规划意图的相关操作。该语音指令可以被转换为与路径规划意图相关的语义识别结果。与该语义识别结果的槽位对应的参数类型例如可以是位置、时间等参数类型。与语义识别结果的槽位对应的参数类型可以不是温度、歌手、歌曲、歌单、音频播放模式、电影、电视剧、演员、视频播放模式等。
又如,用户通过语音指令指示第一装置执行硬件开启意图的相关操作。该语音指令可以被转换为与硬件开启意图相关的语义识别结果。与该语义识别结果的槽位对应的参数类型例如可以是时间、温度、硬件标识等参数类型。与语义识别结果的槽位对应的参数类型可以不是位置、歌手、歌曲、歌单、音频播放模式、电影、电视剧、演员、视频播放模式等。
第一语义识别结果可以包括第一意图、第一参数。
如果第一装置事先获取到第一参数与第一参数类型对应,且第一意图与第一参数类型对应,则第一装置对第一语音信息的分析的准确率可能相对较高。在此情况下,第一装置可以根据自身得出的第一语义识别结果确定相应的操作。
例如,假设第二语义识别列表可以包括路径规划意图,且在第二语义识别列表中,路径规划意图对应的参数类型可以包括位置;假设第三语义识别列表可以包括位置A、位置B,且在第三语义识别列表中,位置A、位置B对应的参数类型均可以是位置。在一种可能的场景下,用户可以通过语音指令“导航去位置A,途径位置B”,指示第一装置执行与路径规划意图相关的操作。该语音指令可以被转换为第一语义识别结果,第一语义识别结果的第一意图可以为路径规划意图;第一语义识别结果的第一参数可以包括位置A、位置B。第一装置可以根据第二语义识别列表、第三语义识别列表,判断第一语义识别结果满足第一预设条件。第一装置可以根据第一语义识别结果确定第一装置要执行的第一操作。
如果第一装置事先获取到第一参数与第一参数类型对应,然而第一意图与第一参数类型不对应,则第一装置对第一语音信息的分析的准确率可能相对较低,例如第一参数的具体含义可能存在错误。在此情况下,第一装置如果仅根据自身得出的第一语义识别结果确定相应的操作,可能会出现语音响应错误的情况。第一装置可以选择执行第二装置指示的操作,以响应用户的语音指令。
例如,假设第二语义识别列表可以包括播放音频意图,且在第二语义识别列表中,播放音频意图对应的参数类型可以包括歌手;假设第三语义识别列表可以包括演员A,且在第三语义识别列表中,演员A对应的参数类型可以是演员。其中,演员A还可以有歌手的身份,演员A不仅是演员而且是歌手。然而在第三语义识别列表中,演员A对应的参数类型不包括歌手。在一种可能的场景下,用户可以通过语音指令“播放演员A的歌曲”,指示第一装置执行与播放音频意图相关的操作。然而由于第一装置可能无法将演员A识别为歌手,第一装置可能将演员A识别为演员。比如,第一语义识别结果的第一意图可以为播放音频意图;第一语义识别结果的第一参数可以为演员A。第一装置可以根据第二语义识别列表、第三语义识别列表,判断第一语义识别结果不满足第一预设条件。可选的,第一装置可以确定执行由第二装置指示的第二操作。
第一装置可以修正语义识别结果,第一装置根据第一语音信息确定语义识别结果(为了区别,称为第三语义识别结果),第三语义识别结果指示第二意图以及指示第三参数(可以与以上第一参数是同一个参数),在预设的多个意图不包括第二意图且预设的多个参数包括第三参数的情况下,修正第三语义识别结果中的第二意图为第一意图,得到第一语义识别结果,第一意图与第二意图为类型相同的两个不同意图。
在一种实现中,所述根据所述第一语音信息,确定第一语义识别结果,包括:根据所述第一语音信息,确定第三语义识别结果,所述第三语义识别结果包括第二意图、所述第一参数;在所述第二语义识别列表不包括所述第二意图的情况下,根据所述第三语义识别列表,修正所述第三语义识别结果中的所述第二意图为所述第一意图,得到所述第一语义识别结果,所述第一意图与所述第二意图为类型相同的两个不同意图。
第一装置可以对第一语音信息进行识别,得到第三语义识别结果。第三语义识别结果例如可以是语义识别后得到的初始结果。第三语义识别结果的第二意图不属于第二语义识别列表,意味着第一装置可能对第二意图具有相对较弱的语义识别能力。
然而,第一装置例如可以具有学习、训练的能力。第一装置例如可以多次学习有关第二意图的语音指令,从而逐渐提高针对第二意图的语义识别能力。有关第二意图的部分语音指令可能最开始无法被第一装置识别,但伴随第一装置对有关第二意图的语义指令的学习和训练,有关第二意图的语音指令可以逐渐被第一装置准确识别,且第一装置准确识别的有关第二意图的语音指令的数量可以逐渐增多。
在一种可能的场景中,第二意图对应的参数类型可能相对多样化,第一装置可能无法完全储备第二意图对应的全部参数。也就是说,指示第二意图的第一语音信息可以被第一装置准确识别的概率可能相对较低。然而,如果第一参数属于第三语义识别列表,则意味着第一装置可能事先学习过第一参数。那么第一装置可以将第三语义识别结果中的第二意图修改为第一意图,得到第一语义识别结果。第一语义识别结果可以是经过修正的语义识别结果。
需要注意的是,第一意图和第二意图应当是相同类型的两个不同意图。
例如,第一意图可以是本地翻译英文意图,第二意图可以是云翻译英文意图。第一意图和第二意图均可以属于翻译英文类型的意图。
又如,第一意图可以是本地路径规划意图,第二意图可以是云路径规划意图。第一意图和第二意图均可以属于路径规划类型的意图。
又如,第一意图可以是播放本地音频意图,第二意图可以是播放云音频意图。第一意图和第二意图均可以属于播放音频类型的意图。
又如,第一意图可以是播放本地视频意图,第二意图可以是播放云视频意图。第一意图和第二意图均可以属于播放视频类型的意图。
例如,假设第二语义识别列表可以包括播放本地音频意图,不包括播放云音频意图,且在第二语义识别列表中,播放本地音频意图对应的参数类型可以包括歌手;假设第三语义识别列表可以包括歌手A,且在第三语义识别列表中,歌手A对应的参数类型可以是歌手。在一种可能的场景下,用户可以通过语音指令“播放歌手A的歌曲”,指示第一装置执行与播放音频意图相关的操作。该语音指令可以被转换为第三语义识别结果,第三语义识别结果的第二意图可以为播放云音频意图;第三语义识别结果的第一参数可以为歌手A。由于第一参数属于第三语义识别列表(例如第一装置曾播放过歌手A的歌曲),因此第一装置可以将第三语义识别结果的第二意图修改为播放本地音频意图,得到第一语义识别结果,其中,第一语义识别结果的第一意图为播放本地音频意图。播放本地音频意图与播放云音频意图均属于播放音频类型,且播放本地音频意图与播放云音频意图可以是两个不同的意图。之后,第一装置可以根据第二语义识别列表、第三语义识别列表,判断第一语义识别结果是否满足第一预设条件。由于播放本地音频意图属于第二语义识别列表,在第二语义识别列表中播放本地音频意图对应的参数类型包括歌手,且在第三语义识别列表中歌手A对应的参数类型为歌手。因此,第一装置可以判断第一语义识别结果满足第一预设条件。第一装置可以根据第一语义识别结果确定第一装置要执行的第一操作。
可选的,所述第一语义识别结果满足所述第一预设条件,包括:所述第一语义识别结果包括第一指示位,所述第一指示位指示所述第一语义识别结果满足所述第一预设条件。
可选的,所述第一语义识别结果不满足所述第一预设条件,包括:所述第一语义识别结果包括第二指示位,所述第二指示位指示所述第一语义识别结果不满足所述第一预设条件。
如果第一语义识别结果中携带有第一指示位,则第一装置可以确定执行由第一装置根据第一语义识别结果确定的第一操作。第一指示位的具体内容可以包括“本地”、“端侧”等信息,以指示第一装置自身确定要执行的操作。
如果第一语义识别结果中携带有第二指示位,则第一装置可以确定执行由第二装置指示的第二操作。第二指示位的具体内容可以包括“云”、“云侧”等信息,以指示第一装置根据第二装置的指示执行操作。
可选的,如果第一语义识别结果未携带有第一指示位,且第一语义识别结果的其他内容不满足第一预设条件,则第一装置可以确定执行由第二装置指示的第二操作。
可选的,如果第一语义识别结果携带有第二指示位,且第一语义识别结果的其他内容不满足第一预设条件,则第一装置可以确定执行第一装置根据第一语义识别结果确定的第一操作。
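基于指示位的决策可以示意如下;指示位取值“端侧”/“云侧”是对上文所述内容的一种假设性编码,并非限定的实现。

```python
def decide_by_flag(flag: str) -> str:
    """根据第一/第二指示位选择执行主体(回退逻辑见上文)。"""
    if flag in ("本地", "端侧"):  # 第一指示位:满足第一预设条件
        return "执行第一装置确定的第一操作"
    if flag in ("云", "云侧"):    # 第二指示位:不满足第一预设条件
        return "执行第二装置指示的第二操作"
    return "按语义识别结果的其他内容继续判断是否满足第一预设条件"
```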
第四语义识别结果可以指示第一功能、第一意图和第一参数中的至少两种,例如,第四语义识别结果可以指示第一功能和第一参数,则在第四语义识别结果指示的第一功能与第四语义识别结果指示的第一参数对应相同的参数类型时,可以确定携带第一指示位的第一语义识别结果;在第四语义识别结果指示的第一功能与第四语义识别结果指示的第一参数对应不同的参数类型时,可以确定携带第二指示位的第一语义识别结果。再如,第四语义识别结果可以指示第一意图和第一参数,则在第四语义识别结果指示的第一意图与第四语义识别结果指示的第一参数对应相同的参数类型时,可以确定携带第一指示位的第一语义识别结果;在第四语义识别结果指示的第一意图与第四语义识别结果指示的第一参数对应不同的参数类型时,可以确定携带第二指示位的第一语义识别结果。
在第四语义识别结果指示的第一功能与第四语义识别结果指示的第一参数对应相同的参数类型时,可以确定第一语义识别结果携带第一指示位。在一种实现中,所述根据所述第一语音信息,确定第一语义识别结果,包括:根据所述第一语音信息,确定第四语义识别结果,所述第四语义识别结果包括第一功能、第一参数;在所述第四语义识别列表包括所述第一功能,所述第四语义识别列表指示所述第一功能与第一参数类型对应,所述第三语义识别列表包括所述第一参数,所述第三语义识别列表指示所述第一参数与第一参数类型对应的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
第一装置可以对第一语音信息进行识别,得到第四语义识别结果。第四语义识别结果例如可以是语义识别后得到的初始结果。第四语义识别结果的第一功能不属于第一语义识别列表,意味着第一装置可能对第一功能具有相对较弱的语义识别能力。
然而,第一装置例如可以具有学习、训练的能力。第一装置例如可以多次学习有关第一功能的语音指令,从而逐渐提高针对第一功能的语义识别能力。有关第一功能的部分语音指令可能最开始无法被第一装置识别,但伴随第一装置对有关第一功能的语义指令的学习和训练,有关第一功能的语音指令可以逐渐被第一装置准确识别,且第一装置准确识别的有关第一功能的语音指令的数量可以逐渐增多。可选的,第四语义识别列表例如可以用于记录第一装置新学习(或训练、更新等)的功能。
在一种可能的场景中,第一功能对应的参数类型可能相对多样化,第一装置可能无法完全储备第一功能对应的全部参数。也就是说,指示第一功能的第一语音信息可以被第一装置准确识别的概率可能相对较低。然而,如果第一参数属于第三语义识别列表,则意味着第一装置可能事先学习过第一参数。在第四语义识别列表中,第一功能与第一参数类型对应,且在第三语义识别列表中,第一参数与第一参数类型对应,因此第一装置准确识别第一语音信息的概率相对较高。第一装置可以确定第一语义识别结果,并通过第一语义识别结果中的第一指示位指示第一语义识别结果满足第一预设条件。第一语义识别结果可以是经过修正的语义识别结果。
例如,假设第一语义识别列表可以不包括导航功能;第四语义识别列表可以包括导航功能,且在第四语义识别列表中,导航功能对应的参数类型可以包括位置;假设第三语义识别列表可以包括位置A,且在第三语义识别列表中,位置A对应的参数类型可以是位置。在一种可能的场景下,用户可以通过语音指令“导航去位置A”,指示第一装置执行与导航功能相关的操作。该语音指令可以被转换为第四语义识别结果,第四语义识别结果的第一功能可以为导航功能;第四语义识别结果的第一参数可以为位置A。由于位置A属于第三语义识别列表(例如第一装置曾导航到位置A),在第三语义识别列表中位置A对应的参数类型为位置,且在第四语义识别列表中,导航功能对应的参数类型包括位置,因此,第一装置可以根据第四语义识别结果、第一语义识别列表、第三语义识别列表、第四语义识别列表,确定第一语义识别结果,该第一语义识别结果可以包括第一指示位。第一装置可以判断第一语义识别结果满足第一预设条件。第一装置可以根据第一语义识别结果确定第一装置要执行的第一操作。
在第四语义识别结果指示的第一意图与第四语义识别结果指示的第一参数对应相同的参数类型时,可以确定携带第一指示位的第一语义识别结果。在一种实现中,所述方法还包括:获取第五语义识别列表,所述第五语义识别列表包括多个意图,以及与所述多个意图对应的多个参数类型;获取第三语义识别列表,所述第三语义识别列表包括多个参数,以及与所述多个参数对应的多个参数类型;所述根据所述第一语音信息,确定第一语义识别结果,包括:根据所述第一语音信息,确定第五语义识别结果,所述第五语义识别结果包括第一意图、第一参数;在所述第五语义识别列表包括所述第一意图,所述第五语义识别列表指示所述第一意图与第一参数类型对应,所述第三语义识别列表包括所述第一参数,所述第三语义识别列表指示所述第一参数与第一参数类型对应的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
第一装置可以对第一语音信息进行识别,得到第五语义识别结果。第五语义识别结果例如可以是语义识别后得到的初始结果。第五语义识别结果的第一意图不属于第二语义识别列表,意味着第一装置可能对第一意图具有相对较弱的语义识别能力。
然而,第一装置例如可以具有学习、训练的能力。第一装置例如可以多次学习有关第一意图的语音指令,从而逐渐提高针对第一意图的语义识别能力。有关第一意图的部分语音指令可能最开始无法被第一装置识别,但伴随第一装置对有关第一意图的语义指令的学习和训练,有关第一意图的语音指令可以逐渐被第一装置准确识别,且第一装置准确识别的有关第一意图的语音指令的数量可以逐渐增多。可选的,第五语义识别列表例如可以用于记录第一装置新学习(或训练、更新等)的意图。
在一种可能的场景中,第一意图对应的参数类型可能相对多样化,第一装置可能无法完全储备第一意图对应的全部参数。也就是说,指示第一意图的第一语音信息可以被第一装置准确识别的概率可能相对较低。然而,如果第一参数属于第三语义识别列表,则意味着第一装置可能事先学习过第一参数。在第五语义识别列表中,第一意图与第一参数类型对应,且在第三语义识别列表中,第一参数与第一参数类型对应,因此第一装置准确识别第一语音信息的概率相对较高。第一装置可以确定第一语义识别结果,并通过第一语义识别结果中的第一指示位指示第一语义识别结果满足第一预设条件。第一语义识别结果可以是经过修正的语义识别结果。
例如,假设第二语义识别列表可以不包括播放音频意图;第五语义识别列表可以包括播放音频意图,且在第五语义识别列表中,播放音频意图对应的参数类型可以包括歌手;假设第三语义识别列表可以包括歌手A,且在第三语义识别列表中,歌手A对应的参数类型可以是歌手。在一种可能的场景下,用户可以通过语音指令“播放歌手A的歌曲”,指示第一装置执行与播放音频意图相关的操作。该语音指令可以被转换为第五语义识别结果,第五语义识别结果的第一意图可以为播放音频意图;第五语义识别结果的第一参数可以为歌手A。由于歌手A属于第三语义识别列表(例如第一装置曾播放过歌手A的歌曲),在第三语义识别列表中歌手A对应的参数类型为歌手,且在第五语义识别列表中,播放音频意图对应的参数类型包括歌手,因此,第一装置可以根据第五语义识别结果、第二语义识别列表、第三语义识别列表、第五语义识别列表,确定第一语义识别结果,该第一语义识别结果可以包括第一指示位。第一装置可以判断第一语义识别结果满足第一预设条件。第一装置可以根据第一语义识别结果确定第一装置要执行的第一操作。
在上述示例中,第三语义识别列表可以包括第一参数,且在第三语义识别列表中,第一参数可以对应第一参数类型。在另一些可能的情况下,如果第三语义识别列表不包括第一参数,或者在第三语义识别列表中,第一参数与除第一参数类型以外的其他参数类型对应,则第一装置对第一语音信息识别出的第一语义识别结果可能相对不准确。第一装置可以从第二装置获取第六语义识别结果,第六语义识别结果可以是第二装置根据第一语音信息确定的。
可选的,第一装置可以根据第二装置指示的第六语义识别结果,确定第二参数以及第二参数类型,并保存第二参数与第二参数类型的关联关系。在一种实现中,所述方法还包括:根据所述第六语义识别结果,确定第二参数以及第二参数类型;将所述第二参数与所述第二参数类型的关联关系记录在第三语义识别列表中。
第六语义识别结果可以包括第二参数和第二参数类型。第一装置可以将第二参数和第二参数类型的关联关系记录在第三语义识别列表中。在后续的语音交互过程中,如果第一装置再次遇到与第二参数和第二参数类型相关的语音指令,由于第三语义识别列表包括了第二参数和第二参数类型的关联关系,因此第一装置可以更容易准确识别用户的语音指令。
可选的,所述方法还包括:从第二装置获取N个语义识别结果;所述将所述第二参数与所述第二参数类型的关联关系记录在第三语义识别列表中,包括:在所述N个语义识别结果中的每个语义识别结果均包括所述第二参数,所述每个语义识别结果的第二参数均对应第二参数类型,且N大于第一预设阈值的情况下,将所述第二参数与所述第二参数类型的关联关系记录在第三语义识别列表中。
在第二参数和第二参数类型的关联关系出现多次后,第一装置可以将第二参数与第二参数类型的关联关系记录在第三语义识别列表中。也就是说,第一装置学习第二参数和第二参数类型的关联关系的次数可以相对较多。针对参数和参数类型的关联关系,单次语音识别的代表性可能相对较低。第一装置多次获知第二参数和第二参数类型的关联关系,使得第二参数和第二参数类型之间的关联关系可以相对准确或相对重要,进而有利于提高第一装置识别语音指令准确度。
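这一“多次出现才记录”的学习过程可以用如下示意性代码概括;阈值取值与数据结构均为假设。

```python
from collections import Counter

assoc_counts = Counter()   # 统计(参数, 参数类型)关联出现的次数
semantic_list_3 = {}       # 第三语义识别列表:参数→参数类型
N_THRESHOLD = 3            # 第一预设阈值(取值为假设)

def learn_association(param: str, param_type: str) -> None:
    """从第六语义识别结果中多次观察到同一关联后,才记录到第三语义识别列表。"""
    assoc_counts[(param, param_type)] += 1
    if assoc_counts[(param, param_type)] > N_THRESHOLD:
        semantic_list_3[param] = param_type

for _ in range(4):                   # 同一关联出现4次(N=4,大于阈值3)
    learn_association("位置A", "位置")
assert semantic_list_3["位置A"] == "位置"
```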
图7是本申请实施例提供的一种语音交互的方法700的示意性流程图。图7所示的方法700例如可以应用于图5所示的语音交互系统500。
701,第一装置响应用户的输入操作,执行语音唤醒操作。
702,第一装置获取来自于语音传感器的第一语音信息。
在一个可能的示例中,第一语音信息可以由用户在第一轮语音交互中指示。第一轮语音交互可以是多轮语音交互中的首轮语音交互或非首轮语音交互,第一轮语音交互可以不是多轮语音交互中最后一轮语音交互。
703,所述第一装置根据所述第一语音信息,确定第一语义识别结果。
704,第一装置根据所述第一语义识别结果和第一预设条件,确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作,或者,确定执行由第二装置指示的第二操作。
701至704的具体实施方式可以参照图6所示的601、602、603a、603b,在此不再详细赘述。
可选的,在第一轮语音交互结束前后,第一装置可以执行第一操作或第二操作。也就是说,第一装置执行第一操作或第二操作可以意味着第一轮语音交互结束,或者意味着下一轮语音交互开始。
705,第一装置获取来自于语音传感器的第二语音信息。
第二语音信息可以由用户在第二轮语音交互中指示。可选的,第二轮语音交互可以是第一轮语音交互的下一轮语音交互。
706,所述第一装置根据所述第二语音信息,确定第七语义识别结果。
705至706的具体实施方式可以参照702至703,在此不再详细赘述。
多轮语音交互可以应用于语音交互信息较多的场景。在多轮语音交互场景下,用户和第一装置可以针对某个特殊场景或特殊领域进行语音对话。在一种可能的场景中,用户可能无法通过一次语音指令彻底达到语音操控的目的。例如,通过语音交互,用户可以控制第一装置购买城市A到城市B的飞机票。然而,城市A到城市B航班数量可能非常多。用户可以和第一装置多次语音交互,以最终确定需要购买的航班信息。
多轮语音交互中相邻两轮语音交互通常具有关联性。例如,第一装置可以询问用户哪个时间段的航班可以优先购买,用户可以向第一装置回复某个时间段。又如,第一装置可以询问用户优选收听哪个歌手的作品,用户可以向第一装置回复某个歌手的姓名。也就是说,第二语音信息可以与第一语音信息所指示的操作相关。在此情况下,第一装置可以确定执行第二语音信息指示的操作,以继续当前的多轮语音交互。
然而,用户的回复可能具有一定随机性。用户的回复可能与第一装置询问或待获取的语音信息无关。例如,第一装置询问用户哪个时间段的航班可以优先购买,用户却向第一装置回复某个歌手的姓名。又如,第一装置询问用户哪个时间段的航班可以优先购买,用户却向第一装置回复当前发生车祸。
如果第一装置完全追随用户的回复,则可能导致先前语音交互的内容被作废,进而可能会增加用户对第一装置的操控次数。例如,通过多轮语音交互,用户可以将准备购买的航班信息、用户的身份信息等均指示给第一装置;然而,在最终支付费用之前,用户无意中向第一装置指示了与航班、机票等无关的语音信息。如果第一装置结束有关购买飞机票的多轮语音交互,响应了用户在无意中做出的指示,这不仅不符合用户的期待,而且使得用户先前指示的航班信息、身份信息等均被作废。如果用户想要重新购买飞机票,用户需要再次将航班信息、身份信息等指示给第一装置。
如果第一装置完全不理会用户的回复,则可能导致第一装置在某些特殊场景下无法响应用户的指示,使得用户的语音指令失去效力。
在相邻两轮语音交互过程中,如果第一装置获取的两个语音信息没有关联或关联性较弱,或者后一轮语音交互的语音信息和前一轮语音交互的操作不对应或不相关,第一装置可以根据情况选择是否结束多轮语音交互。例如,在第一语音信息和第二语音信息无关或关联性小于第二预设阈值的情况下,或者,在第二语音信息不属于针对第一操作或第二操作的反馈的情况下,为满足用户的语音控制体验,第一装置可以根据第二预设条件,判断当前是否结束多轮语音交互。第二预设条件可以是用于判断是否结束多轮语音交互的预设条件。例如,在第二预设条件被满足的情况下,第一装置可以结束多轮语音交互,并开始新的语音交互;在第二预设条件未被满足的情况下,第一装置可以继续当前的多轮语音交互。
可选的,第一装置可以判断第二语音信息是否是与第一语音信息所指示的操作相关。 例如,第一操作用于询问用户问题A,第二语音信息是针对问题A的答复。
第二语音信息与第一语音信息所指示的操作相关,例如可以包括第一语音信息与第二语音信息相关,或者第一语音信息与第二语音信息的关联度高于第二预设阈值。例如,在第一语音信息与第二语音信息均指示以下一项或多项的情况下,第二语音信息与第一语音信息所指示的操作相关:相同功能、相同意图、相同参数。
下面针对以下场景进一步阐述:第二语音信息与第一语音信息所指示的操作无关,或者,第二语音信息与第一语音信息所指示的操作的关联性低于第二预设阈值,或者,第一语音信息与第二语音信息无关,或者第一语音信息与第二语音信息的关联度低于第二预设阈值。在此场景下,第一装置可能需要结束当前多轮语音交互,也可能需要重复上一轮语音交互对应的操作,以继续当前多轮语音交互。
707a,第一装置根据所述第七语义识别结果、第二预设条件,确定执行所述由所述第一语音信息指示的操作。
707b,第一装置根据所述第七语义识别结果、第二预设条件,确定执行由所述第一装置根据所述第七语义识别结果确定的第三操作。
707c,第一装置根据所述第七语义识别结果、第二预设条件,确定执行由所述第二装置指示的第四操作。
707a、707b、707c中的任一个可以被执行。
在707a中,即使第一装置获取到了第二语音信息,第一装置可以仍然执行上一轮语音交互中的操作。
由所述第一语音信息指示的操作可以为所述第一操作或所述第二操作。在一个示例中,如果第一装置在704中执行的操作为第一操作,则在707a中第一装置可以仍确定执行第一操作。在此情况下,当前多轮语音交互可以是端侧多轮语音交互。如果第一装置在704中执行的操作为第二操作,则在707a中第一装置可以仍确定执行第二操作。在此情况下,当前多轮语音交互可以是云侧多轮语音交互。
在707b、707c中,第一装置获取到了第二语音信息,并可以执行第二语音信息指示的操作。其中,在707b中,第一装置可以确定执行第一装置根据第七语义识别结果确定的第三操作;在707c中,第一装置可以确定执行第二装置指示的第四操作。第一装置确定第三操作的具体实施方式可以参照图6所示的603a,在此就不再详细赘述。第一装置确定第四操作的具体实施方式可以参照图6所示的603b,在此就不再详细赘述。
在一个示例中,如果第一装置在704中执行的操作为第一操作,第一装置在707b中确定执行第三操作。在此情况下,新的端侧语音交互可以结束前一轮端侧语音交互。
在一个示例中,如果第一装置在704中执行的操作为第一操作,第一装置在707c中确定执行第四操作。在此情况下,新的云侧语音交互可以结束前一轮端侧语音交互。
在一个示例中,如果第一装置在704中执行的操作为第二操作,第一装置在707b中确定执行第三操作。在此情况下,新的端侧语音交互可以结束前一轮云侧语音交互。
在一个示例中,如果第一装置在704中执行的操作为第二操作,第一装置在707c中确定执行第四操作。在此情况下,新的云侧语音交互可以结束前一轮云侧语音交互。
结合上文可知,第一预设条件可以用于指示第一装置是否根据自身确定的语义识别结果确定相应的操作,第二预设条件可以用于指示第一装置是否结束多轮语音交互。第一装置可以根据第七语义识别结果、第一预设条件、第二预设条件,确定第一装置针对第二语音信息要执行的操作。
例如,第七语义识别结果可以不满足第二预设条件,意味着第一装置可以不结束多轮语音交互,并确定执行第一语音信息指示的操作,即执行上述707a。第一装置可以判断第七语义识别结果是否满足第一预设条件,也可以不判断第七语义识别结果是否满足第一预设条件。
又如,第七语义识别结果可以满足第二预设条件,且可以满足第一预设条件。第七语义识别结果满足第二预设条件,意味着第一装置可以结束多轮语音交互,并确定执行第二语音信息指示的操作。第七语义识别结果满足第一预设条件,意味着第一装置可以确定执行第一装置根据第七语义识别结果确定的操作,即执行上述707b。
又如,第七语义识别结果可以满足第二预设条件,且可以不满足第一预设条件。第七语义识别结果满足第二预设条件,意味着第一装置可以结束多轮语音交互,并确定执行第二语音信息指示的操作。第七语义识别结果不满足第一预设条件,意味着第一装置可以确定执行第二装置指示的操作,即执行上述707c。
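上述三个分支的选择逻辑可以用如下示意性代码概括;两个判断函数对应上文的第一、第二预设条件,此处作为参数传入,函数名与返回值均为假设。

```python
def decide_multiturn(result7, prev_action, meets_first, meets_second):
    """根据第一、第二预设条件在707a/707b/707c三个分支之间选择(示意)。"""
    if not meets_second(result7):
        return prev_action  # 707a:继续执行第一语音信息指示的操作
    if meets_first(result7):
        return "第三操作(第一装置根据第七语义识别结果确定)"  # 707b
    return "第四操作(由第二装置指示)"                        # 707c
```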
可选的,所述方法还包括:向第二装置发送第二语音信息;所述根据所述第七语义识别结果、第二预设条件,确定执行由所述第二装置指示的第四操作,包括:在所述第七语义识别结果不满足所述第一预设条件的情况下,或者,在所述第七语义识别结果不满足所述第二预设条件,且所述第一语义识别结果不满足所述第一预设条件的情况下,从第二装置获取第八语义识别结果;根据所述第八语义识别结果、所述第二预设条件,确定执行所述由所述第一语音信息指示的操作,或者,确定执行所述第四操作。
在第七语义识别结果不满足第一预设条件的情况下,第一装置识别得到的第七语义识别结果可能相对不准确,第二装置识别得到的第八语义识别结果可能相对准确。第一装置可以判断第八语义识别结果是否满足第二预设条件,进而确定是否结束当前的多轮语音交互。
在一种可能的场景下,第一语义识别结果可以满足第一预设条件,可以意味着当前的多轮语音交互可以是端侧多轮语音交互。如果第七语义识别结果不满足第一预设条件,且第八语义识别结果满足第二预设条件,则可以意味着当前的端侧多轮语音交互可以被结束,新的语音交互可以是云侧语音交互。
如果第七语义识别结果不满足第二预设条件,意味着第一装置可以不根据第七语义识别结果确定要执行的操作。在所述第一语义识别结果不满足所述第一预设条件的情况下,可以意味着当前的多轮语音交互属于云侧语音交互。在此情况下,第一装置可以从第二装置获取第八语义识别结果,以继续当前的云侧多轮语音交互。第一装置可以判断第八语义识别结果是否满足第二预设条件,进而确定是否结束当前的云侧多轮语音交互。
例如,第八语义识别结果可以不满足第二预设条件,意味着第一装置可以不结束多轮语音交互,即确定执行第一语音信息指示的操作,即执行上述707a。
又如,第八语义识别结果可以满足第二预设条件,意味着第一装置可以结束多轮语音交互,即确定执行第二装置指示的操作,即执行上述707c。
假设与第一语音信息指示的操作相关的语义识别结果的优先级为第一优先级,与第一语音信息指示的操作无关的语义识别结果的优先级为第二优先级。
在一种可能的示例中,如果第七语义识别结果的优先级为第一优先级,则可以认为第二语音信息与第一语音信息指示的操作相关。第一装置可以确定执行第二语音信息指示的操作。如果第七语义识别结果的优先级为第二优先级,则可以认为第二语音信息与第一语音信息指示的操作无关。如果第二优先级高于第一优先级,则第一装置可以结束当前的多轮语音交互,并可以确定执行第二语音信息指示的操作。如果第二优先级低于第一优先级,则第一装置可以继续当前的多轮语音交互,并可以确定执行第一语音信息指示的操作。
在另一种可能的示例中,如果第八语义识别结果的优先级为第一优先级,则可以认为第二语音信息与第一语音信息指示的操作相关。第一装置可以确定执行第二装置指示的操作。如果第八语义识别结果的优先级为第二优先级,则可以认为第二语音信息与第一语音信息指示的操作无关。如果第二优先级高于第一优先级,则第一装置可以结束当前的多轮语音交互,并可以确定执行第二装置指示的操作。如果第二优先级低于第一优先级,则第一装置可以继续当前的多轮语音交互,并可以确定执行第一语音信息指示的操作。
可选的,所述第七语义识别结果满足所述第二预设条件,包括:所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级。
第一装置可以比较第七语义识别结果的优先级和第一语义识别结果的优先级,判断是否结束当前的多轮语音交互。可选的,在704中,第一装置可以确定执行第一操作,也可以确定执行第二操作。可选的,在第七语义识别结果满足所述第二预设条件的情况下,第一装置可以执行707b,也可以执行707c。
可选的,所述根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作,包括:在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;根据所述第六语义识别结果,确定执行所述第二操作;所述第七语义识别结果满足所述第二预设条件,包括:所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级。
在一种可能的示例中,在704中,第一装置可以从第二装置获取第六语义识别结果,以确定执行第二装置指示的第二操作。在此情况下,第一装置可以比较第七语义识别结果的优先级和第六语义识别结果的优先级,判断是否结束当前的多轮语音交互。
可选的,所述第八语义识别结果满足所述第二预设条件,包括:所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级。
第一装置可以比较第八语义识别结果的优先级和第一语义识别结果的优先级,判断是否结束当前的多轮语音交互。可选的,在704中,第一装置可以确定执行第一操作,也可以确定执行第二操作。
可选的,所述根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作,包括:在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;根据所述第六语义识别结果,确定执行所述第二操作;所述第八语义识别结果满足所述第二预设条件,包括:所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级。
在一种可能的示例中,在704中,第一装置可以从第二装置获取第六语义识别结果,以确定执行第二装置指示的第二操作。在此情况下,第一装置可以比较第八语义识别结果的优先级和第六语义识别结果的优先级,判断是否结束当前的多轮语音交互。
也就是说,用户可以通过高优先级的语音指令,结束当前的多轮语音交互。在一个可能的示例中,第一装置例如可以记录一些高优先级的指令。如果第七语义识别结果或第八语义识别结果与高优先级的指令对应,第一装置可以结束当前的多轮语音交互。
可选的,所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:所述第七语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;所述第七语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;所述第七语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
第一装置可以比较第三功能的优先级和第一功能的优先级,和/或,比较第三意图的优先级和第一意图的优先级,和/或,比较第三参数的优先级和第一参数的优先级,判断是否结束当前的多轮语音交互。也就是说,第一装置可以记录一些高优先级的功能、和/或高优先级的意图和/或高优先级的参数。如果第一装置识别出高优先级的功能、高优先级的意图、高优先级的参数中的一个或多个,第一装置可以结束当前的多轮语音交互,并确定执行第二语音信息指示的操作。
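功能、意图、参数三个维度的优先级比较可以示意如下;优先级表的内容与数值均为假设,数值越大表示优先级越高。

```python
FUNCTION_PRIORITY = {"安全控制功能": 10, "导航功能": 5}    # 功能优先级(假设)
INTENT_PRIORITY = {"事故模式意图": 10, "播放音频意图": 3}  # 意图优先级(假设)
PARAM_PRIORITY = {"门锁": 8, "歌曲A": 1}                   # 参数优先级(假设)

def meets_second_condition(new: dict, old: dict) -> bool:
    """第七语义识别结果在任一维度的优先级更高时,满足第二预设条件。"""
    return (FUNCTION_PRIORITY.get(new.get("function"), 0)
            > FUNCTION_PRIORITY.get(old.get("function"), 0)
            or INTENT_PRIORITY.get(new.get("intent"), 0)
            > INTENT_PRIORITY.get(old.get("intent"), 0)
            or PARAM_PRIORITY.get(new.get("parameter"), 0)
            > PARAM_PRIORITY.get(old.get("parameter"), 0))

# 例如,“安全控制功能”优先于“导航功能”,可结束当前多轮语音交互
assert meets_second_condition({"function": "安全控制功能"}, {"function": "导航功能"})
```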
例如,第七语义识别结果包括第三功能,第三功能可以为“安全控制功能”。第一语义识别结果包括第一功能,第一功能可以为“导航功能”。为确保设备安全,“安全控制功能”的优先级可以高于其他功能,例如高于“导航功能”。由此,第一装置可以根据第三功能的优先级和第一功能的优先级,确定第七语义识别结果满足第二预设条件,进而确定执行第二语音信息指示的操作。
又如,第七语义识别结果包括第三意图,第三意图可以为“事故模式意图”。第一语义识别结果包括第一意图,第一意图可以为“播放音频意图”。“事故模式意图”可以指用户有启用事故模式的意图。由于事故安全具有一定的紧迫性,“事故模式意图”的优先级可以高于其他意图,例如高于“播放音频意图”。由此,第一装置可以根据第三意图的优先级和第一意图的优先级,确定第七语义识别结果满足第二预设条件,进而确定执行第二语音信息指示的操作。
又如,第七语义识别结果包括第三参数,第三参数可以为“门锁”。第一语义识别结果包括第一参数,第一参数可以为“歌曲A”。用户可以通过第二语音信息指示第一装置执行与“门锁”相关的操作。由于“门锁”的开启或关闭相对容易影响车辆的行驶安全和停车安全,因此“门锁”的优先级可以高于其他参数,例如高于“歌曲A”。由此,第一装置可以根据第三参数的优先级和第一参数的优先级,确定第七语义识别结果满足第二预设条件,进而确定执行第二语音信息指示的操作。
可选的,所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:所述第七语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;所述第七语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;所述第七语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
第七语义识别结果的优先级高于第六语义识别结果的优先级的相关示例,可以参照第七语义识别结果的优先级高于第一语义识别结果的优先级的相关示例,在此就不再详细赘述。
可选的,所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:所述第八语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;所述第八语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;所述第八语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
第八语义识别结果的优先级高于所述第一语义识别结果的优先级的相关示例,可以参照第七语义识别结果的优先级高于第一语义识别结果的优先级的相关示例,在此就不再详细赘述。
可选的,所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:所述第八语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;所述第八语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;所述第八语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
第八语义识别结果的优先级高于所述第六语义识别结果的优先级的相关示例,可以参照第七语义识别结果的优先级高于第一语义识别结果的优先级的相关示例,在此就不再详细赘述。
下面通过一些示例,并结合上文阐述的示例,介绍一些可能的场景。应理解,本申请实施例提供的方案应不限于下面提供的示例。
图8是本申请实施例提供的一种语音交互的方法800的示意性流程图。图8所示的方法800例如可以应用于图5所示的语音交互系统500。
801,第一装置响应用户的输入操作,执行语音唤醒操作。
801的具体实施方式例如可以参照图6所示的601或图7所示的701,在此不再详细赘述。
802,第一装置获取来自于语音传感器的语音信息a。
802的具体实施方式例如可以参照图6所示的602或图7所示的702,在此不再详细赘述。可选的,语音信息a例如可以对应上文中的第一语音信息。
在一个可能的示例中,语音信息a例如可以是“打开设备A”。
803,第一装置向第二装置发送语音信息a。
803的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
804,第一装置根据语音信息a,确定语义识别结果a。
804的具体实施方式例如可以参照图6所示的603a或图7所示的703,在此不再详细赘述。可选的,语义识别结果a例如可以对应上文中的第一语义识别结果。
在一个可能的示例中,语义识别结果a例如可以包括功能a、意图a、参数a;其中,功能a可以是“车控功能”;意图a可以是“打开”;参数a可以是“设备A”。可选的,功能a例如可以与上文中的第一功能对应。意图a例如可以与上文中的第一意图对应。参数a例如可以与上文中的第一参数对应。
805,第一装置根据语义识别结果a和第一预设条件,确定执行由第一装置根据语义识别结果a确定的操作a。
805的具体实施方式例如可以参照图6所示的603a或图7所示的704,在此不再详细赘述。可选的,操作a例如可以对应上文中的第一操作。
在一个可能的示例中,第一装置存储有语义识别列表a(可选的,语义识别列表a例如可以对应上文中的第一语义识别列表,或者对应上文中的总表,该总表可以包含上文中的第一语义识别列表、第二语义识别列表、第三语义识别列表)。语义识别列表a的多个功能可以包括“车控功能”,语义识别列表a的多个意图可以包括“打开”,语义识别列表a的多个参数可以包括“设备A”。并且,在语义识别列表a中,功能“车控功能”可以与意图“打开”对应,且意图“打开”可以与参数“设备A”对应。第一装置可以根据语义识别结果a和语义识别列表a,判断语义识别结果a满足第一预设条件。由于语义识别结果a满足第一预设条件,第一装置可以确定执行由第一装置根据语义识别结果a确定的操作a。操作a例如可以是打开设备A。
806,第二装置向第一装置发送语义识别结果b。
806的具体实施方式例如可以参照图6所示的603b或图7所示的704,在此不再详细赘述。可选的,语义识别结果b例如可以对应上文中的第六语义识别结果。
在一个可能的示例中,语义识别结果b例如可以与语义识别结果a相同或相似。语义识别结果b例如可以包括功能b、意图b、参数b;其中,功能b可以是“车控功能”;意图b可以是“打开”;参数b可以是“设备A”。可选的,参数b例如可以与上文中的第二参数对应。可选的,功能b例如可以与上文中的第四功能对应。意图b例如可以与上文中的第四意图对应。参数b例如可以与上文中的第四参数对应。
807,第一装置丢弃来自第二装置的语义识别结果b。
由于第一装置可以确定执行由第一装置根据语义识别结果a确定的操作a,因此第一装置可以忽略第二装置针对语音信息a的指示。
807可以是一个可选的步骤。807的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
在图8所示的示例中,第一装置可以根据已有的语义识别列表,判断当前识别的语义识别结果是否具有相对较高的正确率。在当前识别的语义识别结果可能具有相对较高的正确率的情况下,第一装置可以选择自行确定要执行的操作,进而有利于在第一装置擅长的场景下快速、准确地响应用户的语音指令。
图9是本申请实施例提供的一种语音交互的方法900的示意性流程图。图9所示的方法900例如可以应用于图5所示的语音交互系统500。
901,第一装置响应用户的输入操作,执行语音唤醒操作。
901的具体实施方式例如可以参照图6所示的601或图7所示的701,在此不再详细赘述。
902,第一装置获取来自于语音传感器的语音信息b。
902的具体实施方式例如可以参照图6所示的602或图7所示的702,在此不再详细赘述。可选的,语音信息b例如可以对应上文中的第一语音信息。
在一个可能的示例中,语音信息b例如可以是“导航去位置A”。
903,第一装置向第二装置发送语音信息b。
903的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
904,第一装置根据语音信息b,确定语义识别结果c。
904的具体实施方式例如可以参照图6所示的603a或图7所示的703,在此不再详细赘述。可选的,语义识别结果c例如可以对应上文中的第一语义识别结果。
在一个可能的示例中,语义识别结果c例如可以包括功能c、意图c、参数c;其中,功能c可以是“云导航功能”;意图c可以是“路径规划”;参数c可以是“位置B”。“位置A”与“位置B”可以相同也可以不同。可选的,功能c例如可以与上文中的第一功能对应。意图c例如可以与上文中的第一意图对应。参数c例如可以与上文中的第一参数对应。在另一个可能的示例中,语义识别结果c例如可以包括指示信息a,指示信息a用于指示不支持语义识别。
905,第一装置根据语义识别结果c和第一预设条件,确定执行由第二装置指示的操作b。
905的具体实施方式例如可以参照图6所示的603a或图7所示的704,在此不再详细赘述。可选的,操作b例如可以对应上文中的第二操作。
在一个可能的示例中,第一装置存储有语义识别列表b(可选的,语义识别列表b例如可以对应上文中的第一语义识别列表,或者对应上文中的总表,该总表可以包含上文中的第一语义识别列表、第二语义识别列表、第三语义识别列表)。语义识别列表b的多个功能可以不包括“云导航功能”,和/或,语义识别列表b的多个参数可以不包括“位置B”,和/或,在语义识别列表b中,功能“云导航功能”可以不与参数“位置B”对应。第一装置可以根据语义识别结果c和语义识别列表b,判断语义识别结果c不满足第一预设条件。由于语义识别结果c不满足第一预设条件,第一装置可以确定执行由第二装置指示的操作b。
906,第二装置向第一装置发送针对语音信息b的语义识别结果d。
906的具体实施方式例如可以参照图6所示的603b或图7所示的704,在此不再详细赘述。可选的,语义识别结果d例如可以对应上文中的第六语义识别结果。
在一个可能的示例中,语义识别结果d例如可以包括功能d、意图d、参数d;其中,功能d可以是“云导航功能”;意图d可以是“路径规划”;参数d可以是“位置A”。可选的,参数d例如可以与上文中的第二参数对应。可选的,功能d例如可以与上文中的第四功能对应。意图d例如可以与上文中的第四意图对应。参数d例如可以与上文中的第四参数对应。
907,第一装置根据语义识别结果d,执行操作b。
907的具体实施方式例如可以参照图6所示的603b或图7所示的704,在此不再详细赘述。操作b例如可以是路径规划到位置A。
可选的,图9所示的示例还可以包括以下步骤。
908,第一装置根据语义识别结果d,确定参数d以及参数类型a,并将参数d与参数类型a的关联关系记录在语义识别列表b中。
908的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。可选的,参数d例如可以对应上文中的第二参数。参数类型a例如可以对应上文中的第二参数类型。语义识别列表b例如可以对应上文中的第三语义识别列表。
在一个示例中,参数类型a可以为“位置”。更新后的语义识别列表b可以包括以下关联关系:“位置A”-“位置”。
909,第一装置获取来自于语音传感器的语音信息c。
909的具体实施方式例如可以参照图6所示的602或图7所示的702,在此不再详细赘述。可选的,语音信息c例如可以对应上文中的第一语音信息。
在一个可能的示例中,语音信息c例如可以是“导航去位置A”。
910,第一装置向第二装置发送语音信息c。
910的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
911,第一装置根据语音信息c,确定语义识别结果e,语义识别结果e包括功能c和参数d。
911的具体实施方式例如可以参照图6所示的603a,在此不再详细赘述。可选的,语义识别结果e例如可以与上文中的第二语义识别结果对应。功能c例如可以与上文中的第二功能对应。参数d例如可以与上文中的第一参数对应。
在一个可能的示例中,语义识别结果e例如可以包括功能c、意图c、参数d;其中,功能c可以是“云导航功能”;意图c可以是“路径规划”;参数d可以是“位置A”。
912,第一装置根据语义识别结果e和语义识别列表b,修正语义识别结果e中的功能c为功能d,得到语义识别结果f,功能d与功能c为类型相同的两个不同功能。
912的具体实施方式例如可以参照图6所示的603a,在此不再详细赘述。可选的,功能d例如可以与上文中的第一功能对应。语义识别结果f例如可以与上文中的第一语义识别结果对应。
在一个可能的示例中,由于语义识别列表b的多个参数可以包括“位置A”,因此第一装置可以将语义识别结果e中的“云导航功能”修改为“本地导航功能”,得到语义识别结果f。语义识别结果f例如可以包括功能d、意图c、参数d。功能d可以是“本地导航功能”;意图c可以是“路径规划”;参数d可以是“位置A”。“云导航功能”与“本地导航功能”均属于导航功能,且“云导航功能”与“本地导航功能”是两个不同的功能。可选的,功能d例如可以与上文中的第一功能对应。意图c例如可以与上文中的第一意图对应。参数d例如可以与上文中的第一参数对应。
913,第一装置根据语义识别结果f和第一预设条件,确定执行由第一装置根据语义识别结果f确定的操作c。
913的具体实施方式例如可以参照图6所示的603a或图7所示的704,在此不再详细赘述。可选的,操作c例如可以对应上文中的第一操作。
在一个可能的示例中,第一装置存储有语义识别列表b(可选的,语义识别列表b例如可以对应上文中的第一语义识别列表,或者对应上文中的总表,该总表可以包含上文中的第一语义识别列表、第二语义识别列表、第三语义识别列表)。语义识别列表b的多个功能可以包括“本地导航功能”;在语义识别列表b中,功能“本地导航功能”可以与参数类型“位置”对应;并且,在语义识别列表c中,参数“位置A”可以与参数类型“位置”对应。第一装置可以根据语义识别结果f、语义识别列表b、语义识别列表c,判断语义识别结果f满足第一预设条件。由于语义识别结果f满足第一预设条件,第一装置可以确定执行由第一装置根据语义识别结果f确定的操作c。操作c例如可以是路径规划到位置A。
914,第一装置丢弃来自第二装置针对语音信息c的语义识别结果g。
由于第一装置可以确定执行由第一装置根据语义识别结果f确定的操作c,因此第一装置可以忽略第二装置针对语音信息c的指示。
914可以是一个可选的步骤。914的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
在图9所示的示例中,第一装置可以根据已有的语义识别列表,判断当前识别的语义识别结果是否具有相对较高的正确率。在当前识别的语义识别结果可能不具有相对较高的正确率的情况下,第一装置可以选择根据其他装置的指示执行操作,进而有利于使第一装置准确地响应用户的语音指令。另外,第一装置还可以在其他装置的指示下对用户的语音指令学习,进而有利于拓宽第一装置能够自行响应用户语音指令的场景。
图10是本申请实施例提供的一种语音交互的方法的示意性流程图。图10所示的方法例如可以应用于图5所示的语音交互系统500。
1001,第一装置响应用户的输入操作,执行语音唤醒操作。
1001的具体实施方式例如可以参照图6所示的601或图7所示的701,在此不再详细赘述。
1002,第一装置获取来自于语音传感器的语音信息d。
1002的具体实施方式例如可以参照图6所示的602或图7所示的702,在此不再详细赘述。可选的,语音信息d例如可以对应上文中的第一语音信息。
在一个可能的示例中,语音信息d例如可以是“导航去位置C”。
1003,第一装置向第二装置发送语音信息d。
1003的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
1004,第一装置根据语音信息d,确定语义识别结果h。
1004的具体实施方式例如可以参照图6所示的603a或图7所示的703,在此不再详细赘述。可选的,语义识别结果h例如可以对应上文中的第一语义识别结果。
1005,第一装置根据语义识别结果h和第一预设条件,确定执行由第二装置指示的操作d。
1005的具体实施方式例如可以参照图6所示的603a或图7所示的704,在此不再详细赘述。可选的,操作d例如可以对应上文中的第二操作。
1006,第二装置向第一装置发送针对语音信息d的语义识别结果i。
1006的具体实施方式例如可以参照图6所示的603b或图7所示的704,在此不再详细赘述。可选的,语义识别结果i例如可以对应上文中的第六语义识别结果。
1007,第一装置根据语义识别结果i,执行操作d。
1007的具体实施方式例如可以参照图6所示的603b或图7所示的704,在此不再详细赘述。
在一个可能的示例中,操作d例如可以是播报以下内容:寻找到多个相关目的地,请选择位置C-1、位置C-2、位置C-3。操作d可以用于询问准确的导航目的地。
可选的,在一个可能的示例中,1001至1007可以是第一轮云侧语音交互。
1008,第一装置获取来自于语音传感器的语音信息e。
1008的具体实施方式例如可以参照图7所示的705,在此不再详细赘述。可选的,语音信息e例如可以对应上文中的第二语音信息。可选的,语音信息e与操作d可以无关。
在一个可能的示例中,语音信息e例如可以是“打开设备A”。
1009,第一装置向第二装置发送语音信息e。
1009的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
1010,第一装置根据语音信息e,确定语义识别结果j。
1010的具体实施方式例如可以参照图7所示的706,在此不再详细赘述。可选的,语义识别结果j例如可以对应上文中的第七语义识别结果。
在一个示例中,语义识别结果j例如可以指示“打开设备A”。
1011,在语义识别结果j不满足第二预设条件的情况下,第一装置从第二装置获取针对语音信息e的语义识别结果k。
1011的具体实施方式例如可以参照图7所示的707a,在此不再详细赘述。可选的,语义识别结果k例如可以对应上文中的第八语义识别结果。
语义识别结果j不满足第二预设条件,第一装置可以确定不执行语义识别结果j指示的操作。第一装置可以获取第二装置针对语音信息e的语义识别结果,以判断是否执行语音信息e指示的操作。
在一个可能的示例中,语义识别结果j的优先级可以低于语义识别结果h或语义识别结果i的优先级。
1012,在语义识别结果k不满足第二预设条件的情况下,第一装置确定重复执行操作d。
1012的具体实施方式例如可以参照图7所示的707a,在此不再详细赘述。
语义识别结果k不满足第二预设条件,第一装置可以确定不执行语义识别结果k指示的操作。第一装置可以重复执行操作d,以使得第一轮云侧语音交互可以被继续执行。
在一个可能的示例中,语义识别结果k的优先级可以低于语义识别结果h或语义识别结果i的优先级。
1013,第一装置获取来自于语音传感器的语音信息f。
1013的具体实施方式例如可以参照图7所示的705,在此不再详细赘述。可选的,语音信息f例如可以对应上文中的第二语音信息。可选的,语音信息f与操作d可以相关。
在一个可能的示例中,语音信息f例如可以是“位置C-1,途径位置D”。
1014,第一装置向第二装置发送语音信息f。
1014的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
1015,第一装置根据语音信息f,确定语义识别结果m。
1015的具体实施方式例如可以参照图7所示的706,在此不再详细赘述。可选的,语义识别结果m例如可以对应上文中的第七语义识别结果。
在一个示例中,语义识别结果m例如可以不满足第一预设条件。
1016,第一装置从第二装置获取针对语音信息f的语义识别结果n。
1016的具体实施方式例如可以参照图7所示的707c,在此不再详细赘述。可选的,语义识别结果n例如可以对应上文中的第八语义识别结果。
语义识别结果m不满足第一预设条件,第一装置可以判断是执行第二装置指示的操作,还是重复执行上一轮操作。第一装置可以获取第二装置针对语音信息f的语义识别结果n,以判断是否执行语音信息f指示的操作。
1017,在语义识别结果n与操作d相关的情况下,确定执行第二装置指示的操作e。
1017的具体实施方式例如可以参照图7所示的707c,在此不再详细赘述。
语义识别结果n与操作d相关,可以意味着语音信息f是针对上一轮语音交互的答复。第一装置可以确定执行第二装置指示的操作e。1013至1017可以是第二轮云侧语音交互。第一轮云侧语音交互和第二轮云侧语音交互可以是多轮语音交互中的两轮语音交互。
在一个可能的示例中,操作e例如可以是播报以下内容:寻找到多个相关途经地,请选择位置D-1、位置D-2、位置D-3。操作e可以用于询问准确的导航途经地。
1018,第一装置获取来自于语音传感器的语音信息g。
1018的具体实施方式例如可以参照图6所示的602或图7所示的702,在此不再详细赘述。可选的,语音信息g例如可以对应上文中的第二语音信息。
在一个可能的示例中,语音信息g例如可以是“车辆发生事故”。
1019,第一装置向第二装置发送语音信息g。
1019的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
1020,第一装置根据语音信息g,确定语义识别结果p。
1020的具体实施方式例如可以参照图6所示的603a或图7所示的703,在此不再详细赘述。可选的,语义识别结果p例如可以对应上文中的第七语义识别结果。
在一个示例中,语义识别结果p可以包括功能e、意图e;其中,功能e可以是“安全控制功能”;意图e可以是“启动事故模式”。
1021,在语义识别结果p满足第二预设条件的情况下,确定执行由第一装置根据语义识别结果p确定的操作f。
1021的具体实施方式例如可以参照图7所示的707b,在此不再详细赘述。可选的,操作f例如可以对应上文中的第三操作。
语义识别结果p满足第二预设条件,意味着第一装置可以结束当前的云侧多轮语音交互。第一装置可以确定执行语义识别结果p指示的操作。
在一个可能的示例中,语义识别结果p的优先级可以高于语义识别结果n或语义识别结果m的优先级。
1022,第二装置向第一装置发送针对语音信息g的语义识别结果q。
1022的具体实施方式例如可以参照图6所示的603b或图7所示的704,在此不再详细赘述。
在一个可能的示例中,语义识别结果q例如可以指示语音信息g与操作e不匹配。
1023,第一装置丢弃来自第二装置的语义识别结果q。
由于第一装置可以确定执行由第一装置根据语义识别结果p确定的操作f,因此第一装置可以忽略第二装置针对语音信息g的指示。
1023的具体实施方式例如可以参照图6所示的603b,在此不再详细赘述。
在图10所示的示例中,在当前语音信息与上一轮操作不匹配的情况下,第一装置可以根据与多轮语音交互相关的预设条件,判断是否结束当前多轮语音交互。这样有利于适应性保留多轮语音交互的优势,且有利于在特殊情况下及时响应用户的相对紧迫、相对重要的语音指令。
图11是本申请实施例提供的一种语音交互的装置1100的示意性结构图。该装置1100包括获取单元1101和处理单元1102。该装置1100可以用于执行本申请实施例提供的语音交互的方法的各步骤。
例如,获取单元1101可以用于执行图6所示的方法600中的602,处理单元1102可以用于执行图6所示的方法600中的603a。可选的,装置1100还包括发送单元,发送单元可以用于执行图6所示的方法600中的603b。
又如,获取单元1101可以用于执行图7所示的方法700中的702、705,处理单元1102可以用于执行图7所示的方法700中的703、704、706、707。
获取单元1101,用于获取来自于语音传感器的第一语音信息。
处理单元1102,用于根据所述第一语音信息,确定执行所述第一语音信息指示的目标操作。
处理单元1102例如可以包括图5所示示例中的语义识别模块、操作决策模块。
可选地,作为一个实施例,处理单元1102具体用于:根据所述第一语音信息,确定第一语义识别结果;根据所述第一语义识别结果和第一预设条件,确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作,或者,确定执行由第二装置指示的第二操作。
可选地,作为一个实施例,处理单元1102具体用于:在所述第一语义识别结果满足所述第一预设条件的情况下,确定执行所述第一操作。
可选地,作为一个实施例,所述第一装置预设有多个功能,所述第一语义识别结果满足第一预设条件,包括:所述第一语义识别结果指示第一功能,所述第一功能属于所述多个功能。
可选地,作为一个实施例,所述第一装置预设有多个意图,所述第一语义识别结果满足第一预设条件,包括:所述第一语义识别结果指示第一意图,所述第一意图属于所述多个意图。
可选地,作为一个实施例,所述第一装置预设有多个参数,所述第一语义识别结果满足第一预设条件,包括:所述第一语义识别结果指示第一参数,所述第一参数属于所述多个参数。
可选地,作为一个实施例,所述第一语义识别结果指示第一功能以及指示第一参数,所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果指示的所述第一功能与所述第一语义识别结果指示的所述第一参数对应相同的参数类型。
可选地,作为一个实施例,处理单元1102具体用于:根据所述第一语音信息,确定第二语义识别结果,所述第二语义识别结果指示第二功能以及指示所述第一参数;在所述第一装置预设的多个功能不包括所述第二功能,且所述第一装置预设的多个参数包括所述第一参数的情况下,修正所述第二语义识别结果中的所述第二功能为所述第一功能,得到所述第一语义识别结果,所述第一功能与所述第二功能为类型相同的两个不同功能。
可选地,作为一个实施例,所述第一语义识别结果指示第一意图以及指示第一参数,所述第一语义识别结果满足第一预设条件,还包括:所述第一语义识别结果指示的所述第一意图与所述第一语义识别结果指示的所述第一参数对应相同的参数类型。
可选地,作为一个实施例,处理单元1102具体用于:根据所述第一语音信息,确定第三语义识别结果,所述第三语义识别结果包括第二意图以及指示所述第一参数;在所述 第一装置预设的多个意图不包括所述第二意图,且所述第一装置预设的多个参数包括所述第一参数的情况下,修正所述第三语义识别结果中的所述第二意图为所述第一意图,得到所述第一语义识别结果,所述第一意图与所述第二意图为类型相同的两个不同意图。
可选地,作为一个实施例,所述第一语义识别结果满足所述第一预设条件,包括:所述第一语义识别结果包括第一指示位,所述第一指示位指示所述第一语义识别结果满足所述第一预设条件。
可选地,作为一个实施例,处理单元1102具体用于:根据所述第一语音信息,确定第四语义识别结果,所述第四语义识别结果包括第一功能和第一参数;在所述第一功能属于所述第一装置预设的多个功能,且所述第一参数属于所述第一装置预设的多个参数,且所述第一功能和所述第一参数对应相同的参数类型的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
可选地,作为一个实施例,处理单元1102具体用于:根据所述第一语音信息,确定第五语义识别结果,所述第五语义识别结果包括第一意图和第一参数;在所述第一意图属于所述第一装置预设的多个意图,且所述第一参数属于所述第一装置预设的多个参数,且所述第一意图和所述第一参数对应相同的参数类型的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
可选地,作为一个实施例,装置还包括:发送单元,用于向所述第二装置发送所述第一语音信息;处理单元1102还用于,丢弃来自所述第二装置的第六语义识别结果。
可选地,作为一个实施例,处理单元1102具体用于:在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;根据所述第六语义识别结果,确定执行所述第二操作。
可选地,作为一个实施例,处理单元1102还用于:根据所述第六语义识别结果,确定第二参数以及第二参数类型;装置还包括:存储单元,用于保存所述第二参数与所述第二参数类型的关联关系。
可选地,作为一个实施例,获取单元1101还用于,获取来自于所述语音传感器的第二语音信息;处理单元1102还用于,根据所述第二语音信息,确定第七语义识别结果;处理单元1102还用于,根据所述第七语义识别结果、第二预设条件,确定执行由所述第一语音信息指示的操作,或者,确定执行由所述第一装置根据所述第七语义识别结果确定的第三操作,或者,确定执行由所述第二装置指示的第四操作。
可选地,作为一个实施例,处理单元1102具体用于,在所述第七语义识别结果满足所述第一预设条件且满足所述第二预设条件的情况下,确定执行所述第三操作;在所述第七语义识别结果不满足所述第一预设条件且满足所述第二预设条件的情况下,确定执行所述第四操作;在所述第七语义识别结果不满足所述第二预设条件的情况下,确定执行所述与所述第一语义识别结果对应的操作。
可选地,作为一个实施例,所述第七语义识别结果满足所述第二预设条件,包括:所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级。
可选地,作为一个实施例,所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:所述第七语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;所述第七语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;所述第七语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
可选地,作为一个实施例,处理单元1102具体用于:在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自所述第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;根据所述第六语义识别结果,确定执行所述第二操作;所述第七语义识别结果满足所述第二预设条件,包括:所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级。
可选地,作为一个实施例,所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:所述第七语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;所述第七语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;所述第七语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
可选地,作为一个实施例,装置还包括:发送单元,用于向所述第二装置发送第二语音信息;处理单元1102具体用于:在所述第七语义识别结果不满足所述第一预设条件的情况下,或者,在所述第七语义识别结果不满足所述第二预设条件,且所述第一语义识别结果不满足所述第一预设条件的情况下,从所述第二装置获取第八语义识别结果;根据所述第八语义识别结果、所述第二预设条件,确定执行所述由所述第一语音信息指示的操作,或者,确定执行所述第四操作。
发送单元例如可以是图5所示示例中的收发模块。
可选地,作为一个实施例,处理单元1102具体用于:在所述第八语义识别结果满足所述第二预设条件的情况下,确定执行所述第四操作;在所述第八语义识别结果不满足所述第二预设条件的情况下,确定执行所述由所述第一语音信息指示的操作。
可选地,作为一个实施例,所述第八语义识别结果满足所述第二预设条件,包括:所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级。
可选地,作为一个实施例,所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:所述第八语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;所述第八语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;所述第八语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
可选地,作为一个实施例,处理单元1102具体用于:在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自所述第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;根据所述第六语义识别结果,确定执行所述第二操作;所述第八语义识别结果满足所述第二预设条件,包括:所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级。
可选地,作为一个实施例,所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:所述第八语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;所述第八语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;所述第八语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
可选地,作为一个实施例,所述第二语音信息与所述第一语音信息指示的操作无关。
可选地,作为一个实施例,装置还包括:唤醒模块,用于响应用户的输入操作,执行语音唤醒操作。
图12是本申请实施例提供的一种语音交互的装置1200的示意性结构图。该装置1200可以包括至少一个处理器1202和通信接口1203。
可选地,该装置1200还可以包括存储器1201和总线1204中的一项或多项。其中,存储器1201、处理器1202和通信接口1203中的任意两项之间或全部三项之间均可以通过总线1204实现彼此之间的通信连接。
可选地,存储器1201可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器1201可以存储程序,当存储器1201中存储的程序被处理器1202执行时,处理器1202和通信接口1203用于执行本申请实施例提供的语音交互的方法的各个步骤。也就是说,处理器1202可以通过通信接口1203从存储器1201获取存储的指令,以执行本申请实施例提供的语音交互的方法的各个步骤。
可选地,存储器1201可以具有图1所示存储器152的功能,以实现上述存储程序的功能。
可选地,处理器1202可以采用通用的中央处理器(central processing unit,CPU),微处理器,专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphic processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例提供的语音交互的装置中的单元所需执行的功能,或者执行本申请实施例提供的语音交互的方法的各个步骤。
可选地,处理器1202可以具有图1所示处理器151的功能,以实现上述执行相关程序的功能。
可选地,处理器1202还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请实施例提供的语音交互的方法的各个步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。
可选地,上述处理器1202还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成本申请实施例语音交互的装置中包括的单元所需执行的功能,或者执行本申请实施例提供的语音交互的方法的各个步骤。
可选地,通信接口1203可以使用例如但不限于收发器一类的收发装置,来实现装置与其他设备或通信网络之间的通信。该通信接口1203例如还可以是接口电路。
总线1204可包括在装置各个部件(例如,存储器、处理器、通信接口)之间传送信息的通路。
本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得该计算机实现上述方法实施例提供的语音交互的方法。
本申请实施例还提供一种终端设备,该终端设备包括上述任意一种语音交互的装置,例如图11或图12所示的语音交互的装置等。
示例性地,该终端可以为车辆。或者,该终端还可以是对车辆进行远程控制的终端。
上述语音交互的装置既可以是安装在车辆上的,又可以是独立于车辆的,例如可以是利用无人机、其他车辆、机器人等来控制该目标车辆。
除非另有定义,本申请所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本申请中在本申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请。
本申请的各个方面或特征可以实现成方法、装置或使用标准编程和/或工程技术的制品。本申请中使用的术语“制品”可以涵盖可从任何计算机可读器件、载体或介质访问的计算机程序。例如,计算机可读介质可以包括但不限于:磁存储器件(例如,硬盘、软盘或磁带等),光盘(例如,压缩盘(compact disc,CD)、数字通用盘(digital versatile disc,DVD)等),智能卡和闪存器件(例如,可擦写可编程只读存储器(erasable programmable read-only memory,EPROM)、卡、棒或钥匙驱动器等)。
本申请描述的各种存储介质可代表用于存储信息的一个或多个设备和/或其它机器可读介质。术语“机器可读介质”可以包括但不限于:无线信道和能够存储、包含和/或承载指令和/或数据的各种其它介质。
需要说明的是,当处理器为通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件时,存储器(存储模块)可以集成在处理器中。
还需要说明的是,本申请描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (63)

  1. 一种语音交互的方法,其特征在于,应用于第一装置,所述方法包括:
    获取来自于语音传感器的第一语音信息;
    根据所述第一语音信息,确定第一语义识别结果;
    根据所述第一语义识别结果和第一预设条件,确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作,或者,确定执行由第二装置指示的第二操作。
  2. 如权利要求1所述的方法,其特征在于,所述根据所述第一语义识别结果和第一预设条件,确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作,包括:
    在所述第一语义识别结果满足所述第一预设条件的情况下,确定执行所述第一操作。
  3. 如权利要求2所述的方法,其特征在于,所述第一装置预设有多个功能,
    所述第一语义识别结果满足第一预设条件,包括:
    所述第一语义识别结果指示第一功能,所述第一功能属于所述多个功能。
  4. 如权利要求2或3所述的方法,其特征在于,所述第一装置预设有多个意图,
    所述第一语义识别结果满足第一预设条件,包括:
    所述第一语义识别结果指示第一意图,所述第一意图属于所述多个意图。
  5. 如权利要求2至4中任一项所述的方法,其特征在于,所述第一装置预设有多个参数,
    所述第一语义识别结果满足第一预设条件,包括:
    所述第一语义识别结果指示第一参数,所述第一参数属于所述多个参数。
  6. 如权利要求2至5中任一项所述的方法,其特征在于,所述第一语义识别结果指示第一功能以及指示第一参数,所述第一语义识别结果满足第一预设条件,还包括:
    所述第一语义识别结果指示的所述第一功能与所述第一语义识别结果指示的所述第一参数对应相同的参数类型。
  7. 如权利要求6所述的方法,其特征在于,所述根据所述第一语音信息,确定第一语义识别结果,包括:
    根据所述第一语音信息,确定第二语义识别结果,所述第二语义识别结果指示第二功能以及指示所述第一参数;
    在所述第一装置预设的多个功能不包括所述第二功能,且所述第一装置预设的多个参数包括所述第一参数的情况下,修正所述第二语义识别结果中的所述第二功能为所述第一功能,得到所述第一语义识别结果,所述第一功能与所述第二功能为类型相同的两个不同功能。
  8. 如权利要求2至5中任一项所述的方法,其特征在于,所述第一语义识别结果指示第一意图以及指示第一参数,所述第一语义识别结果满足第一预设条件,还包括:
    所述第一语义识别结果指示的所述第一意图与所述第一语义识别结果指示的所述第一参数对应相同的参数类型。
  9. 如权利要求8所述的方法,其特征在于,所述根据所述第一语音信息,确定第一语义识别结果,包括:
    根据所述第一语音信息,确定第三语义识别结果,所述第三语义识别结果指示第二意图以及指示所述第一参数;
    在所述第一装置预设的多个意图不包括所述第二意图,且所述第一装置预设的多个参数包括所述第一参数的情况下,修正所述第三语义识别结果中的所述第二意图为所述第一意图,得到所述第一语义识别结果,所述第一意图与所述第二意图为类型相同的两个不同意图。
  10. 如权利要求2所述的方法,其特征在于,所述第一语义识别结果满足所述第一预设条件,包括:
    所述第一语义识别结果包括第一指示位,所述第一指示位指示所述第一语义识别结果满足所述第一预设条件。
  11. 如权利要求10所述的方法,其特征在于,所述根据所述第一语音信息,确定第一语义识别结果,包括:
    根据所述第一语音信息,确定第四语义识别结果,所述第四语义识别结果指示第一功能和第一参数;
    在所述第一功能属于所述第一装置预设的多个功能,且所述第一参数属于所述第一装置预设的多个参数,且所述第一功能和所述第一参数对应相同的参数类型的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
  12. 如权利要求10所述的方法,其特征在于,
    所述根据所述第一语音信息,确定第一语义识别结果,包括:
    根据所述第一语音信息,确定第五语义识别结果,所述第五语义识别结果指示第一意图和第一参数;
    在所述第一意图属于所述第一装置预设的多个意图,且所述第一参数属于所述第一装置预设的多个参数,且所述第一意图和所述第一参数对应相同的参数类型的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
  13. 如权利要求2至12中任一项所述的方法,其特征在于,所述方法还包括:
    向所述第二装置发送所述第一语音信息;
    丢弃来自所述第二装置的第六语义识别结果。
  14. 如权利要求1所述的方法,其特征在于,所述根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作,包括:
    在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;
    根据所述第六语义识别结果,确定执行所述第二操作。
  15. 如权利要求14所述的方法,其特征在于,所述方法还包括:
    根据所述第六语义识别结果,确定第二参数以及第二参数类型;
    保存所述第二参数与所述第二参数类型的关联关系。
  16. 如权利要求1至15中任一项所述的方法,其特征在于,所述方法还包括:
    获取来自于所述语音传感器的第二语音信息;
    根据所述第二语音信息,确定第七语义识别结果;
    根据所述第七语义识别结果、第二预设条件,确定执行由所述第一语音信息指示的操作,或者,确定执行由所述第一装置根据所述第七语义识别结果确定的第三操作,或者,确定执行由所述第二装置指示的第四操作。
  17. 如权利要求16所述的方法,其特征在于,所述根据所述第七语义识别结果、第二预设条件,确定执行由所述第一语音信息指示的操作,或者,确定执行由所述第一装置根据所述第七语义识别结果确定的第三操作,或者,确定执行由所述第二装置指示的第四操作,包括:
    在所述第七语义识别结果满足所述第一预设条件且满足所述第二预设条件的情况下,确定执行所述第三操作;
    在所述第七语义识别结果不满足所述第一预设条件且满足所述第二预设条件的情况下,确定执行所述第四操作;
    在所述第七语义识别结果不满足所述第二预设条件的情况下,确定执行所述与所述第一语义识别结果对应的操作。
  18. 如权利要求17所述的方法,其特征在于,所述第七语义识别结果满足所述第二预设条件,包括:
    所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级。
  19. 如权利要求18所述的方法,其特征在于,所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:
    所述第七语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;
    所述第七语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;
    所述第七语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
  20. 如权利要求17所述的方法,其特征在于,所述根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作,包括:
    在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自所述第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;
    根据所述第六语义识别结果,确定执行所述第二操作;
    所述第七语义识别结果满足所述第二预设条件,包括:
    所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级。
  21. 如权利要求20所述的方法,其特征在于,所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:
    所述第七语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;
    所述第七语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;
    所述第七语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
  22. 如权利要求16所述的方法,其特征在于,所述方法还包括:
    向所述第二装置发送第二语音信息;
    所述根据所述第七语义识别结果、第二预设条件,确定执行由所述第一语音信息指示的操作,或者,确定执行由所述第二装置指示的第四操作,包括:
    在所述第七语义识别结果不满足所述第一预设条件的情况下,或者,在所述第七语义识别结果不满足所述第二预设条件,且所述第一语义识别结果不满足所述第一预设条件的情况下,从所述第二装置获取第八语义识别结果;
    根据所述第八语义识别结果、所述第二预设条件,确定执行所述由所述第一语音信息指示的操作,或者,确定执行所述第四操作。
  23. 如权利要求22所述的方法,其特征在于,所述根据所述第八语义识别结果、所述第二预设条件,确定执行所述由所述第一语音信息指示的操作,或者,确定执行所述第四操作,包括:
    在所述第八语义识别结果满足所述第二预设条件的情况下,确定执行所述第四操作;
    在所述第八语义识别结果不满足所述第二预设条件的情况下,确定执行所述由所述第一语音信息指示的操作。
  24. 如权利要求23所述的方法,其特征在于,所述第八语义识别结果满足所述第二预设条件,包括:
    所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级。
  25. 如权利要求24所述的方法,其特征在于,所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:
    所述第八语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;
    所述第八语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;
    所述第八语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
  26. 如权利要求23所述的方法,其特征在于,所述根据所述第一语义识别结果和第一预设条件,确定执行由第二装置指示的第二操作,包括:
    在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自所述第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;
    根据所述第六语义识别结果,确定执行所述第二操作;
    所述第八语义识别结果满足所述第二预设条件,包括:
    所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级。
  27. 如权利要求26所述的方法,其特征在于,所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:
    所述第八语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;
    所述第八语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;
    所述第八语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
  28. 如权利要求16至27中任一项所述的方法,其特征在于,所述第二语音信息与所述第一语音信息指示的操作无关。
  29. 如权利要求1至28中任一项所述的方法,其特征在于,所述方法还包括:
    响应用户的输入操作,执行语音唤醒操作。
  30. 一种语音交互的装置,其特征在于,包括:
    获取单元,用于获取来自于语音传感器的第一语音信息;
    处理单元,用于根据所述第一语音信息,确定第一语义识别结果;
    所述处理单元,还用于根据所述第一语义识别结果和第一预设条件,确定执行由所述第一装置根据所述第一语义识别结果确定的第一操作,或者,确定执行由第二装置指示的第二操作。
  31. 如权利要求30所述的装置,其特征在于,所述处理单元具体用于:
    在所述第一语义识别结果满足所述第一预设条件的情况下,确定执行所述第一操作。
  32. 如权利要求31所述的装置,其特征在于,所述第一装置预设有多个功能,所述第一语义识别结果满足第一预设条件,包括:
    所述第一语义识别结果指示第一功能,所述第一功能属于所述多个功能。
  33. 如权利要求31或32所述的装置,其特征在于,所述第一装置预设有多个意图,
    所述第一语义识别结果满足第一预设条件,包括:
    所述第一语义识别结果指示第一意图,所述第一意图属于所述多个意图。
  34. 如权利要求31至33中任一项所述的装置,其特征在于,所述第一装置预设有多个参数,
    所述第一语义识别结果满足第一预设条件,包括:
    所述第一语义识别结果指示第一参数,所述第一参数属于所述多个参数。
  35. 如权利要求31至34中任一项所述的装置,其特征在于,所述第一语义识别结果指示第一功能以及指示第一参数,
    所述第一语义识别结果满足第一预设条件,还包括:
    所述第一语义识别结果指示的所述第一功能与所述第一语义识别结果指示的所述第一参数对应相同的参数类型。
  36. 如权利要求35所述的装置,其特征在于,所述处理单元具体用于:
    根据所述第一语音信息,确定第二语义识别结果,所述第二语义识别结果指示第二功能以及指示所述第一参数;
    在所述第一装置预设的多个功能不包括所述第二功能,且所述第一装置预设的多个参数包括所述第一参数的情况下,修正所述第二语义识别结果中的所述第二功能为所述第一功能,得到所述第一语义识别结果,所述第一功能与所述第二功能为类型相同的两个不同功能。
  37. 如权利要求31至34中任一项所述的装置,其特征在于,所述第一语义识别结果指示第一意图以及指示第一参数,
    所述第一语义识别结果满足第一预设条件,还包括:
    所述第一语义识别结果指示的所述第一意图与所述第一语义识别结果指示的所述第一参数对应相同的参数类型。
  38. 如权利要求37所述的装置,其特征在于,所述处理单元具体用于:
    根据所述第一语音信息,确定第三语义识别结果,所述第三语义识别结果指示第二意图以及指示所述第一参数;
    在所述第一装置预设的多个意图不包括所述第二意图,且所述第一装置预设的多个参数包括所述第一参数的情况下,修正所述第三语义识别结果中的所述第二意图为所述第一意图,得到所述第一语义识别结果,所述第一意图与所述第二意图为类型相同的两个不同意图。
  39. 如权利要求31所述的装置,其特征在于,所述第一语义识别结果满足所述第一预设条件,包括:
    所述第一语义识别结果包括第一指示位,所述第一指示位指示所述第一语义识别结果满足所述第一预设条件。
  40. 如权利要求39所述的装置,其特征在于,
    所述处理单元具体用于:
    根据所述第一语音信息,确定第四语义识别结果,所述第四语义识别结果指示第一功能和第一参数;
    在所述第一功能属于所述第一装置预设的多个功能,且所述第一参数属于所述第一装置预设的多个参数,且所述第一功能和所述第一参数对应相同的参数类型的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
  41. 如权利要求39所述的装置,其特征在于,
    所述处理单元具体用于:
    根据所述第一语音信息,确定第五语义识别结果,所述第五语义识别结果包括第一意图和第一参数;
    在所述第一意图属于所述第一装置预设的多个意图,且所述第一参数属于所述第一装置预设的多个参数,且所述第一意图和所述第一参数对应相同的参数类型的情况下,确定所述第一语义识别结果,所述语义识别结果包括所述第一指示位。
  42. 如权利要求31至41中任一项所述的装置,其特征在于,所述装置还包括:
    发送单元,用于向所述第二装置发送所述第一语音信息;
    所述处理单元还用于,丢弃来自所述第二装置的第六语义识别结果。
  43. 如权利要求30所述的装置,其特征在于,所述处理单元具体用于:
    在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;
    根据所述第六语义识别结果,确定执行所述第二操作。
  44. 如权利要求43所述的装置,其特征在于,所述处理单元还用于:
    根据所述第六语义识别结果,确定第二参数以及第二参数类型;
    所述装置还包括存储单元,用于保存所述第二参数与所述第二参数类型的关联关系。
  45. 如权利要求30至44中任一项所述的装置,其特征在于,
    所述获取单元还用于,获取来自于所述语音传感器的第二语音信息;
    所述处理单元还用于,根据所述第二语音信息,确定第七语义识别结果;
    所述处理单元还用于,根据所述第七语义识别结果、第二预设条件,确定执行由所述第一语音信息指示的操作,或者,确定执行由所述第一装置根据所述第七语义识别结果确定的第三操作,或者,确定执行由所述第二装置指示的第四操作。
  46. 如权利要求45所述的装置,其特征在于,所述处理单元具体用于,
    在所述第七语义识别结果满足所述第一预设条件且满足所述第二预设条件的情况下,确定执行所述第三操作;
    在所述第七语义识别结果不满足所述第一预设条件且满足所述第二预设条件的情况下,确定执行所述第四操作;
    在所述第七语义识别结果不满足所述第二预设条件的情况下,确定执行所述与所述第一语义识别结果对应的操作。
  47. 如权利要求46所述的装置,其特征在于,所述第七语义识别结果满足所述第二预设条件,包括:
    所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级。
  48. 如权利要求47所述的装置,其特征在于,所述第七语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:
    所述第七语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;
    所述第七语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;
    所述第七语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
  49. 如权利要求46所述的装置,其特征在于,所述处理单元具体用于:
    在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自所述第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;
    根据所述第六语义识别结果,确定执行所述第二操作;
    所述第七语义识别结果满足所述第二预设条件,包括:
    所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级。
  50. 如权利要求49所述的装置,其特征在于,所述第七语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:
    所述第七语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;
    所述第七语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;
    所述第七语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
  51. 如权利要求45所述的装置,其特征在于,所述装置还包括:
    发送单元,用于向所述第二装置发送第二语音信息;
    所述处理单元具体用于:
    在所述第七语义识别结果不满足所述第一预设条件的情况下,或者,在所述第七语义识别结果不满足所述第二预设条件,且所述第一语义识别结果不满足所述第一预设条件的情况下,从所述第二装置获取第八语义识别结果;
    根据所述第八语义识别结果、所述第二预设条件,确定执行所述由所述第一语音信息指示的操作,或者,确定执行所述第四操作。
  52. 如权利要求51所述的装置,其特征在于,所述处理单元具体用于:
    在所述第八语义识别结果满足所述第二预设条件的情况下,确定执行所述第四操作;
    在所述第八语义识别结果不满足所述第二预设条件的情况下,确定执行所述由所述第一语音信息指示的操作。
  53. 如权利要求52所述的装置,其特征在于,所述第八语义识别结果满足所述第二预设条件,包括:
    所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级。
  54. 如权利要求53所述的装置,其特征在于,所述第八语义识别结果的优先级高于所述第一语义识别结果的优先级,包括以下中的一项或多项:
    所述第八语义识别结果指示的功能的优先级高于所述第一语义识别结果指示的功能的优先级;
    所述第八语义识别结果指示的意图的优先级高于所述第一语义识别结果指示的意图的优先级;
    所述第八语义识别结果指示的参数的优先级高于所述第一语义识别结果指示的参数的优先级。
  55. 如权利要求52所述的装置,其特征在于,所述处理单元具体用于:
    在所述第一语义识别结果不满足所述第一预设条件的情况下,获取来自所述第二装置的第六语义识别结果,所述第六语义识别结果用于指示所述第二操作;
    根据所述第六语义识别结果,确定执行所述第二操作;
    所述第八语义识别结果满足所述第二预设条件,包括:
    所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级。
  56. 如权利要求55所述的装置,其特征在于,所述第八语义识别结果的优先级高于所述第六语义识别结果的优先级,包括以下中的一项或多项:
    所述第八语义识别结果指示的功能的优先级高于所述第六语义识别结果指示的功能的优先级;
    所述第八语义识别结果指示的意图的优先级高于所述第六语义识别结果指示的意图的优先级;
    所述第八语义识别结果指示的参数的优先级高于所述第六语义识别结果指示的参数的优先级。
  57. 如权利要求45至56中任一项所述的装置,其特征在于,所述第二语音信息与所述第一语音信息指示的操作无关。
  58. 如权利要求30至57中任一项所述的装置,其特征在于,所述装置还包括:
    唤醒模块,用于响应用户的输入操作,执行语音唤醒操作。
  59. 一种语音交互的装置,其特征在于,所述装置包括:处理器和存储器,所述处理器与所述存储器耦合,所述存储器用于存储计算机程序,所述处理器用于执行所述存储器中存储的计算机程序,以使得所述装置执行如权利要求1至29所述的方法。
  60. 一种计算机可读介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行如权利要求1至29所述的方法。
  61. 一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行如权利要求1至29所述的方法。
  62. A chip, wherein the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory, to perform the method according to any one of claims 1 to 29.
  63. A voice interaction system, wherein the voice interaction system comprises the first apparatus according to any one of claims 1 to 29 and the second apparatus according to any one of claims 1 to 29, and the first apparatus is configured to perform the method according to any one of claims 1 to 29.
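
The sketches below are editorial illustrations of the decision logic recited in the apparatus claims, in order: the first-preset-condition gate of claims 40 and 41, the intent correction of claim 38, the priority-based second preset condition of claims 47/48 and 53/54, the local/cloud arbitration of claims 46 and 52, and the parameter-type learning of claim 44. All names, data structures, and example values are hypothetical; the claims do not prescribe any particular implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical preset tables on the first apparatus (the device); the
# concrete functions, intents, parameters, and types are invented here.
PRESET_FUNCTIONS = {"navigation", "music"}
PRESET_INTENTS = {"navigate_to", "play_song"}
PRESET_PARAMS = {"home", "office", "Song A"}
PARAM_TYPE = {"home": "location", "office": "location", "Song A": "media"}
EXPECTED_TYPE = {  # parameter type each function/intent expects
    "navigation": "location", "navigate_to": "location",
    "music": "media", "play_song": "media",
}

@dataclass
class SemanticResult:
    function: Optional[str] = None
    intent: Optional[str] = None
    parameter: Optional[str] = None
    param_type: Optional[str] = None
    meets_first_condition: bool = False  # the "first indication bit"

def apply_first_condition(result: SemanticResult) -> SemanticResult:
    """Set the first indication bit when the function (claim 40) or the
    intent (claim 41) and the parameter are all locally preset and
    correspond to the same parameter type."""
    head = result.function if result.function is not None else result.intent
    locally_known = (
        (result.function in PRESET_FUNCTIONS or result.intent in PRESET_INTENTS)
        and result.parameter in PRESET_PARAMS
    )
    if locally_known and EXPECTED_TYPE.get(head) == PARAM_TYPE.get(result.parameter):
        result.meets_first_condition = True
    return result
```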
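
Claim 38 handles the case where the locally recognized intent is unknown but the parameter is known: the unknown second intent is replaced with a preset intent of the same type. A minimal sketch, assuming a hypothetical same-type lookup table:

```python
# Assumed mapping from a parameter type to a preset intent of that type.
SAME_TYPE_INTENT = {"location": "navigate_to", "media": "play_song"}

def correct_intent(third: SemanticResult) -> SemanticResult:
    """Claim 38: if the recognized intent is not preset but the parameter
    is, rewrite the intent to a same-type preset intent, yielding the
    first semantic recognition result."""
    if third.intent not in PRESET_INTENTS and third.parameter in PRESET_PARAMS:
        ptype = PARAM_TYPE[third.parameter]
        third.intent = SAME_TYPE_INTENT.get(ptype, third.intent)
    return third
```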
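
The second preset condition of claims 47/48 (and of claims 53/54 and 55/56, compared against the sixth result) reduces to a priority comparison at function, intent, or parameter granularity. A sketch with an invented ranking:

```python
# Invented priority ranking; e.g. safety-critical functions outrank media.
PRIORITY = {"call": 3, "navigation": 2, "navigate_to": 2,
            "music": 1, "play_song": 1}

def meets_second_condition(new: SemanticResult, old: SemanticResult) -> bool:
    """Claims 48/50/54/56: the newer result outranks the older one on
    one or more of function, intent, or parameter priority."""
    def rank(name: Optional[str]) -> int:
        return PRIORITY.get(name, 0)
    return (rank(new.function) > rank(old.function)
            or rank(new.intent) > rank(old.intent)
            or rank(new.parameter) > rank(old.parameter))
```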
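
Claims 46, 51, and 52 then arbitrate among the operation already indicated by the first voice information, the device's own (seventh) result for the new utterance, and the cloud's (eighth) result. The sketch below compresses that decision tree into one function; `fetch_cloud_result` is an assumed stand-in for obtaining the second apparatus's recognition of the second voice information:

```python
def arbitrate(first: SemanticResult,
              seventh: SemanticResult,
              fetch_cloud_result) -> str:
    """Compressed decision tree of claims 46 and 52; the returned labels
    are illustrative stand-ins for the claimed operations."""
    if meets_second_condition(seventh, first):
        if seventh.meets_first_condition:
            return "third operation: device result for the new utterance"
        eighth = fetch_cloud_result()  # eighth semantic recognition result
        if meets_second_condition(eighth, first):
            return "fourth operation: cloud result for the new utterance"
    return "operation indicated by the first voice information"
```

For instance, under the invented ranking above, a follow-up "call" request would outrank a pending music request and be handled locally if it passes the first preset condition, whereas an unrelated low-priority remark (claim 57) falls through to the operation originally indicated by the first voice information.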
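
Finally, claim 44 lets the device learn from the cloud fallback: the parameter and parameter type carried by the sixth semantic recognition result are stored so that the same parameter can pass the local gate next time. A hedged sketch using the in-memory tables above:

```python
def learn_from_cloud(sixth: SemanticResult) -> None:
    """Claim 44: persist the association between the second parameter and
    the second parameter type (here, into the hypothetical tables)."""
    if sixth.parameter and sixth.param_type:
        PARAM_TYPE[sixth.parameter] = sixth.param_type
        PRESET_PARAMS.add(sixth.parameter)
```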
PCT/CN2021/087958 2021-04-17 2021-04-17 Voice interaction method and apparatus WO2022217621A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202180005755.9A CN115500085A (zh) 2021-04-17 2021-04-17 Voice interaction method and apparatus
EP21936495.7A EP4318464A4 (en) 2021-04-17 2021-04-17 VOICE INTERACTION METHOD AND DEVICE
PCT/CN2021/087958 WO2022217621A1 (zh) 2021-04-17 2021-04-17 Voice interaction method and apparatus
US18/488,647 US20240046931A1 (en) 2021-04-17 2023-10-17 Voice interaction method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/087958 WO2022217621A1 (zh) 2021-04-17 2021-04-17 Voice interaction method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/488,647 Continuation US20240046931A1 (en) 2021-04-17 2023-10-17 Voice interaction method and apparatus

Publications (1)

Publication Number Publication Date
WO2022217621A1 true WO2022217621A1 (zh) 2022-10-20

Family

ID=83639454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087958 WO2022217621A1 (zh) 2021-04-17 2021-04-17 Voice interaction method and apparatus

Country Status (4)

Country Link
US (1) US20240046931A1 (zh)
EP (1) EP4318464A4 (zh)
CN (1) CN115500085A (zh)
WO (1) WO2022217621A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496973B (zh) * 2024-01-02 2024-03-19 四川蜀天信息技术有限公司 Method, apparatus, device, and medium for improving the interactive experience of human-machine dialogue

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410927B (zh) * 2018-11-29 2020-04-03 北京蓦然认知科技有限公司 Speech recognition method, apparatus, and system combining offline command words with cloud parsing
US11138975B2 (en) * 2019-07-31 2021-10-05 Sonos, Inc. Locally distributed keyword detection

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708865A (zh) * 2012-04-25 2012-10-03 北京车音网科技有限公司 Speech recognition method, apparatus, and system
US8996372B1 (en) * 2012-10-30 2015-03-31 Amazon Technologies, Inc. Using adaptation data with cloud-based speech recognition
CN106992009A (zh) * 2017-05-03 2017-07-28 深圳车盒子科技有限公司 In-vehicle voice interaction method and system, and computer-readable storage medium
CN107564525A (zh) * 2017-10-23 2018-01-09 深圳北鱼信息科技有限公司 Speech recognition method and apparatus
CN109859761A (zh) * 2019-02-22 2019-06-07 安徽卓上智能科技有限公司 Intelligent voice interaction control method
CN111696534A (zh) * 2019-03-15 2020-09-22 阿里巴巴集团控股有限公司 Voice interaction device and system, device control method, computing device, and medium
CN110444206A (zh) * 2019-07-31 2019-11-12 北京百度网讯科技有限公司 Voice interaction method and apparatus, computer device, and readable medium
CN111627435A (zh) * 2020-04-30 2020-09-04 长城汽车股份有限公司 Speech recognition method and system, and voice-command-based control method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4318464A4 *

Also Published As

Publication number Publication date
CN115500085A (zh) 2022-12-20
EP4318464A4 (en) 2024-05-08
US20240046931A1 (en) 2024-02-08
EP4318464A1 (en) 2024-02-07

Similar Documents

Publication Publication Date Title
US11437041B1 (en) Speech interface device with caching component
KR102429436B1 (ko) Server for determining a target device based on a user's input and controlling the target device, and operating method thereof
US9953648B2 (en) Electronic device and method for controlling the same
KR102561712B1 (ko) Speech recognition apparatus and operating method thereof
JP2020527753A (ja) View-based voice interaction method, apparatus, server, terminal, and medium
US20070208556A1 (en) Apparatus for providing voice dialogue service and method of operating the same
KR102484257B1 (ko) Electronic device, document display method thereof, and non-transitory computer-readable recording medium
EP3523718B1 (en) Creating a cinematic storytelling experience using network-addressable devices
US11881209B2 (en) Electronic device and control method
US11069351B1 (en) Vehicle voice user interface
US11295743B1 (en) Speech processing for multiple inputs
US20230290338A1 (en) Methods for natural language model training in natural language understanding (nlu) systems
US20240046931A1 (en) Voice interaction method and apparatus
US11393456B1 (en) Spoken language understanding system
US20200365139A1 (en) Information processing apparatus, information processing system, and information processing method, and program
CN111539217B (zh) Method, device, and system for disambiguating natural language content titles
US11605380B1 (en) Coordinating content-item output across multiple electronic devices
US11694684B1 (en) Generation of computing functionality using devices
US11775617B1 (en) Class-agnostic object detection
US11893984B1 (en) Speech processing system
US12010387B1 (en) Content-based voice targeting of devices using slot and task data
WO2023092399A1 (zh) Speech recognition method, speech recognition apparatus, and system
CN116758914A (zh) Smart home voice interaction control method and system
CN118034415A (zh) Control method and apparatus for home devices, electronic device, and storage medium
CN116226358A (zh) Method, apparatus, device, and medium for generating dialogue recommendation corpora

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21936495

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2021936495

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2021936495

Country of ref document: EP

Effective date: 20231026

NENP Non-entry into the national phase

Ref country code: DE