WO2022262366A1 - Cross-device dialog service continuation method, system, electronic device and storage medium - Google Patents

Cross-device dialog service continuation method, system, electronic device and storage medium

Info

Publication number
WO2022262366A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
information
electronic device
voice
execution instruction
Prior art date
Application number
PCT/CN2022/084544
Other languages
English (en)
French (fr)
Inventor
王翃宇
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP22823857.2A (published as EP4343756A1)
Publication of WO2022262366A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/72442 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality for playing music files
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80 Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/72409 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality by interfacing with external accessories
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/74 Details of telephonic subscriber devices with voice recognition means

Definitions

  • The present application relates to the field of natural language processing (NLP), and in particular to a cross-device dialog service continuation method, system, electronic device, and computer-readable storage medium.
  • Electronic devices can perform human-computer interaction with users based on an NLP dialogue system to implement corresponding voice services. For example, after waking up the voice assistant of a mobile phone, the user inputs the voice "play song A" into the phone; the phone processes the input voice based on the dialogue system, obtains an execution instruction to play song A, and, in response to that instruction, automatically plays song A.
  • Embodiments of the present application provide a cross-device dialog service connection method, system, electronic device, and computer-readable storage medium, which can realize cross-device dialog service continuity.
  • an embodiment of the present application provides a cross-device dialog service continuity system, where the system includes a first electronic device and at least one second electronic device.
  • The first electronic device is used to: collect the first user voice; and, if it is determined that the first user voice contains information instructing it to send an instruction to the second electronic device, send first information and a first execution instruction to the second electronic device, where the first information includes information describing the intention of the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice.
  • The second electronic device is used to: collect a second user voice after receiving the first execution instruction; and execute a second execution instruction corresponding to the second user voice, the second execution instruction being an instruction generated according to the first information and the second user voice.
  • In other words, when the first electronic device sends the first execution instruction to the second electronic device, it also sends the first information, that is, the information describing the intention of the first user voice; put differently, the first information is transmitted to the second electronic device along with the service flow.
  • The second electronic device can then perform semantic understanding on the newly collected second user voice according to the information from the first electronic device describing the intention of the first user voice, so as to determine the intention of the second user voice, thereby realizing cross-device dialogue service continuation.
  • For example, the first electronic device is a mobile phone, the second electronic device is a large-screen device, the first user voice is "recommend a song to the large-screen device", the intention of that voice is to recommend music, and the second user voice is "change another". The mobile phone sends the first execution instruction together with the music-recommendation intent to the large-screen device. When the large-screen device collects "change another", it recognizes, from the transferred music-recommendation intent, that the intention of "change another" is also to recommend music, and in response it recommends another song to the user, as sketched below.
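  • To make the transferred "first information" concrete, here is a minimal Python sketch of the kind of payload the source device might send alongside the execution instruction; all class and field names are illustrative assumptions, not the application's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class DialogTurn:
    text: str    # recognized text of one dialogue round
    intent: str  # intent extracted from that round

@dataclass
class ContinuationMessage:
    # execution instruction corresponding to the first user voice
    execution_instruction: dict
    # "first information": text and intent of the latest N rounds of dialogue
    first_information: list = field(default_factory=list)

msg = ContinuationMessage(
    execution_instruction={"action": "recommend_music", "song": "Song A"},
    first_information=[
        DialogTurn(text="recommend a song to the large-screen device",
                   intent="recommend_music"),
    ],
)
```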
  • the information used to describe the intention of the first user's voice includes the first text of the first user's voice and/or the first intention of the first user's voice.
  • For example, the first user voice is "recommend a song to the large-screen device", the first text is the text "recommend a song to the large-screen device", and the first intention is to recommend music.
  • In some embodiments, the first information includes the text and/or intents of N rounds of dialogue, where N is a positive integer greater than 1. The text of the N rounds of dialogue includes the first text of the first user voice, and the intents of the N rounds of dialogue include the first intention of the first user voice; the N rounds of dialogue are user dialogues collected by the first electronic device.
  • The first electronic device transmits information such as the intents of N rounds of dialogue to the second electronic device, which allows the second electronic device to identify the intention of a newly collected user voice more accurately and realizes a more open cross-device dialogue service continuation.
  • The N rounds of dialogue may include the first user voice, and the first electronic device may transmit the relevant information of the latest N rounds of dialogue to the second electronic device. For example, N = 3.
  • the first execution instruction includes information representing a slot of the first user's voice.
  • the second electronic device can more accurately recognize the newly collected user voice, and realize a more open cross-device dialogue service connection.
  • For example, the first electronic device not only transmits the intent information of the user voice to the second electronic device, but also extracts the song slot from the user voice and passes it to the second electronic device.
  • In some embodiments, the first electronic device is specifically configured to: perform speech recognition on the first user voice to obtain the first text; perform semantic understanding on the first text to obtain the first intention and the first slot of the first user voice; if the first slot includes a target device slot and the entity of the target device slot is the second electronic device, determine that the first user voice contains information instructing it to send an instruction to the second electronic device; and generate the first execution instruction corresponding to the first user voice according to the first intention and the first slot. A code sketch of this decision flow follows.
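  • The following Python sketch illustrates the flow just described, under assumed module interfaces (the asr, nlu, dm objects and the send/execute callables are hypothetical stand-ins, not APIs defined by the application):

```python
def handle_first_user_voice(audio, self_device_id, asr, nlu, dm, send, execute):
    """Decide locally whether the dialogue service should be transferred."""
    text = asr.recognize(audio)               # speech recognition -> first text
    intent, slots = nlu.understand(text)      # semantic understanding -> intent, slots
    instruction = dm.generate(intent, slots)  # dialogue management -> execution instruction
    target = slots.get("target_device")       # e.g. "large_screen_device"
    if target and target != self_device_id:
        # The user voice instructs sending an instruction to another device:
        # transfer the execution instruction together with the first information.
        first_information = {"text": text, "intent": intent}
        send(target, first_information, instruction)
    else:
        execute(instruction)                  # otherwise act on this device
```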
  • In some embodiments, the system further includes a third electronic device communicatively connected to the first electronic device. The first electronic device is specifically configured to: send the first user voice to the third electronic device; and receive the first slot, the first intention, and the first execution instruction from the third electronic device, where the first slot and the first intention are extracted by the third electronic device from the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice generated by the third electronic device according to the first slot and the first intention. If the first slot includes a target device slot and the entity of the target device slot is the second electronic device, the first electronic device determines that the first user voice contains information instructing it to send an instruction to the second electronic device.
  • That is, the first electronic device may use the voice service capability of the third electronic device to analyze and recognize the first user voice. In this way, the first electronic device need not be capable of deploying a voice service system itself, so cross-device dialogue service continuation applies to a wider range of devices.
  • For example, the first electronic device may also be a device such as a smart watch, smart earphones, or a smart speaker. Even if these devices cannot deploy modules such as speech recognition, semantic understanding, and dialogue management, they can still achieve cross-device dialogue service continuation.
  • In some embodiments, the second electronic device is specifically configured to: perform speech recognition on the second user voice to obtain the second text; perform semantic understanding on the second text according to the first information to obtain the semantic information of the second user voice; and generate the second execution instruction corresponding to the second user voice according to that semantic information.
  • Specifically, the second electronic device may use the first information as the latest context of its semantic understanding module and input the second text into that module to obtain the semantic information of the second user voice, where the semantic understanding module uses the latest context to perform semantic understanding on the second text. A sketch of this context seeding follows.
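  • A minimal sketch, assuming a toy rule-based understanding module (the real semantic understanding module is a full NLU model; the class and method names here are invented for illustration):

```python
class SemanticUnderstandingModule:
    """Toy NLU that resolves elliptical follow-ups from transferred context."""

    def __init__(self):
        self.latest_context = []  # list of {"text": ..., "intent": ...} turns

    def set_latest_context(self, first_information):
        # Seed the module with the first information received from the source device.
        self.latest_context = list(first_information)

    def understand(self, text):
        # An utterance like "change another" carries no explicit intent of its own;
        # inherit the most recent intent from the transferred dialogue context.
        if text == "change another" and self.latest_context:
            return {"intent": self.latest_context[-1]["intent"], "slots": {}}
        return {"intent": "unknown", "slots": {}}

nlu = SemanticUnderstandingModule()
nlu.set_latest_context([{"text": "recommend a song to the large-screen device",
                         "intent": "recommend_music"}])
print(nlu.understand("change another"))  # {'intent': 'recommend_music', 'slots': {}}
```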
  • In some embodiments, the system further includes a fourth electronic device communicatively connected to the second electronic device. The second electronic device is specifically configured to: send the second user voice and the first information to the fourth electronic device; and receive the semantic information of the second user voice and the second execution instruction from the fourth electronic device.
  • The semantic information of the second user voice is obtained by the fourth electronic device by performing semantic understanding on the second user voice according to the first information, and the second execution instruction is the execution instruction corresponding to the second user voice generated by the fourth electronic device according to that semantic information.
  • That is, the second electronic device may use the voice service capability of the fourth electronic device to analyze and recognize the second user voice. In this way, the second electronic device need not be capable of deploying a voice service system itself, so cross-device dialogue service continuation applies to a wider range of devices.
  • In some embodiments, the first electronic device is specifically configured to: determine whether the user account of the first electronic device and the user account of the second electronic device belong to the same user; if so, send the first execution instruction and the first information to the second electronic device, and also send second information, where the second information includes any one or any combination of first user information, scene information, and first application state information.
  • The first user information is information describing the user of the first electronic device, the first application state information is information characterizing the state of the first target application on the first electronic device, and the scene information is information describing the user's scene.
  • The second electronic device is specifically configured to generate the second execution instruction according to the first information, the second user voice, and the second information.
  • That is, the first electronic device automatically identifies whether the users of the two devices are the same user; if they are, then in addition to transmitting the first execution instruction and the first information to the second electronic device, it also sends the second information.
  • In this way, the second electronic device can provide the user with more personalized and precise services according to the second information, improving the user experience under cross-device dialogue service continuation.
  • For example, the first user information is the information of user A, from which it can be learned that the user's preferred song genre is pop. From the scene information, it can be learned that the user is in a walking state, that is, in a sports scene. The first target application is Huawei Music installed on the mobile phone, and the first application state information includes the historical play records of songs in Huawei Music.
  • The first user voice is "recommend a song to the large-screen device", and the second user voice is "change another". After the mobile phone sends the above information to the large-screen device, the large-screen device determines, from the user's sports scene and preferred genre, that a pop song suited to a sports scene should be recommended. Further, based on the historical play records, it selects the most-played pop song belonging to the sports scene as the recommended song. In this way, the recommended song better matches the user's preferences.
  • The second information can also be used for semantic understanding on the second electronic device, so that it can more accurately understand the intent of a newly collected user voice, realizing a more open cross-device dialogue service continuation.
  • For example, the first electronic device transmits historically played song information, including song titles, to the second electronic device. When the second electronic device collects "change to XXX", it can recognize "XXX" as a song title from that information, and thus recognize that the intention of the newly collected user voice is to play the song XXX. The sketch below illustrates how such second information might be used.
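  • A minimal sketch, assuming an invented dictionary layout for the second information and a simple ranking rule (the application specifies neither):

```python
second_information = {
    "user_info": {"preferred_genre": "pop"},   # first user information
    "scene_info": {"activity": "walking"},     # user scene
    "app_state": {                             # first application state
        "history": [
            {"title": "Song A", "plays": 12, "genre": "pop", "scene": "sports"},
            {"title": "Song B", "plays": 30, "genre": "pop", "scene": "sports"},
            {"title": "Song C", "plays": 45, "genre": "folk", "scene": "rest"},
        ]
    },
}

def pick_recommendation(info):
    """Pick the most-played song matching the user's genre and current scene."""
    genre = info["user_info"]["preferred_genre"]
    scene = "sports" if info["scene_info"]["activity"] == "walking" else "rest"
    candidates = [s for s in info["app_state"]["history"]
                  if s["genre"] == genre and s["scene"] == scene]
    return max(candidates, key=lambda s: s["plays"])["title"] if candidates else None

print(pick_recommendation(second_information))  # Song B
```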
  • In some embodiments, the first electronic device is specifically configured to: if the user account of the first electronic device and the user account of the second electronic device do not belong to the same user, send the first execution instruction and the first information to the second electronic device.
  • The second electronic device is specifically configured to generate the second execution instruction according to the first information, the second user voice, and third information, where the third information includes second user information and/or second application state information; the second user information is information describing the user of the second electronic device, and the second application state information is information characterizing the state of the second target application on the second electronic device.
  • That is, the first electronic device automatically identifies whether the accounts of the two devices belong to the same user, and if not, the user information on the first electronic device need not be sent.
  • In this way, the second electronic device can provide the user with more personalized and precise services according to its own relevant information.
  • For example, the second target application may be a Huawei Music application installed on the second electronic device.
  • In some embodiments, when there are at least two second electronic devices, the first electronic device is specifically configured to send the first information and the first execution instruction to the respective second electronic devices through different communication connections.
  • That is, when the service needs to be transferred to at least two second electronic devices, the first electronic device automatically identifies the type of each communication connection and sends the corresponding information to the corresponding second electronic device according to that type.
  • the second electronic device is specifically configured to: when executing the first execution instruction, or when prompting the user whether to execute the first execution instruction, collect the voice of the second user.
  • In this way, the continuation of the cross-device dialogue service can be made more natural and user-friendly.
  • the second electronic device is further configured to wake up the voice assistant after receiving the first execution instruction, where the second electronic device includes the voice assistant.
  • the second electronic device automatically wakes up the voice assistant, and the user does not need to wake up the voice assistant on the second device through a specific wake-up word, so that the cross-device dialogue business connection is smoother and the user experience is better.
  • the first execution instruction is an instruction for recommending music
  • the second execution instruction is an instruction for recommending another song.
  • An embodiment of the present application provides a cross-device dialog service continuation method, applied to a first electronic device. The method includes: collecting the first user voice; and, if it is determined that the first user voice contains information instructing the first electronic device to send an instruction to a second electronic device, sending first information and a first execution instruction to the second electronic device, where the first information includes information describing the intention of the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice.
  • the information used to describe the intention of the first user's voice includes the first text of the first user's voice and/or the first intention of the first user's voice.
  • the first information includes text and/or intentions of N rounds of dialogue, where N is a positive integer greater than 1;
  • the text of the N rounds of dialogue includes the first text of the first user's voice, and the intent of the N rounds of dialogue includes the first intention of the first user's voice; wherein, the N rounds of dialogue are user dialogues collected by the first electronic device.
  • the first execution instruction includes information representing a slot of the first user's voice.
  • In some embodiments, before sending the first information and the first execution instruction to the second electronic device, the method includes: performing speech recognition and semantic understanding on the first user voice to obtain the first intention and the first slot; and, if the first slot includes the target device slot and the entity of the target device slot is the second electronic device, determining that the first user voice contains information instructing the first electronic device to send an instruction to the second electronic device.
  • In other embodiments, the method includes: sending the first user voice to a third electronic device; receiving the first slot, the first intention, and the first execution instruction from the third electronic device, where the first slot and the first intention are extracted by the third electronic device from the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice generated by the third electronic device according to the first slot and the first intention; and, if the first slot includes the target device slot and the entity of the target device slot is the second electronic device, determining that the first user voice contains information instructing the first electronic device to send an instruction to the second electronic device.
  • In some embodiments, before sending the first information and the first execution instruction to the second electronic device, the method further includes: determining whether the user account of the first electronic device and the user account of the second electronic device belong to the same user, and if so, also sending second information to the second electronic device, where the second information includes any one or any combination of the first user information, the scene information, and the first application state information.
  • the first user information is the information used to describe the user of the first electronic device
  • the scene information is the information used to describe the user scene
  • The first application state information is information used to characterize the state of the first target application on the first electronic device.
  • In some embodiments, when there are at least two second electronic devices, sending the first information and the first execution instruction to the second electronic devices includes: sending the first information and the first execution instruction to the respective second electronic devices through different communication connections.
  • the first execution instruction is an instruction for recommending music.
  • An embodiment of the present application provides a cross-device dialog service continuation method, applied to a second electronic device. The method includes: receiving first information and a first execution instruction from a first electronic device, where the first information includes information describing the intention of the first user voice, the first execution instruction is the execution instruction corresponding to the first user voice, and the first user voice is a voice collected by the first electronic device that contains information instructing the first electronic device to send an instruction to the second electronic device; collecting the second user voice; and executing a second execution instruction corresponding to the second user voice, where the second execution instruction is an instruction generated according to the first information and the second user voice.
  • the information used to describe the intention of the first user's voice includes the first text of the first user's voice and/or the first intention of the first user's voice.
  • the first information includes text and/or intentions of N rounds of dialogue, where N is a positive integer greater than 1;
  • the text of the N rounds of dialogue includes the first text of the first user's voice, and the intent of the N rounds of dialogue includes the first intention of the first user's voice; wherein, the N rounds of dialogue are user dialogues collected by the first electronic device.
  • the first execution instruction includes information representing a slot of the first user's voice.
  • In some embodiments, executing the second execution instruction corresponding to the second user voice includes: using the first information as the latest context of the semantic understanding module, where the second electronic device includes the semantic understanding module; and inputting the second text into the semantic understanding module to obtain the semantic information of the second user voice output by the semantic understanding module, where the semantic understanding module uses the latest context to perform semantic understanding on the second text.
  • In some embodiments, the method further includes: receiving second information from the first electronic device, where the second information includes any one or any combination of first user information, scene information, and first application state information. In this case, generating the second execution instruction corresponding to the second user voice includes generating it according to the first information, the second user voice, and the second information.
  • the first user information is the information used to describe the user of the first electronic device
  • the scene information is the information used to describe the user scene
  • The first application state information is information used to characterize the state of the first target application on the first electronic device.
  • In some embodiments, generating the second execution instruction includes: generating it according to the first information, the second user voice, and third information, where the third information includes second user information and/or second application state information; the second user information is information describing the user of the second electronic device, and the second application state information is used to characterize the state of the second target application on the second electronic device.
  • executing the second execution instruction corresponding to the voice of the second user includes:
  • The semantic information of the second user voice is obtained by the fourth electronic device by performing semantic understanding on the second user voice according to the first information.
  • The second execution instruction is generated by the fourth electronic device according to the semantic information of the second user voice.
  • collecting the voice of the second user includes: collecting the voice of the second user when executing the first execution instruction or prompting the user whether to execute the first execution instruction.
  • Before collecting the second user voice, the method further includes: after receiving the first execution instruction, waking up the voice assistant, where the second electronic device includes the voice assistant.
  • the second execution instruction is an instruction for recommending another song.
  • An embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the method of the above-mentioned second or third aspect is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of the second aspect or the third aspect above is implemented.
  • An embodiment of the present application provides a chip system. The chip system includes a processor coupled with a memory, and the processor executes a computer program stored in the memory to implement the method described in any one of the above-mentioned second or third aspects.
  • the chip system can be a single chip, or a chip module composed of multiple chips.
  • an embodiment of the present application provides a computer program product, which, when the computer program product is run on an electronic device, causes the electronic device to execute the method described in any one of the above-mentioned second aspect or third aspect.
  • FIG. 1 is a schematic diagram of a voice service system provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a scenario where a voice-controlled mobile phone plays music provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a scene where a mobile phone recommends music to a large-screen device provided in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a cross-device dialog service connection system provided by an embodiment of the present application.
  • FIG. 5 is another schematic diagram of the cross-device dialogue service connection system provided by the embodiment of the present application.
  • 6A to 6B are schematic diagrams of scenarios where a mobile phone recommends music to a large-screen device according to an embodiment of the present application
  • FIG. 7 is a schematic flow diagram of a mobile phone recommending music to a large-screen device according to an embodiment of the present application
  • FIGS. 8A to 8C are schematic diagrams of navigation scenarios provided by the embodiment of the present application.
  • FIG. 8D is a schematic diagram of a video recommendation scene provided by an embodiment of the present application.
  • FIG. 9 is another schematic diagram of the cross-device dialog service continuation system provided by the embodiment of the present application.
  • FIG. 10 is another schematic flow diagram of the cross-device dialogue service connection method provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of a scene where earphones transfer music to speakers for playback provided by an embodiment of the present application
  • FIG. 12 is another schematic diagram of the cross-device dialogue service connection system provided by the embodiment of the present application.
  • FIG. 13 is another schematic flow diagram of the cross-device dialogue service connection method provided by the embodiment of the present application.
  • FIG. 14 is a schematic diagram of a scenario in which a mobile phone recommends music to a smart speaker according to an embodiment of the present application
  • FIG. 15 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.
  • Voice service system or dialogue service system.
  • FIG. 1 shows a schematic diagram of a voice service system provided by an embodiment of the present application.
  • the speech system can include a speech recognition (Automatic Speech Recognition, ASR) module 11, semantic understanding (Natural Language Understanding, NLU) module 12, dialogue management (Dialogue Management, DM) module 13 and speech synthesis (Text To Speech, TTS) module 14.
  • the speech recognition module 11 is used for converting the speech information input by the user 15 into text information.
  • the semantic understanding module 12 is used for performing semantic understanding according to the text information output by the speech recognition module 11 to obtain semantic information, and the semantic information usually includes intent and slot value.
  • The dialogue management module 13 is used to update the system state according to the semantic information output by the semantic understanding module 12 and the dialogue state, and to output the next system action.
  • the dialogue management module 13 includes a dialogue state tracking (Dialog State Tracking, DST) submodule and a dialogue decision (Dialog Policy, DP) submodule.
  • the dialogue state tracking sub-module is used to maintain and update the dialogue state
  • the dialogue decision-making sub-module is used to generate system behavior according to the dialogue state and semantic information to determine the next action.
  • the electronic device can perform corresponding operations according to the instruction output by the dialogue management module 13 .
  • the instruction output by the dialogue management module 13 is an instruction for instructing to output voice.
  • the speech synthesis module 14 can generate speech information according to the instruction output by the dialogue management module 13, and obtain the output speech.
  • For example, the voice information input by the user 15 is "play a song", and the dialogue management module 13 outputs an instruction for instructing voice output; according to that instruction, the speech synthesis module 14 generates the output voice "What song do you want to play?".
  • the electronic device will perform corresponding operations in response to the instruction.
  • the output of the dialogue management module 13 may be embodied as an execution instruction, and the execution instruction is used to indicate the next action.
  • the voice information input by the user 15 is "play song A”
  • the dialogue management module 13 outputs an execution instruction to play song A, and the electronic device automatically plays song A in response to the execution instruction.
  • a natural language generation (Natural Language Generation, NLG) module may also be included.
  • the natural language generating module is used to textualize the system actions output by the dialogue management module 13 to obtain natural language text.
  • the natural language text output by the natural language generation module can be used as the input of the speech synthesis module 14; the speech synthesis module 14 converts the input natural speech text into speech information to obtain output speech.
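  • Putting the modules of FIG. 1 together, a minimal pipeline sketch follows; the module objects and their method names are placeholders assumed for illustration, not interfaces defined by the application:

```python
def run_dialog_turn(audio, asr, nlu, dm, nlg, tts):
    """One turn through the voice service system: ASR -> NLU -> DM -> NLG -> TTS."""
    text = asr.recognize(audio)        # speech -> text (module 11)
    semantics = nlu.understand(text)   # text -> intent and slots (module 12)
    action = dm.decide(semantics)      # update dialogue state, pick next action (module 13)
    if action["type"] == "speak":
        reply = nlg.generate(action)   # system action -> natural language text
        return tts.synthesize(reply)   # text -> output speech (module 14)
    return action["instruction"]       # otherwise, an execution instruction
```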
  • the intent may refer to the user's purpose expressed in the user's voice.
  • the user's voice is "what's the weather in Shenzhen today", and the intention of the voice is “query the weather”.
  • the user's voice is "play a song”, and the intention of the voice is "play music”.
  • One or more slots can be configured under each intent.
  • the slot refers to the key information that the system needs to collect from the user's voice.
  • the configured slots may include location slots and time slots.
  • the location slot is used to determine which location needs to be queried for the weather, and the time slot is used to determine when the weather needs to be queried.
  • A slot includes attributes such as a slot value. The slot value refers to the specific parameter of the slot and can also be called the entity of the slot. For example, if the user voice is "what's the weather in Shenzhen today", a location slot and a time slot can be extracted from the voice; the entity of the location slot is "Shenzhen", and the entity of the time slot is "today".
  • the intent category and the slot configured under each intent category can be preset.
  • the slots configured under the intention of recommending music include, but are not limited to, target device slots, where the target device slots are used to indicate the target device for continuing the dialogue service.
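  • As an illustration of such a preset configuration, a short Python sketch follows; the intent names and slot lists are invented examples consistent with the weather and music examples above, not the application's actual schema:

```python
# Preset mapping from intent category to the slots configured under it.
INTENT_SLOT_CONFIG = {
    "query_weather": ["location", "time"],
    "play_music": ["song"],
    "recommend_music": ["target_device"],  # target device for continuing the service
}

# Example semantic-understanding output for "what's the weather in Shenzhen today":
semantics = {
    "intent": "query_weather",
    "slots": {"location": "Shenzhen", "time": "today"},  # slot entities
}
assert set(semantics["slots"]) <= set(INTENT_SLOT_CONFIG[semantics["intent"]])
```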
  • a mobile phone needs to connect the dialog service to a large-screen device.
  • the source device is a mobile phone
  • the target device is a large-screen device.
  • the electronic device can conduct one or more rounds of man-machine dialogue with the user, so as to realize the corresponding voice service.
  • FIG. 2 shows a schematic diagram of a scene where a mobile phone is controlled by voice to play music according to an embodiment of the present application.
  • the main interface 22 of the mobile phone 21 includes application programs such as application store, clock, memo, gallery and music.
  • The mobile phone 21 collects the user voice "Xiaoyi Xiaoyi, recommend a song"; the deployed voice service system performs speech recognition, semantic understanding, and other processing on the user voice, determines that the user's intention is to recommend music, and obtains the execution instruction for recommending music.
  • the mobile phone 21 can determine the recommended song according to the preset recommendation rules. For example, the mobile phone 21 can use the most played song in the last 7 days as the recommended song according to the user's historical play records.
  • the mobile phone 21 automatically plays the song A and displays the voice assistant interface 23 in response to the execution instruction of the recommended music.
  • the voice assistant interface 23 includes the text 24 of the user's voice, the voice assistant's answer sentence text 25 for the user's voice, and a music control 26 . At this moment, the song being played is displayed as song A in the music control 26 .
  • After the mobile phone 21 automatically plays song A in response to the user voice "Xiaoyi Xiaoyi, recommend a song", if the user wants to change the song, the user inputs the voice "change another" to the mobile phone 21. After the mobile phone 21 collects the user voice "change another", the speech recognition module converts it into text information, and the text information 27 of the input voice is displayed in the voice assistant interface 23.
  • The mobile phone 21 inputs the text information of "change another" into the semantic understanding module, and the semantic understanding module determines from information such as the historical intents and the input text that the user's intention is to change the song, obtaining an execution instruction to play another song.
  • The historical intention here is to recommend music, that is, the intention determined from the user voice "Xiaoyi Xiaoyi, recommend a song".
  • The mobile phone 21 automatically plays song B and displays the voice assistant interface 28 in response to the execution instruction to play another song.
  • the voice assistant interface 28 includes the text 29 of the user's voice and the music control 26. At this time, the song being played is Song B displayed in the music control 26.
  • For example, the entire dialogue interaction process between the user and the voice assistant Xiaoyi of the mobile phone 21 can be as follows: the user says "Xiaoyi Xiaoyi, recommend a song"; the phone plays song A; the user says "change another"; the phone plays song B.
  • the user's voice "change another" does not clarify the user's intention, but the mobile phone 21 can still accurately identify the user's intention according to historical intentions and contextual information such as dialogue materials.
  • the dialogue material may include "Xiaoyi Xiaoyi, recommend a song". This is because the entire dialog interaction process takes place on the side of the mobile phone 21, and the mobile phone 21 stores relevant information of the dialog process.
  • FIG. 3 shows a schematic diagram of a scene where a mobile phone recommends music to a large-screen device according to an embodiment of the present application.
  • the first electronic device is a mobile phone
  • the second electronic device is a large-screen device.
  • Voice assistants are installed in mobile phones and large-screen devices, and the voice service system in Figure 1 is deployed.
  • the user 31 inputs the user's voice "recommend a song to the large-screen device” into the mobile phone 32 .
  • After collecting the voice, the voice assistant in the mobile phone 32 uses the voice service system shown in FIG. 1 to determine that the intention of the user voice "recommend a song to the large-screen device" is to recommend music, and can extract the target device slot, whose entity is the large-screen device; then the mobile phone 32 generates an execution instruction for recommending music, displays the voice assistant interface 33, and outputs the answer voice "OK" for the user voice; finally, it sends the execution instruction for recommending music to the large-screen device 34.
  • the execution instruction of the recommended music includes but not limited to song name information, which is used to instruct the large-screen device 34 to play the song.
  • After receiving the execution instruction from the mobile phone 32, the large-screen device 34 pops up a window 35 in response to the instruction. A prompt message is displayed in the window 35, asking the user whether to play the song A recommended by the mobile phone. The user can allow the large-screen device 34 to play song A by clicking the "play" button on the window 35, or cancel playing song A by clicking the "cancel" button on the window.
  • Of course, the user can also input the voice "play" or "cancel" to the large-screen device 34 to indicate a button selection: the voice "play" makes the large-screen device 34 select the "play" button, and the voice "cancel" makes it select the "cancel" button.
  • In the process where the large-screen device 34 displays the window 35, if the user 31 wants to change the song, the user voice "change another" can be input to the large-screen device 34.
  • After the voice assistant in the large-screen device 34 collects the user voice "change another", it inputs the text information "change another" to the semantic understanding module when performing intent recognition.
  • However, the large-screen device 34 does not have contextual information such as the historical intention "recommend music" of the dialogue process, the entity of the target device slot, or the historical corpus "recommend a song to the large-screen device" preceding "change another"; as a result, the semantic understanding module cannot recognize the intention of the user voice, and the large-screen device 34 cannot play another song in response.
  • In other words, the large-screen device 34 can only choose to play or cancel; it cannot recognize other user voices associated with the previous conversation and cannot realize cross-device dialogue service continuation.
  • the inventor found that the relevant information used to describe the voice intention of the user can be transmitted to the target device along with the service flow, so as to realize the inter-device dialogue service continuation.
  • the embodiment of the present application provides a cross-device dialog service connection solution.
  • the first electronic device may transmit context information such as the intent and slot of the latest N rounds of conversations to the second electronic device. That is, the first electronic device transmits the context information such as the intent and slot of the latest N rounds of conversations to the second electronic device along with the service flow.
  • the second electronic device performs intent recognition, it can accurately recognize the intent of the user's voice according to the received context information such as the intent and the slot, so as to realize cross-device dialogue service continuity.
  • N is a positive integer greater than or equal to 1.
  • FIG. 4 shows a schematic diagram of a cross-device dialogue service continuation system provided by an embodiment of the present application.
  • the system may include a first electronic device 41 and a second electronic device 42 .
  • the first electronic device 41 and the second electronic device 42 may exchange information through a communication connection, and the communication connection may be, for example, a Bluetooth connection or a Wi-Fi point-to-point connection.
  • the first electronic device 41 and the second electronic device 42 may include a voice assistant application program, or an application program integrated with a voice assistant function.
  • the inter-device dialog service continuity system may include one or at least two second electronic devices 42 . That is to say, the first electronic device 41 can transfer services to one or at least two second electronic devices 42 at the same time, and transmit information such as intent and slot to the second electronic devices 42 when the services are transferred.
  • the first electronic device 41 and the second electronic device 42 may be devices logged into the same user account.
  • the first electronic device 41 is a mobile phone
  • the second electronic device 42 is a large-screen device, and both the mobile phone and the large-screen device log in to the same Huawei user account.
  • the first electronic device 41 and the second electronic device 42 may also be devices logged in with different user accounts, for example, the account logged in by the first electronic device 41 is user A, and the account logged in by the second electronic device 42 is user B. At this time, the first electronic device 41 and the second electronic device 42 may belong to the same group, for example, belong to the same family group.
  • In other embodiments, the first electronic device 41 and the second electronic device 42 may be devices that have established a trusted connection. For example, the first electronic device 41 is a mobile phone, the second electronic device 42 is a large-screen device, and the mobile phone and the large-screen device establish a trusted connection through "One-touch connection".
  • both the first electronic device 41 and the second electronic device 42 may include the voice service system corresponding to FIG. 1 , or some modules in the voice service system.
  • the first electronic device 41 and the second electronic device 42 are generally rich devices.
  • a rich device refers to a device with abundant resources
  • the resource-rich device may refer to an electronic device with sufficient storage space, and/or an electronic device with sufficient processing performance.
  • an electronic device with sufficient processing performance, sufficient memory, and sufficient storage space can be called a rich device or a fat device.
  • rich devices may include mobile phones, computers, servers, and tablets.
  • a thin device as opposed to a rich device refers to a device with limited resources, and the device with limited resources may refer to an electronic device with limited storage space, and/or an electronic device with limited processing performance.
  • the device with limited resources may refer to an electronic device with limited storage space, and/or an electronic device with limited processing performance.
  • an electronic device with low processing performance, less memory, and less storage space can be called a thin device.
  • thin devices may include earphones, speakers, and watches.
  • As shown in FIG. 5, the first electronic device 51 includes a first application program 511, a first speech recognition module 512, a first semantic understanding module 513, a first dialogue management module 514, and a first instruction interaction service 515.
  • The second electronic device 52 includes a second application program 521, a second speech recognition module 522, a second semantic understanding module 523, a second dialogue management module 524, and a second instruction interaction service 525.
  • The first electronic device 51 and the second electronic device 52 may or may not include the NLG module and the TTS module.
  • the first application program 511 may be a voice assistant, or an application program integrated with a voice assistant function.
  • the second application program 521 may be a voice assistant, or an application program integrated with a voice assistant function.
  • the first instruction interaction service 515 and the second instruction interaction service 525 are used for instruction interaction between electronic devices. In other embodiments, instruction interaction between devices may also be implemented in other ways.
  • After collecting the user voice, the first application program 511 inputs it into the first speech recognition module 512 and obtains the output text information; the text information output by the first speech recognition module 512 is input to the first semantic understanding module 513 to obtain the semantic information of the user voice, which includes the intention extracted from the user voice and the slots corresponding to the intention; the semantic information is then input to the first dialogue management module 514 to obtain an execution instruction.
  • After the first application program 511 obtains the execution instruction output by the first dialogue management module 514, it transmits the execution instruction, the historical intents, and the slots corresponding to the historical intents to the second electronic device 52 through the first instruction interaction service 515.
  • the historical intents may include intents corresponding to the latest N rounds of conversations.
  • The first electronic device 51 may determine, according to the user voice, whether to send information such as the execution instruction, the historical intents, and the corresponding slots to the second electronic device 52, that is, whether to transfer the service; in other words, whether it is necessary to send an instruction to the second electronic device 52 carrying context information such as the historical intents and corresponding slots.
  • If the first electronic device 51 can extract a target device slot from the user voice and the entity of the target device slot is not its own device, it can determine that the service needs to be transferred and use the entity of the target device slot as the target device for the transfer. That is, the first electronic device 51 determines that it needs to send information such as the execution instruction, historical intents, and corresponding slots to the second electronic device 52 (i.e., the target device).
  • For example, the mobile phone 32 can extract the target device slot from the user voice "recommend a song to the large-screen device", and the entity of the target device slot is the large-screen device; it is then determined that the service needs to be transferred, and the target device for the transfer is the large-screen device.
  • If the first electronic device 51 can extract at least two target device slots from the user voice, and the entities of the at least two target device slots are not its own device, the first electronic device 51 can determine that it needs to send information such as execution instructions, historical intents, and corresponding slots to the target devices corresponding to the at least two target device slots. In this case, the first electronic device 51 may distribute this information to those target devices simultaneously.
  • When the first electronic device 51 distributes information such as execution instructions, historical intents, and corresponding slots to at least two second electronic devices 52 (i.e., target devices), it may first determine whether a communication connection has been established with each of the at least two second electronic devices 52.
  • If communication connections have been established, the first electronic device 51 may further determine whether the communication connections with the at least two second electronic devices 52 are of the same type. Specifically, according to each connection type, the first electronic device 51 packages the execution instruction, historical intents, corresponding slots, and other information destined for the same second electronic device 52, and then sends the packaged information to the corresponding second electronic device 52.
  • the first electronic device 51 is a mobile phone
  • the second electronic device 52 includes a large-screen device and a tablet.
  • The user voice received by the mobile phone is "recommend a song to the large-screen device and the tablet", and the mobile phone can extract two target device slots from the user voice. The entities of these two target device slots are "large-screen device" and "tablet", that is, the second electronic devices include a large-screen device and a tablet.
  • the mobile phone can determine that the business needs to be transferred to the large-screen device and the tablet, that is, it needs to send information such as execution instructions, historical intentions, and corresponding slots to the large-screen device and the tablet respectively.
  • the mobile phone detects that communication connections have been established with both the large-screen device and the tablet, and the type of communication connection with the large-screen device is a Wi-Fi point-to-point connection, and the type of communication connection with the tablet is a Bluetooth connection.
  • According to the Wi-Fi point-to-point protocol, the mobile phone packages information such as the execution instruction, historical intents, and corresponding slots, and then sends the data packets to the large-screen device through the Wi-Fi point-to-point connection, so that the execution instruction, historical intents, corresponding slots, and other information are sent to the large-screen device.
  • the mobile phone packages information such as execution instructions, historical intentions, and corresponding slots, and then sends the data packets to the tablet through the Bluetooth connection.
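  • A sketch of this per-connection dispatch, assuming invented device objects with send_wifi_p2p/send_bluetooth methods (the packaging formats of the actual Wi-Fi point-to-point and Bluetooth protocols are not modeled here):

```python
import json

def dispatch(targets, instruction, history):
    """Package the transferred information once, then send it over whichever
    transport each target device's established connection uses."""
    payload = json.dumps({"instruction": instruction, "history": history})
    for device in targets:
        if device.connection == "wifi_p2p":
            device.send_wifi_p2p(payload)   # packaged per Wi-Fi point-to-point protocol
        elif device.connection == "bluetooth":
            device.send_bluetooth(payload)  # packaged per Bluetooth protocol
        else:
            raise ValueError(f"no established connection to {device.name}")
```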
  • In some embodiments, if no connection has been established, the first electronic device 51 can judge whether it has been paired with the second electronic device 52; if it has been paired, it can use the locally stored relevant information of the second electronic device 52 and send a connection establishment request to the second electronic device 52, so as to establish a communication connection with it.
  • the first electronic device 51 sends information such as execution instructions, historical intentions and corresponding slots to the second electronic device 52 according to the type of the communication connection.
  • the relevant information of the second electronic device 52 may exemplarily include a device identifier, an IP address, and the like.
  • For example, if the mobile phone detects that no connection has been established with the large-screen device but the relevant information of the large-screen device is stored locally, it sends a Wi-Fi point-to-point connection establishment request to the large-screen device based on information such as the large-screen device's IP address and device identifier.
  • After receiving the request, the large-screen device may establish a Wi-Fi point-to-point connection with the mobile phone in response.
  • the first electronic device 51 may prompt the user that the corresponding device cannot be found.
• For example, the mobile phone can prompt the user, through a prompt window or a voice prompt, that the large-screen device cannot be found and ask the user to first establish a connection with the large-screen device.
• Even if the first electronic device 51 is not paired with the second electronic device 52, it can obtain the relevant information of the second electronic device 52 and, according to that information, initiate a request to the second electronic device 52 to establish a communication connection.
• If the first electronic device 51 can extract at least two device slots from the user's voice, and the entity of one of the device slots is this device while the entities of the other slots are not this device, it determines that the execution instruction, the historical intentions, the corresponding slots, and other information need to be sent to the corresponding second electronic devices 52 (that is, the devices corresponding to the other slots). In this case, after the first electronic device 51 obtains the execution instruction, it also executes the first execution instruction on its own device.
  • the user's voice collected by the mobile phone is "play song A and recommend the song A to the large-screen device".
• the mobile phone can extract two device slots from the user's voice; the entity of one device slot is "this device", and the entity of the other device slot is "large-screen device".
• Since the user's voice includes a target device slot whose entity is not this device, it is determined that the execution instruction, the historical intentions, the corresponding slots, and other information need to be sent to the large-screen device.
• According to the user's voice "play song A and recommend the song A to the large-screen device", the mobile phone can obtain an execution instruction for playing song A and an execution instruction for recommending song A. After obtaining the execution instruction to play song A, the mobile phone automatically plays song A. At the same time, the mobile phone also sends the execution instruction for recommending song A, the historical intentions, the corresponding slots, and other information to the large-screen device.
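• The routing rule in this example can be sketched as follows (a hedged illustration, not the patent's implementation; "this device" as the local-entity marker is an assumption):

```python
# Route each extracted device slot: a slot whose entity is "this device"
# triggers local execution; any other entity triggers a send to that device.

def execute_locally(instruction: dict) -> None:
    print("executing locally:", instruction)

def send_to_device(device: str, instruction: dict) -> None:
    print(f"sending to {device}:", instruction)

def route(device_slots: list, instruction: dict) -> None:
    for entity in device_slots:
        if entity == "this device":
            execute_locally(instruction)         # e.g. start playing song A
        else:
            send_to_device(entity, instruction)  # e.g. the large-screen device

route(["this device", "large-screen device"],
      {"intent": "recommend_music", "song_name": "song A"})
```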
• In other words, the first electronic device 51 can determine, according to the user's voice, whether to send the execution instruction, the historical intentions, the corresponding slots, and other information to one or more second electronic devices 52.
• When needed, the first electronic device sends this information to at least two second electronic devices in response to a single user voice, so as to connect the dialogue service to multiple second electronic devices, which can improve the convenience of service transfer and improve the user experience.
  • user A wants to recommend song A to user B and user C at the same time.
  • User A's device is a mobile phone
  • user B's device is a large-screen device A
  • user C's device is a large-screen device B.
• the mobile phone, large-screen device A, and large-screen device B all belong to the same group (for example, a family group or a friend group).
  • User A inputs the user voice "recommend song A to large-screen device A and large-screen device B" into the mobile phone.
• The mobile phone can extract two target device slots from the user's voice, and the entities of these two target device slots are "large-screen device A" and "large-screen device B", so it can determine that the execution instruction, the historical intentions, the corresponding slots, and other information need to be sent to large-screen device A and large-screen device B. The mobile phone then sends the execution instruction, the historical intentions, the corresponding slots, and other information corresponding to the user voice "recommend song A to large-screen device A and large-screen device B" to large-screen device A and large-screen device B respectively.
• Here, the historical intent includes the "recommend music" intent extracted from the user voice "recommend song A to large-screen device A and large-screen device B", and the corresponding slot includes the song name slot extracted from that user voice, where the entity of the song name slot is song A.
• Large-screen device A receives the execution instruction, the historical intentions, the corresponding slots, and other information from the mobile phone, and then recommends song A to user B. Similarly, large-screen device B recommends song A to user C. In this way, compared with recommending songs to two users through two separate voice commands, user A can recommend songs to large-screen device A and large-screen device B at the same time through a single voice command, which is more convenient and gives a better user experience.
• In contrast, the first electronic device 51 may determine that it is not necessary to send the execution instruction, the historical intentions, the corresponding slots, and other information to a second electronic device. In that case, after obtaining the execution instruction output by the first dialog management module 514, the first electronic device 51 executes the instruction itself to obtain the corresponding execution result.
• For example, if the mobile phone 21 cannot extract a target device slot from the user voice "Xiaoyi Xiaoyi, recommend a song", it determines that there is no need to send the execution instruction, the historical intentions, the corresponding slots, and other information to a second electronic device 52. Therefore, after obtaining the execution instruction to play a song, the mobile phone 21 automatically plays the corresponding song in response to the instruction.
• When the first electronic device 51 determines that it is necessary to send the execution instruction, the historical intentions, the corresponding slots, and other information to the second electronic device 52, then in addition to transmitting the execution instruction to the second electronic device 52 (i.e., the target device), it also transmits context information such as the historical intentions and the corresponding slots to the second electronic device 52.
  • the first electronic device 51 may transmit the intentions of the latest N rounds of dialogue and the corresponding slots to the second electronic device 52 .
  • the latest N rounds of dialogue may refer to the N rounds of dialogue whose occurrence time is closest to the current time point, that is, the latest N rounds of dialogue.
  • part of the dialogue between the user and the voice assistant Xiaoyi on the mobile phone can be shown in Table 1 below.
• If the target device slot is extracted from the user's voice and the entity of the target device slot is "large-screen device", it is determined that the execution instruction, the intents of the latest N rounds of dialogue, and other information need to be sent to the large-screen device.
• the mobile phone transmits the execution instruction for recommending song B and the intent information of "recommend song B to the large-screen device" (i.e., recommend music) to the large-screen device. Furthermore, the song title slot extracted from the user's voice can also be transmitted to the large-screen device.
• For example, if N is 2, the mobile phone transmits the intents of the latest two rounds of dialogue and the execution instruction for recommending song B to the large-screen device; in this case, the historical intents transmitted by the mobile phone to the large-screen device may include play music and recommend music.
• If N is 3, the mobile phone can transmit the intents of the latest three rounds of dialogue and the execution instruction for recommending song B to the large-screen device; in this case, the historical intents transmitted to the large-screen device likewise include play music and recommend music.
• In other words, the mobile phone can determine the N rounds of dialogue closest to the current time point according to the time each dialogue occurred, and transmit the intent of each of those N rounds, together with the execution instruction for recommending song B, to the large-screen device.
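• A minimal sketch of this "latest N rounds" selection, assuming each round is stored with its occurrence time (the record layout is illustrative):

```python
from datetime import datetime

def latest_n_rounds(history: list, n: int) -> list:
    """Return the N rounds whose occurrence time is closest to now,
    or all rounds if fewer than N exist."""
    return sorted(history, key=lambda r: r["time"])[-n:]

history = [
    {"time": datetime(2021, 6, 7, 19, 40), "intent": "play_music"},
    {"time": datetime(2021, 6, 7, 20, 10), "intent": "play_music"},
    {"time": datetime(2021, 6, 7, 20, 25), "intent": "recommend_music"},
]
print(latest_n_rounds(history, 2))  # the two most recent rounds
```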
• In other embodiments, the first electronic device 51 may transmit all historical dialogue intents and corresponding slots to the second electronic device 52.
• For example, if the number of rounds of historical dialogue is only 2, the intents of those two rounds of dialogue are transmitted to the peer device.
  • the first electronic device 51 may also transmit to the second electronic device 52 all dialogue intents whose dialogue occurrence time is after the target time point.
• The target time point may be the current time point minus a preset time threshold.
  • the preset time threshold can be set as required, for example, the preset time threshold is 24 hours, 12 hours, 6 hours, or 1 hour.
  • the preset time threshold is 24 hours, and the current time point is 20:30 on June 7, 2021. According to the preset time threshold and the current time point, it can be determined that the target time point is 20:30 on June 6, 2021.
• In this case, the mobile phone transmits the execution instruction for recommending song B and the intents of all the conversations in Table 1 to the large-screen device.
  • the target time point is 19:30 on June 7, 2021.
• the conversations that occurred after 19:30 on June 7, 2021 include "recommend song B to the large-screen device", "play song B", and "Xiaoyi Xiaoyi, play a song A".
• In this case, the mobile phone transmits the execution instruction for recommending song B and the intents of these three rounds of dialogue to the large-screen device.
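• The target-time-point rule can be sketched like this (times follow the 19:30 example above; the record layout is an assumption):

```python
from datetime import datetime, timedelta

def rounds_after_target(history: list, now: datetime,
                        threshold: timedelta) -> list:
    """Keep every dialogue round that occurred after (now - threshold)."""
    target = now - threshold  # e.g. 20:30 minus 1 hour = 19:30
    return [r for r in history if r["time"] > target]

history = [
    {"time": datetime(2021, 6, 7, 19, 40), "text": "Xiaoyi Xiaoyi, play a song A"},
    {"time": datetime(2021, 6, 7, 20, 10), "text": "play song B"},
    {"time": datetime(2021, 6, 7, 20, 25), "text": "recommend song B to the large-screen device"},
]
# All three rounds occurred after 19:30, so all three are transmitted.
print(rounds_after_target(history, datetime(2021, 6, 7, 20, 30), timedelta(hours=1)))
```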
  • the first electronic device 51 may transmit all historical conversations to the large-screen device. However, if there are many historical conversation rounds, large bandwidth may be required for transmission, which increases transmission delay and affects user experience.
• By contrast, when the first electronic device 51 transmits the intents of the N rounds of dialogue closest to the current time point to the second electronic device 52 according to the time each dialogue occurred, the second electronic device 52 can accurately identify the intention of a newly collected user voice based on the transmitted intents and other information, while the transmission delay is kept within a reasonable range, so the user experience is high.
  • the first electronic device 51 may transmit information such as the intent of all historical conversations to the second electronic device 52 .
• the first electronic device 51 may also transmit the intent of the latest round of dialogue, together with its associated intents, to the second electronic device 52.
  • the associated intent refers to the intent associated with the intent of the latest round of dialogue.
  • the latest round of dialogue is "recommend a song to a device with a large screen", and its intent is to recommend music.
  • Intents associated with recommending music include playing music, searching for music, and the like.
• relevant information corresponding to the associated intent may also be transmitted to the second electronic device 52.
• For example, if the associated intent is play music, the relevant information corresponding to the associated intent may include the song titles and singer information corresponding to the play music intent.
• In some embodiments, in addition to transmitting the execution instruction to the second electronic device 52, the first electronic device 51 also transmits information describing the user's voice intention to the second electronic device 52.
  • the above information used to describe the voice intention of the user may specifically be an intention.
  • the first electronic device 51 extracts the intention from the user's voice, and transmits the execution instruction and the intention to the second electronic device 52 together.
  • the first electronic device 51 may transmit the intent of the latest N rounds of conversations to the second electronic device 52 .
• In this way, after collecting a new user voice, the second semantic understanding module 523 can recognize the intention of the user's voice according to the intent sent by the first electronic device 51 and the text information of the user's voice output by the second voice recognition module.
  • the information used to describe the voice intention of the user may be corpus.
  • the first electronic device 51 does not transmit the intent extracted from the user's voice to the second electronic device 52 , but transmits the dialogue material to the second electronic device 52 .
  • the dialogue material refers to the text of the user's voice.
  • the voice recognition module converts the user's voice into text to obtain dialogue material.
  • the first electronic device 51 may transmit the dialogue material of the latest N rounds of dialogue to the second electronic device 52 .
• the dialogue material transmitted from the first electronic device 51 to the second electronic device 52 includes: the text of "recommend song B to the large-screen device", the text of "play song B", and the text of "Xiaoyi Xiaoyi, play a song A".
• In this way, after collecting a new user voice, the second semantic understanding module 523 can recognize the intention of the user's voice according to the corpus sent by the first electronic device 51 and the text information of the user's voice output by the second voice recognition module.
• For example, the dialogue material sent by the first electronic device 51 to the second electronic device 52 includes "recommend a song to the large-screen device", and the user voice newly collected by the second electronic device 52 is "change one".
• the second semantic understanding module 523 outputs the intention of the user voice "change one" according to the input dialogue material "recommend a song to the large-screen device" and the text of the user voice "change one".
• the intention of the user voice "change one" is to recommend music, and this intention includes a slot for changing the playlist.
• In other examples, the second semantic understanding module 523 of the second electronic device 52 may first extract the intent from the corpus sent by the first electronic device 51, and then recognize the intention of the newly collected user voice based on the extracted intent.
• the dialogue material sent by the first electronic device 51 to the second electronic device 52 is "recommend a song to the large-screen device", and the user voice newly collected by the second electronic device 52 is "change one".
• the second semantic understanding module 523 may first extract the intention "recommend music" from the dialogue material "recommend a song to the large-screen device". Then the second semantic understanding module 523 outputs the intention of the user voice "change one" according to the extracted intention "recommend music" and the text "change one" of the user voice.
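• The inheritance behavior described above can be illustrated with a toy matcher (purely a sketch; the keyword table is invented and real semantic understanding is far richer):

```python
# If the new utterance states an intent of its own, use it; otherwise the
# utterance inherits the intent carried over from the first device.

EXPLICIT_INTENTS = {"play": "play_music", "search": "search_music"}

def understand(utterance: str, transferred_intent: str) -> str:
    for keyword, intent in EXPLICIT_INTENTS.items():
        if keyword in utterance:
            return intent             # intent stated in the voice itself
    return transferred_intent         # e.g. "change one" inherits it

print(understand("change one", "recommend_music"))   # -> recommend_music (inherited)
print(understand("play song A", "recommend_music"))  # -> play_music (explicit)
```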
• the above information used to describe the user's voice intention may also include both the intent and the dialogue material; that is, when the first electronic device 51 transmits the execution instruction, it transmits the dialogue intents and dialogue material of the latest N rounds of dialogue to the second electronic device 52.
  • the second semantic understanding module 523 of the second electronic device 52 may select the intent or dialogue material sent by the first electronic device 51 to perform intent recognition as required.
• when the first electronic device 51 transmits the execution instruction, target information may also be transmitted to the second electronic device 52 together.
  • the target information can be set as needed.
  • the target information may include slots.
• the slots may include the slot corresponding to the intent.
  • the first electronic device 51 may extract the intention and the slot corresponding to the intention from the voice of the user.
  • the intent "recommend music” and the song name slot can be extracted from the user's voice "recommend song A to the large-screen device", and the entity of the song name slot is song A.
• The slots may also include the target device slot. That is, the first electronic device 51 may transmit not only the slot corresponding to the intent but also the target device slot to the second electronic device 52.
• For example, the user voice "recommend song A to the large-screen device" includes both the song name slot corresponding to the intent "recommend music" and the target device slot; the first electronic device 51 may transmit the song name slot and the target device slot to the second electronic device 52 together.
• If the first electronic device 51 can extract at least two target device slots from the user voice, then when transmitting the target device slots, the first electronic device 51 can transmit each target device slot to the corresponding second electronic device 52.
• For example, when the mobile phone extracts two target device slots from the user voice "recommend a song to the large-screen device and the tablet", the slot whose entity is the large-screen device is transmitted to the large-screen device, and the slot whose entity is the tablet is transmitted to the tablet.
• If the dialogue material is transmitted, the second electronic device 52 can directly extract the intent's corresponding slot and/or the target device slot from the dialogue material, so the first electronic device 51 may choose not to transmit the slots to the second electronic device 52.
  • the target information may include user information such as user profile information and real-time location information of the user.
  • the user profile information exemplarily includes information such as user gender, user preference information, and user occupation.
  • the first electronic device 51 may generate a user portrait according to the collected user information.
• Based on the user information transmitted by the first electronic device 51, the second electronic device 52 can determine personal information such as the user's preferences and occupation, and provide more personalized services accordingly. For example, if the first electronic device 51 and the second electronic device 52 are logged in with the same user account, the second electronic device 52 can recommend songs that match the user's preferences and occupation based on the transmitted user information.
  • the target information may also include scene information, which is used to describe the scene where the user is currently located.
• Through the scene information, the second electronic device 52 can learn about the scene where the user is currently located, for example, the user's current location, current activity, and so on.
  • the second electronic device 52 can provide the user with more personalized and accurate services according to the scene information, so as to realize a more open cross-device conversation service continuation.
• For example, the first electronic device 51 determines that the user is walking based on the acceleration data collected by its integrated acceleration sensor and transmits information indicating that the user is currently walking to the second electronic device 52. The second electronic device 52 can thus learn that the user is currently walking, that is, determine that the user is in a sports scene; then, when recommending songs to the user, it can recommend songs suited to the sports scene.
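• As a rough sketch of this scene inference (the threshold and field names are assumptions, not patent values):

```python
def infer_scene(accel_magnitudes: list, walk_threshold: float = 1.5) -> str:
    """Classify the user as walking when most acceleration samples exceed
    a threshold; a real implementation would use a proper activity model."""
    moving = sum(1 for a in accel_magnitudes if a > walk_threshold)
    return "sports" if moving > len(accel_magnitudes) / 2 else "stationary"

# The resulting tag travels with the target information, so the receiving
# device can prefer sports-scene songs when recommending.
target_info = {"scene": infer_scene([1.8, 2.1, 1.7, 0.9, 2.3])}  # -> "sports"
print(target_info)
```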
• the target information may also include application state information.
• Application state information refers to information related to a target application, where the target application is usually an application running in the foreground. For example, if the phone is playing music, the target application is the music application. The target application may, however, also be an application that is not running in the foreground.
  • the relevant information of the target application may be set according to the actual application scenario.
• For example, if the target application is a music application, the relevant information of the target application may include the user's play records, which include information such as song names and play times.
  • the first electronic device 51 transmits the song title and other information to the second electronic device 52, and the second electronic device 52 can recognize the song title in the user's voice according to the song title and other information after collecting the user's voice.
• For another example, if the target application includes a calendar application and a navigation application, the relevant information of the target application may include the user's schedule information and navigation history information.
  • the first electronic device 51 transmits information such as the user's schedule and navigation history to the second electronic device 52. After collecting the user's voice, the second electronic device 52 can recognize the location information in the user's voice based on the information.
  • the second electronic device 52 can more accurately identify the newly collected user voice according to the application state information, and realize a more open inter-device dialogue service connection.
  • the second electronic device 52 can also provide users with more personalized and precise services according to the application status information.
  • the application state information includes historical playing information of songs, and the second electronic device 52 may recommend songs that are more in line with user preferences to the user according to the historical playing information of songs.
  • the application state information transmitted by the first electronic device 51 to the second electronic device 52 may be information associated with the intent of the latest round of dialogue.
  • the intention of the latest round of dialogue is to recommend music
  • at least two music application programs are installed on the mobile phone
  • the application state information associated with the recommended music includes relevant information of the at least two music application programs.
  • the at least two music application programs may or may not be running at the same time.
  • the relevant information of the target application may also include the identification of the foreground application or the running application.
  • the second electronic device 52 judges whether there is the same application program locally according to the application program identifier, and if yes, uses the same application program to execute the first execution instruction; if not, uses a similar application program to execute the execution instruction.
  • both the mobile phone and the large-screen device include multiple music application programs for playing music.
  • the music application programs installed on the mobile phone include Huawei Music, application 1 and application 2.
  • the phone was playing a song using Huawei Music.
• the mobile phone collects the user voice "recommend a song to the large-screen device", generates the execution instruction for recommending music corresponding to the user voice, and transmits the execution instruction, the historical intentions, the corresponding slots, and the application status information to the large-screen device.
  • the application status information includes the application identification of Huawei Music.
• In addition, the application status information may include information about the three music applications (Huawei Music, application 1, and application 2), such as playback records, song titles, and the user's favorite songs.
• After the large-screen device receives the application status information, it first determines, according to the application identifier of Huawei Music, whether Huawei Music is installed locally. If it is, the voice assistant of the large-screen device uses the local Huawei Music to play the corresponding song in response to the execution instruction for recommending music. If Huawei Music is not installed locally, the large-screen device can use another music application to play the corresponding song.
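• The same-app-first rule can be sketched as follows (the identifiers are illustrative):

```python
def pick_player(sender_app_id: str, local_apps: set, similar_apps: list) -> str:
    """Prefer the exact app used on the sending device; otherwise fall back
    to a similar app of the same category installed locally."""
    if sender_app_id in local_apps:
        return sender_app_id          # same app: the most natural continuation
    for app in similar_apps:
        if app in local_apps:
            return app                # similar app of the same category
    raise RuntimeError("no suitable music application installed")

local = {"application 1", "huawei_music"}
print(pick_player("huawei_music", local, ["application 1", "application 2"]))
print(pick_player("application 3", local, ["application 1", "application 2"]))
```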
• In other words, the first electronic device 51 transmits the application state information to the second electronic device 52, and according to that information the second electronic device 52 can preferentially execute the first execution instruction in the same application as the target application, so that the service continuation is more natural and less abrupt, and the user experience is better.
  • target information may include any one or any combination of slots, user information, scene information, and application state information.
• As described above, when the first electronic device 51 transfers the service to the second electronic device 52, in order to allow the second electronic device 52 to accurately recognize the user's voice and realize cross-device dialogue service continuation, it may transmit the execution instruction and the information describing the user's voice intention to the second electronic device 52.
• Further, in order to give the second electronic device 52 higher intent-recognition accuracy and realize a more open cross-device dialogue service continuation, the first electronic device 51 can transmit the execution instruction, the information describing the user's voice intention, and the target information together to the second electronic device 52.
  • FIG. 5 exemplarily shows that the information sent by the first electronic device 51 to the second electronic device 52 includes execution instructions, historical intentions, corresponding slots, and the like.
• The second electronic device 52 receives information from the first electronic device 51 through the second command interaction service 525; the information exemplarily includes the execution instruction, the historical intentions, the corresponding slots, and the like. The second electronic device 52 then passes the execution instruction to the second application program 521, and the second application program 521 may respond to the execution instruction.
  • the second electronic device 52 also locally stores information such as the received historical intent and the corresponding slot.
• In some embodiments, after the second electronic device 52 receives the execution instruction from the first electronic device 51, it can automatically wake up the voice assistant of this device, so that the user does not need to wake up the voice assistant on the second electronic device through a wake-up word.
• In this way, the user can directly input the corresponding user voice to the second electronic device 52, making the cross-device dialogue service continuation smoother and the user experience better.
• In other embodiments, the second electronic device 52 may not automatically wake up the voice assistant of this device, but instead wake it up after collecting a specific wake-up word input by the user.
• In specific applications, the second semantic understanding module 523 can add the context information sent by the first electronic device 51 to its own context and use it as the latest context. In this way, when the second electronic device 52 collects a new user voice, it can recognize the intent of the newly collected user voice according to the latest context information.
  • the context information may include historical intentions, corresponding slots and dialogue materials, may include historical intentions and corresponding slots, and may also include historical intentions and dialogue materials.
  • the second electronic device 52 may also create a new session according to the received context information.
  • the new session includes information such as historical intentions sent by the first electronic device 51 , corresponding slots, and dialogue material.
  • the second semantic understanding module 523 can accurately identify the intent of the newly collected user voice according to the session creation time and using the information contained in the newly created session.
  • the second electronic device 52 may also set the priority of the received context information as the highest priority. In this way, when the second electronic device 52 collects a new user voice, the second semantic understanding module 523 can accurately identify the intent of the newly collected user voice according to the highest priority context information.
• Before executing the execution instruction sent by the first electronic device 51, the second electronic device 52 can first determine whether there is an ongoing task. If there is none, the second electronic device 52 can execute the instruction immediately. If a task is ongoing, it can wait for the current task to finish before executing the instruction, and it can further judge the remaining time of the current task: if the remaining time is less than a certain threshold, the instruction sent by the first electronic device 51 is executed after the current task completes; otherwise, the current task can be interrupted and the instruction executed. In this way, the dialogue service continuation is more timely and the user experience is better.
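• The arbitration just described reduces to three cases; a minimal sketch (the threshold value is an assumed example):

```python
from typing import Optional

def arbitrate(remaining_seconds: Optional[float], threshold: float = 30.0) -> str:
    if remaining_seconds is None:      # no ongoing task
        return "execute now"
    if remaining_seconds < threshold:  # current task is nearly finished
        return "queue until current task finishes"
    return "interrupt current task and execute"

print(arbitrate(None))    # execute now
print(arbitrate(12.0))    # queue briefly, then execute
print(arbitrate(240.0))   # interrupt the long-running task
```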
  • the dialogue service flows from the first electronic device 51 to the second electronic device 52 .
• After the second electronic device 52 collects the user's voice, it can accurately recognize the intention of the user's voice according to the historical intentions, the slots, and other information transmitted by the first electronic device 51.
• Specifically, the second application program 521 inputs the user's voice into the second voice recognition module 522 to obtain the text information output by that module; the text information output by the second voice recognition module 522 is then input into the second semantic understanding module 523, which extracts the intent and slot of the user's voice according to the text information and the historical intentions and slots transmitted by the first electronic device 51; finally, the semantic information output by the second semantic understanding module 523 is input into the second dialog management module 524 to obtain the execution instruction.
  • the second application program 521 responds to the execution instruction and obtains an execution result corresponding to the execution instruction.
• the second dialogue management module 524 may select the required information to generate the corresponding execution instruction as needed.
  • the second dialog management module 524 outputs an execution instruction for recommending songs.
  • the input of the second dialogue management module 524 may include semantic information and the like.
• After the music application program receives the execution instruction for recommending a song, it may determine that the recommended song is song A according to the user information and application status information from the first electronic device 51.
• In this case, the execution instruction output by the second dialogue management module 524 does not include the information of the recommended song; the recommended song is instead determined by the music application program.
  • the second dialogue management module 524 may output an execution instruction of recommending song A.
  • the input of the second dialog management module 524 may include semantic information and the name of the recommended song.
  • the title of the recommended song may be determined by the system according to user information and application status information from the first electronic device 51 .
• After receiving the execution instruction for recommending song A, the music application automatically recommends song A without having to perform a song-selection operation of its own.
  • the execution instruction output by the second dialog management module 524 includes information about recommended songs.
  • the execution command output by the first dialogue management module 514 on the side of the first electronic device 51 may include information about recommended songs, or may not include information about recommended songs.
• In other words, when the target information is also transmitted, the second electronic device 52 can provide the user with more personalized and precise services according to the target information, realizing a more open cross-device dialogue service continuation and improving the user experience.
• For example, the second electronic device 52 recommends songs better suited to the user's identity and current location according to the user's occupation information and real-time location information in the user information. Specifically, the second electronic device 52 determines from the occupation information that the user's occupation is a teacher, and determines from the real-time location information that the user's current location is a school. Based on the user's occupation and current location, the second electronic device 52 recommends children's songs to the user. In this case, the user accounts of the first electronic device 51 and the second electronic device 52 do not belong to the same user.
• For another example, the second electronic device 52 determines from the real-time location information that the user's current location is home, and based on that location recommends songs matching the user's preferences. In this case, the user accounts of the first electronic device 51 and the second electronic device 52 belong to the same user. It can be seen from the above that when the service needs to be transferred from the first electronic device 51 to the second electronic device 52, the first electronic device 51 transmits the execution instruction, the historical intentions, the slots, and other information to the second electronic device 52, so that the second electronic device 52 can recognize the intention of the user's voice according to that information and realize cross-device dialogue service continuation.
  • Fig. 6A and Fig. 6B are schematic diagrams of scenarios where a mobile phone recommends music to a large-screen device according to an embodiment of the present application
  • Fig. 7 is a schematic flow diagram of a mobile phone recommending music to a large-screen device according to an embodiment of the present application.
  • the first electronic device is a mobile phone 62
  • the second electronic device is a large-screen device 64 .
  • Voice assistants are installed on the mobile phone 62 and the large-screen device 64, and the voice service system corresponding to FIG. 1 is deployed.
  • the process may include the following steps:
• Step S701: the mobile phone 62 collects the first user voice from the user 61.
  • the first user's voice is specifically "recommend a song to the large-screen device".
• Specifically, the user 61 wakes up the voice assistant of the mobile phone 62 through a specific wake-up word and then says "recommend a song to the large-screen device" to it; the mobile phone 62 collects the voice data "recommend a song to the large-screen device" through a sound collection device such as a microphone.
• Step S702: the mobile phone 62 converts the first user voice into a first text.
  • the voice service system corresponding to FIG. 1 or some modules of the voice service system are deployed in the mobile phone 62 .
  • the mobile phone 62 includes a first speech recognition module 512 , a first semantic understanding module 513 and a first dialogue management module 514 .
• Specifically, the voice assistant of the mobile phone 62 inputs the first user voice into the first voice recognition module 512, and the first voice recognition module 512 converts the first user voice into the first text "recommend a song to the large-screen device".
• Step S703: the mobile phone 62 extracts the first intention and the first slot from the first text.
  • the first slot is a slot configured by the first intent.
• After the first speech recognition module 512 obtains the first text "recommend a song to the large-screen device", it inputs the first text into the first semantic understanding module 513.
  • the first semantic understanding module 513 performs semantic understanding on the first text, and outputs the first intent and the first slot.
  • the first intention is to recommend music
  • the first slot includes the target device slot.
  • the entity of the target device slot is the large-screen device 64 .
  • the first slot can also include other slots, for example, if the user voice contains the title of the song, then the first slot includes the slot of the song title.
  • the mobile phone 62 may display a voice assistant interface 63 after recognizing the intent of the first user's voice.
  • the voice assistant interface 63 displays the first text "Recommend a song to the large-screen device” and the answer text "OK" for the first user's voice.
• Based on the target device slot, the mobile phone 62 determines that it needs to send the execution instruction, the historical intentions, the corresponding slots, and other information to the large-screen device 64.
• Step S704: the mobile phone 62 generates a first execution instruction corresponding to the first user voice according to the first intention and the first slot.
  • the first semantic understanding module 513 inputs information such as the first intent and the first slot to the first dialog management module 514 .
  • the first dialog management module 514 outputs a first execution instruction according to information such as the first intention and the first slot.
  • the first execution instruction is an instruction for recommending music.
• Step S705: the mobile phone 62 sends the first execution instruction, the historical intentions, the corresponding slots, and other information to the large-screen device 64.
  • the historical intent includes the first intent
  • the corresponding slot refers to a slot corresponding to the historical intent, which includes the first slot.
  • the first slot may include a target device slot, or may not include a target device slot.
• In some embodiments, the historical intents also include the intents corresponding to other historical dialogues.
• For example, the historical intents include the intents of the last 3 rounds of dialogue, each round having its corresponding intent. At this point, only one round of dialogue has taken place between the mobile phone 62 and the user 61, so the historical intents may only include the first intent.
  • the corresponding slot includes the first slot corresponding to the first intent.
  • the mobile phone 62 can also transmit dialogue data, user information, scene information and application status information to the large-screen device 64 .
  • the information transmitted by the mobile phone 62 to the large-screen device 64 may be as shown in Table 2.
• In a specific example, the information transmitted by the mobile phone 62 to the large-screen device 64 may be as follows:
  • nluResult refers to the intent recognition result, specifically the intent recognition result on the side of the mobile phone 62 at this time, and the intent recognition result includes the intent and the slot.
  • intentNumber refers to the serial number of the intent
  • intentName refers to the name of the intent. In this case, the name of the intent is recommended music.
  • slots refers to slots. At this time, the name of the slot is the device name, and the specific parameter of the slot is device B, which is specifically the large-screen device 64 at this time.
  • orgAsrText refers to the text information output by the speech recognition module, specifically the speech recognition result on the mobile phone 62 side, and the text information is specifically "recommend a song to the large-screen device".
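• Assembled from the field descriptions above, the transmitted structure might look like the following (the exact nesting and encoding are assumptions, not the patent's literal wire format):

```python
import json

payload = {
    "nluResult": {                        # intent recognition result on the phone side
        "intentNumber": 1,                # serial number of the intent
        "intentName": "recommend music",  # name of the recognized intent
        "slots": [
            {"name": "device name", "value": "device B"},  # the large-screen device 64
        ],
    },
    "orgAsrText": "recommend a song to the large-screen device",  # ASR output text
}
print(json.dumps(payload, indent=2))
```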
  • the information sent by the mobile phone 62 to the large-screen device 64 may include the first execution instruction, historical intentions, slots corresponding to historical intentions, dialogue material information, user information, scene information, and application status information.
  • the dialogue corpus includes the corpus "recommend a song to a large-screen device".
• In this example, the application running in the foreground of the mobile phone 62 is a music application, so the application state information may include information such as the user's playback records; the user information may include information such as the user portrait and the user's real-time location; and the scene information may include information indicating that the user is walking.
• Step S706: the large-screen device 64 executes the first execution instruction.
  • the large-screen device 64 transmits the first execution instruction to the voice assistant of the large-screen device 64, and the voice assistant of the large-screen device 64 obtains a corresponding execution result.
• the large-screen device 64 executes the first execution instruction and displays a window 65 on the interface.
  • the window 65 displays prompt information for prompting the user whether to play the song A recommended by the mobile phone.
  • two option buttons of "play” and "cancel” are also displayed on the window 65.
• Step S707: the large-screen device 64 collects the second user voice from the user 61.
  • the user 61 inputs a second user voice into the large-screen device 64 .
• the second user voice is specifically "change one".
• the large-screen device 64 can automatically wake up the voice assistant of this device after receiving the first execution instruction, so that the user can directly input the second user voice to the large-screen device without waking up its voice assistant through a specific wake-up word.
• Step S708: the large-screen device 64 converts the second user voice into a second text.
  • the voice service system corresponding to FIG. 1 or some modules of the voice service system are deployed in the large-screen device 64 .
  • the large-screen device 64 includes a second voice recognition module 522 , a second semantic understanding module 523 and a second dialogue management module 524 .
• Specifically, the voice assistant of the large-screen device 64 inputs the second user voice into the second voice recognition module 522, and the second voice recognition module 522 converts the second user voice into the second text "change one".
• Step S709: the large-screen device 64 extracts the second intent and the second slot from the second text according to the historical intents, the corresponding slots, and other information.
  • the second speech recognition module 522 obtains the second text "change one"
• the second semantic understanding module 523 performs semantic recognition according to the second text "change one", the historical intent "recommend music", and other information, and obtains the second intent and the second slot.
  • the second intention is to recommend music
• the second slot may include a device slot and a slot indicating a song change.
  • the entity of the device slot is the large-screen device 64 .
• Since the user voice "change one" does not specify a target device, the second slot may also not include a device slot; in this case, the second execution instruction is executed on this device by default.
• In other words, the large-screen device 64 can recognize the intention of "change one" as "recommend music" according to the historical intent "recommend music" from the mobile phone 62; that is, the intention of "change one" inherits the historical intent "recommend music".
• In some cases, the large-screen device 64 does not need to rely on the historical intents from the mobile phone to recognize the intention of the second user voice.
• For example, if the second user voice is "play song A", the intention "play music" is specified explicitly in the user's voice, so the large-screen device 64 can recognize that "play song A" means "play music" without relying on the historical intent "recommend music" from the mobile phone 62.
• If the large-screen device 64 cannot recognize the intention of the user's voice, it can play an interactive voice such as "I can't understand what you mean, please give me more time to learn your habits."
• In other words, when the large-screen device 64 performs semantic understanding on the second text, it can select the corresponding information, such as the historical intent and the corresponding slot, as needed.
  • the mobile phone 62 also transmits one or more of user information, application state information, dialogue material and scene information to the large-screen device 64 .
  • the large-screen device 64 can also perform semantic understanding based on these information. For example, the large-screen device 64 can perform semantic understanding on the second text according to historical intentions, dialogue data, corresponding slots, and song title information in application state information.
• Step S710: the large-screen device 64 generates a second execution instruction corresponding to the second user voice according to the second intention and the second slot.
  • the second semantic understanding module 523 inputs the second intent and the second slot to the second dialogue management module 524 .
  • the second dialog management module 524 generates a second execution instruction according to the second intent and the second slot.
  • the second execution instruction is an instruction to recommend music.
• Step S711: the large-screen device 64 executes the second execution instruction.
• After the voice assistant of the large-screen device 64 acquires the second execution instruction output by the second dialog management module 524, it responds to the instruction to obtain the corresponding execution result.
• the large-screen device 64 executes the second execution instruction and displays a window 66 containing prompt information that asks the user whether to play song B.
  • Song B is a recommended song determined by the large-screen device 64 according to the recommendation rules.
• the large-screen device may also prompt the user by voice that song B is about to be played, before playing it.
  • the voice prompt information is "OK, song B will be played for you soon".
• Alternatively, if the large-screen device determines that song B needs to be played, it can directly play song B without prompting the user.
• By providing prompt information (for example, a prompt window or a prompt voice) when executing the second execution instruction, the embodiment of the present application makes the cross-device dialogue service continuation less abrupt and more natural, giving a better user experience.
• As noted above, in addition to the historical intents and corresponding slots, the information transmitted by the mobile phone 62 to the large-screen device 64 may include other information. Based on this information, the large-screen device 64 can recognize the intention of the user's voice more accurately, achieving a more open cross-device dialogue continuation, and can also provide more personalized and precise services to improve the user experience.
• For example, suppose the user 61 does not like song B recommended by the large-screen device 64 and then inputs the third user voice "change xxxx" to the large-screen device 64, where "xxxx" is a song title.
• After the large-screen device 64 receives the third user voice, it can recognize that "xxxx" is a song title according to the historical playback records of the music application sent by the mobile phone 62, and thus accurately recognize that the intention of the third user voice is to recommend music.
  • Historical playback records include information such as song titles.
• In particular, the song that the user 61 wants to play may be a newly released song about which the large-screen device 64 has no information yet. If the historical playback records of the music application on the mobile phone 62 were not passed to the large-screen device 64, the large-screen device 64 might not recognize that "xxxx" is a song title and thus could not recognize the intention of the third user voice.
  • the second electronic device may provide services to the user according to the user information and application status information transmitted by the first electronic device.
  • the mobile phone 62 and the large-screen device 64 log in to the same Huawei user account.
  • the information transmitted from the mobile phone 62 to the large-screen device 64 includes the first execution instruction, historical intent, corresponding slot, user information, scene information and application status information.
  • the user information includes user portrait information and user real-time location information.
  • the application status information includes information about the music application on the mobile phone 62 .
  • the scene information includes information characterizing the walking of the user.
• After the large-screen device 64 receives the user voice "change one", it recognizes the intention of that voice as "recommend music" according to the historical intents and other information sent by the mobile phone 62, and generates a second execution instruction for recommending music based on the "recommend music" intention, the user information, the scene information, the application state information, and so on.
  • the large-screen device 64 determines that the recommended song is song B according to the user information and application status information sent by the mobile phone 62, and generates a second execution instruction for recommending song B.
• According to the user portrait information in the user information, the large-screen device 64 can learn that the user of the mobile phone 62 and the large-screen device 64 is a teacher and that the user's preferred song type is pop music; according to the real-time location information in the user information, it can learn that the user's current location is home; in addition, according to the scene information, it can determine that the user is currently walking, that is, the user is in a sports scene. Since the user is in a sports scene, songs for the sports scene are recommended to the user; and since the user is at home, songs matching the user's preferences are recommended.
• Specifically, according to the user's historical playback records in the application status information, the large-screen device 64 determines a set of candidate songs that the user has played more than a preset threshold number of times in the last 7 days. Finally, the large-screen device 64 selects from the candidate set one or more pop songs suitable for sports scenes as the recommended songs. In this example, song B is determined as the recommended song.
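• Those selection steps might look like the following sketch (the record fields, the 7-day window, and the play-count threshold are illustrative):

```python
from datetime import datetime, timedelta

def recommend(play_records: list, now: datetime, min_plays: int = 3) -> list:
    """Keep songs played at least min_plays times in the last 7 days,
    then prefer those tagged for sports scenes."""
    week_ago = now - timedelta(days=7)
    counts = {}
    tags = {}
    for r in play_records:
        tags[r["song"]] = r.get("tags", set())
        if r["time"] > week_ago:
            counts[r["song"]] = counts.get(r["song"], 0) + 1
    candidates = [s for s, c in counts.items() if c >= min_plays]
    return [s for s in candidates if "sports" in tags[s]]

records = [
    {"song": "song B", "time": datetime(2021, 6, 6, 9), "tags": {"sports", "pop"}},
    {"song": "song B", "time": datetime(2021, 6, 6, 18), "tags": {"sports", "pop"}},
    {"song": "song B", "time": datetime(2021, 6, 7, 8), "tags": {"sports", "pop"}},
    {"song": "song C", "time": datetime(2021, 5, 1, 8), "tags": {"ballad"}},
]
print(recommend(records, datetime(2021, 6, 7, 20)))  # -> ['song B']
```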
  • the first electronic device may not send user information and scene information to the second electronic device.
• That is, when the first electronic device distributes the first execution instruction, the historical intents, the corresponding slots, and other information to the second electronic device, if it determines that the second electronic device does not belong to the same user as the first electronic device, it does not send personal information such as the locally recorded user information and scene information to the second electronic device.
  • the second electronic device may provide services to the user according to the user information and scene information recorded by the device.
  • the mobile phone 62 and the large-screen device 64 do not log in with the same Huawei user account.
• For example, the account logged in on the large-screen device 64 is the account of user B.
  • the information transmitted from the mobile phone 62 to the large-screen device 64 includes the first execution instruction, historical intentions, and corresponding slots.
• After the large-screen device 64 receives the user voice "change one", it recognizes the intention of that voice as "recommend music" according to the historical intents sent by the mobile phone 62, and generates a second execution instruction for recommending music based on the "recommend music" intention together with the application status information and user information of this device.
• In response to the second execution instruction, the large-screen device 64 determines that the recommended song is song E according to the user information and scene information of this device.
• Specifically, the large-screen device 64 determines from the positioning information in this device's scene information that its current location is a school, and determines from this device's user information that its users are teachers or students. Based on this, when recommending songs, the large-screen device 64 recommends songs that better match the students' identity and preferences, for example, children's songs.
• In other examples, even when the first electronic device 51 determines that the user account logged in on the second electronic device does not belong to the same user as this device, it may still transmit the application state information and user information of this device to the second electronic device.
• In that case, when the second electronic device 52 provides a service to the user, it may use only part of the application state information or user information from the first electronic device 51 as the basis for providing the service.
• In other words, the mobile phone 62 transmits the application information and user information to the large-screen device 64 together, so that the large-screen device 64 can recommend the most suitable song based on the information sent by the mobile phone 62 (that is, recommend song B when the user is at home, and recommend song E when the user is at school), thereby providing more personalized and precise services and improving the user experience.
• In the music recommendation scenarios above, the mobile phone 62 synchronizes the historical intents, the corresponding slots, and other information to the large-screen device 64, and the large-screen device 64 accurately recognizes the intention of newly collected user voices according to the synchronized information, realizing cross-device dialogue service continuation.
• In addition, when the collected user voice is ambiguous, the mobile phone 62 can conduct one or more rounds of dialogue with the user 61 to collect the required information.
• For example, suppose there are two large-screen devices: one is the large-screen device in the living room, and the other is the large-screen device in the bedroom.
  • the mobile phone 62 collects the first user's voice "recommend a song to the large-screen device"
• The mobile phone 62 cannot be sure which large-screen device the user means, so it can output the voice "Recommend to the large-screen device in the living room, or the large-screen device in the bedroom?" The user 61 may input a corresponding voice to the mobile phone 62 in response to this output.
• For example, the user 61 inputs the voice "the large-screen device in the living room" to the mobile phone 62 in response to the mobile phone's question "Recommend to the large-screen device in the living room, or the large-screen device in the bedroom?"
• In this way, the mobile phone 62 has determined that the user wants to recommend the song to the large-screen device in the living room.
  • In addition to outputting the inquiry voice, the mobile phone 62 can also display text prompt information on the interface.
  • For example, the text prompt information includes the two options "large-screen device in the living room" and "large-screen device in the bedroom", and the user 61 can select one of them as required; a small illustrative sketch of such disambiguation follows below.
  • The cross-device dialogue service continuation solution provided by the embodiments of the present application can also be applied to scenarios other than the music recommendation scenario shown above.
  • For example, FIG. 8A to FIG. 8C show schematic diagrams of navigation scenarios provided by an embodiment of the present application.
  • In this scenario, the user navigates through the mobile phone 81 and plays music through the car machine 82.
  • the mobile phone 81 displays a navigation page 811
  • the car machine 82 displays a music playing interface 821.
  • the first electronic device is specifically the mobile phone 81
  • the second electronic device is the car machine 82.
  • Both the mobile phone 81 and the car machine 82 are deployed with the voice service system corresponding to FIG. 1 and are equipped with a voice assistant.
  • the user can wake up the voice assistant of the mobile phone 81 through the wake-up word "Xiaoyi Xiaoyi", and then input the user's voice "navigate to location A" into the mobile phone 81 .
  • The mobile phone 81 determines that the intention of the user voice is navigation and that the entity of the target location slot is location A, and generates a corresponding execution instruction; in response to the execution instruction, the mobile phone 81 opens the navigation application, plans a route from the user's current location to location A, and displays the navigation interface 811.
  • the user wants to connect the navigation task from the mobile phone 81 to the car machine 82, that is, use the car machine 82 for navigation.
  • the user can input the user voice "connect the current navigation task to the car machine" into the mobile phone 81.
  • The mobile phone 81 determines that the intention of the user voice "connect the current navigation task to the car machine" is to continue the navigation task, determines that the entity of the target device slot is the car machine, and generates a corresponding execution instruction.
  • Since the target device slot can be extracted from the user voice "connect the current navigation task to the car machine", and the entity of the target device slot is the car machine, it can be confirmed that the service needs to be transferred.
  • The mobile phone 81 transmits the execution instruction corresponding to the user voice "connect the current navigation task to the car machine", together with information such as the historical intentions and the corresponding slots, to the car machine 82. The historical intentions include the intention of the user voice "navigate to location A" and the intention of the user voice "connect the current navigation task to the car machine".
  • The corresponding slots of the historical intentions include the slots extracted from the user voice "navigate to location A" and the slots extracted from the user voice "connect the current navigation task to the car machine".
  • the execution instruction includes navigation route information.
  • the mobile phone 81 can also transmit the dialogue corpus, as well as the navigation application status information and the schedule application status information, to the car machine 82.
  • The dialogue corpus includes the corpus "navigate to location A" and the corpus "connect the current navigation task to the car machine".
  • the navigation application status information includes the user's historical navigation target location and historical navigation route, etc.
  • the schedule application status information includes the user's schedule item information.
  • After the car machine 82 receives the information such as the execution instruction, the historical intentions, and the corresponding slots from the mobile phone 81, it displays, in response to the execution instruction, a navigation interface 822 as shown in FIG. 8B, stores the information such as the historical intentions and the corresponding slots locally, and uses the information such as the historical intentions, the corresponding slots, and the dialogue corpus as the latest context of the local semantic understanding module.
  • After the mobile phone 81 has continued the navigation task to the car machine 82, it can exit the navigation interface and display the main interface 812 as shown in FIG. 8B, or it can be in a screen-off state.
  • the user can input the user's voice "re-planning" to the car machine 82 .
  • After the car machine 82 collects the user voice "re-planning", it first converts the voice into text through the speech recognition module, and then inputs the text into the semantic understanding module.
  • According to the text "re-planning" and the latest context, the semantic understanding module determines that the intention of the user voice "re-planning" is to plan a navigation route from the current location to location A, where the entity of the start slot is the current location and the entity of the destination slot is location A.
  • the semantic understanding module inputs information such as the identified intent and slot to the dialog management module, and the dialog management module outputs execution instructions according to the information such as the intent and slot.
  • In response to the execution instruction of the user voice "re-planning", the car machine 82 re-plans a navigation route from the current location to location A, and displays the navigation interface 823 shown in FIG. 8C.
  • In this way, the user can control, by voice, the mobile phone 81 to transfer the navigation task to the car machine 82, and when transferring the service, the mobile phone 81 transmits information such as the intentions, corpus, and slots of the most recent rounds of conversations to the car machine 82 together with the execution instruction.
  • As a result, the car machine 82 can accurately identify the intention of the user voice according to the information such as the intentions, corpus, and slots of the most recent rounds of dialogue, realizing cross-device dialogue service continuation between the mobile phone 81 and the car machine 82; a sketch of the key inheritance step follows below.
  • FIG. 8D shows a schematic diagram of a video recommendation scene provided by the embodiment of the present application.
  • the mobile phone 84 displays a video playback interface 841 , and the video playback interface 841 shows that video 1 is currently being played.
  • The user 83 wants to recommend video 1 to the large-screen device 85, that is, to continue the video playback task to the large-screen device 85 and use the large-screen device 85 to play video 1, so the user says "recommend video 1 to the large-screen device" to the mobile phone 84.
  • After the mobile phone 84 collects the user voice "recommend video 1 to the large-screen device", it processes this user voice based on the voice service system corresponding to FIG. 1, determines the intention and slots of the user voice "recommend video 1 to the large-screen device", and generates a corresponding execution instruction.
  • the intention of the user voice "recommend video 1 to the large-screen device” is to recommend a video
  • the entity of the target device slot is the large-screen device 85, and
  • the entity of the video name slot is video 1.
  • the corresponding execution instruction is an instruction to play a video.
  • the mobile phone 84 determines that the service needs to be transferred to the large-screen device 85 .
  • After the mobile phone 84 determines that the video playback task needs to be transferred to the large-screen device 85, it sends, to the large-screen device 85, the execution instruction corresponding to the user voice "recommend video 1 to the large-screen device", together with information such as the corpus, intention, and slots of this user voice.
  • the large-screen device 85 displays an interface 851, and the interface 851 shows that the large-screen device 85 is playing another video.
  • the large-screen device 85 executes the execution command corresponding to the user's voice "recommend video 1 to the large-screen device", and a window 852 pops up on the interface 851 .
  • a prompt message is displayed on the window 852 for asking the user whether to play the video 1 from the mobile phone.
  • the user can select the "play" option on the window 852 to allow the large-screen device 85 to play the video 1, or select the "cancel” option on the window 852 to allow the large-screen device 85 not to play the video 1.
  • the large-screen device 85 also uses the received user voice "recommend video 1 to the large-screen device" as the latest context of the local semantic understanding module.
  • After the large-screen device 85 collects the user voice "don't care about it", it converts the user voice into text through the speech recognition module; then, through the semantic understanding module, according to the text "don't care about it" and the latest context, that is, the user voice "recommend video 1 to the large-screen device" and its slot information, it determines that the intention of the user voice "don't care about it" is to cancel the playback and that the entity of the device slot is the large-screen device, and generates the corresponding execution instruction through the dialogue management module.
  • After the large-screen device 85 acquires the execution instruction of the user voice "don't care about it", in response to the execution instruction, it does not play video 1 and removes the window 852 from the interface 851.
  • In this way, the user can transfer the video playback task from the mobile phone to the large-screen device through the voice "recommend video 1 to the large-screen device". When transferring the service, the mobile phone transmits information such as the intentions and slots of the most recent rounds of conversations to the large-screen device along with the execution instruction, so that the large-screen device can accurately recognize the intention of the user voice according to the historical intentions and slots sent by the mobile phone, realizing cross-device dialogue service continuation between the mobile phone and the large-screen device.
  • In other embodiments, the first electronic device 41 and the second electronic device 42 may not include the voice service system corresponding to FIG. 1, or may include only some modules of the voice service system, and the voice service system is instead deployed on a device other than the first electronic device 41 and the second electronic device 42.
  • In this case, the first electronic device 41 and the second electronic device 42 are usually thin devices; that is, because of very limited processing and memory resources, the first electronic device 41 and the second electronic device 42 cannot deploy the speech recognition engine, the semantic understanding engine, and the dialogue management engine of the voice service system.
  • the first electronic device 41 and the second electronic device 42 can also be rich devices.
  • That is, although the first electronic device 41 and the second electronic device 42 may have the conditions for deploying the voice service system, the voice service system is in fact deployed on devices other than the first electronic device 41 and the second electronic device 42.
  • As shown in FIG. 9, the second electronic device 92 includes a second application program 921 and a second instruction interaction service 922,
  • the third electronic device 93 includes a first speech recognition module 931, a first semantic understanding module 932 and a first dialogue management module 933
  • the fourth electronic device 94 includes a second speech recognition module 941, a second semantic understanding module 942, and a second dialogue management module 943.
  • the first electronic device 91 is connected to the third electronic device 93 in communication, and the second electronic device 92 is connected to the fourth electronic device 94 in communication.
  • the third electronic device 93 and the fourth electronic device 94 may be cloud servers, which may further include an NLG module and a TTS module.
  • the third electronic device 93 and the fourth electronic device 94 may also be terminal devices such as mobile phones and computers.
  • For the similarities between FIG. 5 and FIG. 9, reference may be made to the introduction of FIG. 5 above, which is not repeated here. The system flow in FIG. 9 is described below in conjunction with FIG. 10.
  • The first electronic device 91 locally stores user information and/or application status information, and when sending information such as the first execution instruction, the historical intentions, and the corresponding slots to the second electronic device 92, it also sends the user information and/or the application status information to the second electronic device 92.
  • For relevant descriptions of the user information, the application status information, and the like, refer to the corresponding content above; details are not repeated here. A sketch of this thin-device architecture follows below.
  • the method may include the following steps:
  • Step S1001 the first electronic device 91 collects the voice of the first user.
  • For example, FIG. 11 shows a schematic diagram of a scenario, provided by an embodiment of the present application, in which earphones transfer music to a speaker for playback.
  • As shown in FIG. 11, the user 111 may input the first user voice "transfer the music to the speaker for playback" into the smart earphone 112.
  • For example, the user 111 uses the smart earphone 112 to play music on the way home, and the music played is stored locally in the smart earphone 112. After returning home, the user 111 wants to transfer the music being played by the smart earphone 112 to the smart speaker 113 for playback, so the user says "transfer the music to the speaker for playback" to the smart earphone 112.
  • the smart earphone 112 collects the voice of the first user "transfer the music to the speaker for playback" through the sound collection device.
  • the first electronic device 91 is a smart earphone 112
  • the second electronic device 92 is a smart speaker 113
  • the third electronic device 93 is a cloud server 114, and
  • the fourth electronic device 94 is a cloud server 115.
  • the smart earphone 112 includes a processor, a memory, etc., stores multiple songs locally, can be connected to wireless Wi-Fi, and is installed with a voice assistant application.
  • the smart speaker 113 includes a processor, a memory, etc., can be connected to wireless Wi-Fi, and is installed with a voice assistant application program.
  • Step S1002 the first electronic device 91 sends the first user voice to the third electronic device 93.
  • For example, the smart earphone 112 automatically connects to the home wireless router after the user returns home, and after the smart earphone 112 collects the first user voice "transfer the music to the speaker for playback",
  • the voice assistant in the smart earphone 112 uploads the first user voice "transfer the music to the speaker for playback" to the cloud server 114 through wireless Wi-Fi.
  • Step S1003 the third electronic device 93 converts the first user voice into a first text.
  • After the third electronic device 93 receives the first user voice, it inputs the first user voice into the first speech recognition module 931, and the first speech recognition module 931 converts the first user voice into the first text.
  • Step S1004 the third electronic device 93 extracts the first intention and the first slot from the first text.
  • The first speech recognition module 931 inputs the obtained first text into the first semantic understanding module 932.
  • The first semantic understanding module 932 performs intention recognition on the first text to obtain the first intention and the first slot.
  • the first text is "transfer the music to the speaker for playback",
  • the first slot includes the target device slot
  • the entity of the target device slot is a speaker
  • Step S1005 the third electronic device 93 generates, according to the first intention and the first slot, a first execution instruction corresponding to the first user voice.
  • The first semantic understanding module 932 inputs the first intention and the first slot into the first dialogue management module 933.
  • The first dialogue management module 933 generates the first execution instruction according to the first intention and the first slot.
  • For example, the first intention is music transfer and playback,
  • the first slot includes the target device slot
  • the first execution instruction is an execution instruction to play music.
  • Step S1006 the third electronic device 93 sends the first execution instruction, the first intention, and the first slot to the first electronic device 91.
  • The third electronic device 93 may also transmit the first text to the first electronic device 91.
  • the cloud server 114 may transmit the text of the first user's voice "transfer the music to the speaker for playback" to the smart earphone 112 .
  • Step S1007 the first electronic device 91 sends, to the second electronic device 92, information such as the first execution instruction, the historical intentions, and the corresponding slots, where the historical intentions include the first intention and the corresponding slots include the first slot.
  • the historical intentions may include the intentions of the latest N rounds of conversations.
  • Information such as user information and application status information may also be transmitted to the second electronic device.
  • Relevant information such as the intentions, slots, and corpus of the latest N rounds of conversations can be stored locally in the first electronic device 91, or can be stored in the third electronic device 93.
  • If it is stored in the third electronic device 93, the third electronic device 93 transmits the relevant information such as the intentions, slots, and corpus of the latest N rounds of conversations to the first electronic device 91 before the first electronic device 91 sends the information such as the execution instruction, intentions, and slots to the second electronic device 92.
  • For example, when sending the first execution instruction to the smart speaker 113, the smart earphone 112 can transmit the information about the currently playing song, the first execution instruction, and information such as the historical intentions and the corresponding slots to the smart speaker 113.
  • the relevant information of the currently playing song may include song title information, singer information, and playing progress information.
  • The connection between the smart earphone 112 and the smart speaker 113 may be a Bluetooth connection or a Wi-Fi point-to-point connection, or the smart earphone 112 and the smart speaker 113 may be connected to the same wireless router; a sketch of choosing among these transports follows below.
  • Step S1008 the second electronic device 92 executes the first execution instruction.
  • The smart speaker 113 can learn the name of the song and the playback progress based on the information about the currently playing song.
  • For example, the name of the song is song A;
  • in response to the first execution instruction, the smart speaker 113 outputs the prompt voice "Song A from the earphone, do you want to play it?" to the user 111. If the user 111 does not want to play song A, the user says "change one" to the smart speaker 113.
  • When responding to the first execution instruction, the smart speaker 113 can also directly play the corresponding song without outputting the prompt voice. In this case, after the smart speaker 113 plays song A, the user 111 can say "change one" to the smart speaker 113 to play another song.
  • Step S1009 the second electronic device 92 collects the voice of the second user.
  • For example, the second user voice is "change one".
  • Step S1010 the second electronic device 92 sends information such as the second user voice, the historical intentions, and the corresponding slots to the fourth electronic device 94.
  • Step S1011 the fourth electronic device 94 converts the second user voice into a second text.
  • The fourth electronic device 94 may input the second user voice into the second speech recognition module 941.
  • The second speech recognition module 941 converts the second user voice into the second text.
  • the cloud server 115 converts the second user's voice "change one" into the second text "change one".
  • Step S1012 the fourth electronic device 94 extracts the second intention and the second slot from the second text according to information such as the historical intentions and the corresponding slots.
  • The second speech recognition module 941 inputs the second text into the second semantic understanding module 942.
  • The second semantic understanding module 942 determines the second intention and the second slot of the second user voice according to information such as the second text, the target intention, and the first slot.
  • the target intent is music transfer and playback
  • the entity in the slot of the target device is the smart speaker 113
  • the second text is "change one".
  • Based on this, the semantic understanding module on the cloud server 115 can determine that the second intention is to play music, that the second slot includes a device slot, and that the entity of the device slot is the smart speaker 113.
  • Step S1013 the fourth electronic device 94 generates, according to the second intention and the second slot, a second execution instruction corresponding to the second user voice.
  • The second semantic understanding module 942 inputs the obtained information such as the second intention and the second slot into the second dialogue management module 943, and the second dialogue management module 943 outputs the second execution instruction.
  • Step S1014 the fourth electronic device 94 sends the second execution instruction to the second electronic device 92.
  • Step S1015 the second electronic device 92 executes the second execution instruction.
  • For example, after receiving the second execution instruction sent by the cloud server 115, the smart speaker 113 automatically plays song B in response to the second execution instruction.
  • In other embodiments, one of the first electronic device 41 and the second electronic device 42 does not include the voice service system corresponding to FIG. 1 or includes only some modules of the voice service system, while the other is deployed with the corresponding voice service system or some modules of the voice service system.
  • Exemplarily, FIG. 12 shows still another schematic diagram of a cross-device dialogue service continuation system provided by an embodiment of the present application.
  • the first electronic device 121 includes a first application program 1211 , a first speech recognition module 1212 , a first semantic understanding module 1213 , a first dialogue management module 1214 and a first instruction interaction service 1215 .
  • the second electronic device 122 includes a second application program 1221 and a second instruction interaction service 1222 .
  • the third electronic device 123 includes a second voice recognition module 1231 , a second semantic understanding module 1232 and a second dialogue management module 1233 .
  • The first electronic device 121 locally stores user information and/or application status information, and when sending information such as the first execution instruction, the historical intentions, and the corresponding slots to the second electronic device 122, it also sends the user information and/or the application status information to the second electronic device 122.
  • For relevant descriptions of the user information, the application status information, and the like, refer to the corresponding content above; details are not repeated here.
  • the method may include the following steps:
  • Step S1301 the first electronic device 121 collects the voice of the first user.
  • Step S1302 the first electronic device 121 converts the voice of the first user into a first text.
  • the first application program 1211 inputs the first user voice to the first voice recognition module 1212 .
  • the first voice recognition module 1212 converts the first user's voice into a first text.
  • the first electronic device 121 is a mobile phone 142
  • the second electronic device 122 is a smart speaker 143, and
  • the third electronic device 123 is a cloud server 144 .
  • FIG. 14 is a schematic diagram of a scene where a mobile phone recommends music to a smart speaker according to an embodiment of the present application.
  • the mobile phone 142 displays a playback interface 1421 , and the playback interface 1421 shows that the mobile phone 142 is playing song C.
  • For example, the user 141 says "recommend a song to the speaker" to the mobile phone 142;
  • the mobile phone 142 collects the first user voice "recommend a song to the speaker", and
  • the first user voice "recommend a song to the speaker" is converted into the first text "recommend a song to the speaker".
  • Step S1303 the first electronic device 121 extracts the first intent and the first slot from the first text.
  • After the first speech recognition module 1212 converts the first user voice into the first text,
  • the first text is input into the first semantic understanding module 1213, so that the first semantic understanding module 1213 extracts the first intention and the first slot.
  • the first text is "recommend a song to the speaker”
  • the first intent is to recommend a song
  • the first slot includes the slot of the target device
  • the entity of the target device slot is the speaker.
  • Step S1304 the first electronic device 121 generates, according to the first intention and the first slot, a first execution instruction corresponding to the first user voice.
  • After the first semantic understanding module 1213 obtains the first intention and the first slot, it inputs the first intention and the first slot into the first dialogue management module 1214, and obtains the first execution instruction output by the first dialogue management module 1214.
  • the first execution instruction is an instruction to play a song.
  • For example, after the mobile phone 142 collects the first user voice "recommend a song to the speaker", it processes the first user voice based on the voice service system and displays an interface 1422, on which the text of the first user voice is displayed, and the mobile phone 142 responds to the first user voice with the text "OK".
  • Step S1305 the first electronic device 121 sends information such as the first execution instruction, the historical intent and the corresponding slot to the second electronic device 122, the historical intent includes the first intent, and the corresponding slot includes the first slot.
  • Step S1306 the second electronic device 122 executes the first execution instruction.
  • the smart speaker 143 responds to the first execution instruction, and sends a prompt voice "Song A from the mobile phone, do you want to play it?" to the user 141 to ask the user whether to play it.
  • the smart speaker 143 can also automatically play the song A in response to the first execution instruction without asking the user whether to play it.
  • Step S1307 the second electronic device 122 collects the voice of the second user.
  • While the smart speaker 143 is outputting the prompt voice "Song A from the mobile phone, do you want to play it?", the user 141 says "change one" to the smart speaker 143, and the smart speaker 143 collects the second user voice "change one".
  • Step S1308 the second electronic device 122 sends information such as the second user's voice, historical intentions, and corresponding slots to the third electronic device 123 .
  • Step S1309 the third electronic device 123 converts the voice of the second user into a second text.
  • the second application program 1221 of the second electronic device 122 transmits information such as the second user's voice, historical intentions, and corresponding slots to the third electronic device 123 .
  • the third electronic device 123 inputs the second user's voice to the second voice recognition module 1231 to obtain a second text output by the second voice recognition module 1231 .
  • Step S1310 the third electronic device 123 extracts the second intent and the second slot from the second text according to information such as the historical intent and the corresponding slot.
  • the third electronic device 123 inputs the second text to the second semantic understanding module 1232 .
  • the second semantic understanding module 1232 outputs the second intent and the second slot according to the second text, the historical intent, and information about the slot.
  • the second text is "change one",
  • the historical intent includes recommended songs
  • the corresponding slot includes the target device slot
  • the entity of the target device slot is a smart speaker 143
  • the second intent is to play a song
  • the second slot includes a device slot
  • the entity of the device slot is a smart speaker 143 .
  • Step S1311 the third electronic device 123 generates a second execution instruction corresponding to the second user's voice according to the second intention and the second slot.
  • the second semantic understanding module 1232 inputs the second intention and the second slot to the second dialog management module 1233 , and obtains a second execution instruction output by the second dialog management module 1233 .
  • Step S1312 the third electronic device 123 sends a second execution instruction to the second electronic device 122 .
  • Step S1313 the second electronic device 122 executes the second execution instruction.
  • For example, after receiving the second execution instruction, the smart speaker 143 automatically plays song B in response to the second execution instruction.
  • the type of the electronic device involved in this embodiment of the present application may be arbitrary.
  • the first electronic device may be, but not limited to, a mobile phone, a tablet computer, a smart speaker, a smart large screen (also called a smart TV), or a wearable device.
  • the second electronic device may also be, but not limited to, a mobile phone, a tablet computer, a smart speaker, a smart large screen (also called a smart TV), or a wearable device.
  • FIG. 15 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.
  • an electronic device 1500 may include a processor 1510, an internal memory 1520, a communication module 1530, an audio module 1540, a speaker 1541, a microphone 1542, and an antenna.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 1500 .
  • the electronic device 1500 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • For example, when the electronic device 1500 is a mobile phone, the electronic device 1500 may also include an external memory interface, a universal serial bus (universal serial bus, USB) interface, a charging management module, a power management module, a battery, a receiver, an earphone jack, a sensor module, buttons, a motor, an indicator, a camera, a display screen, a subscriber identity module (subscriber identity module, SIM) card interface, and the like.
  • the sensor module can include pressure sensor, gyroscope sensor, air pressure sensor, magnetic sensor, acceleration sensor, distance sensor, proximity light sensor, fingerprint sensor, temperature sensor, touch sensor, ambient light sensor, bone conduction sensor, etc.
  • The processor 1510 may include one or more processing units. For example, the processor 1510 may include an application processor (application processor, AP), a modem processor, a controller, a memory, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural-network processing unit (neural-network processing unit, NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 1500 .
  • the controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.
  • a memory may also be provided in the processor 1510 for storing instructions and data.
  • processor 1510 may include one or more interfaces.
  • the interface may include an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a general-purpose input/output (general-purpose input/output, GPIO) interface, and the like.
  • the I2S interface can be used for audio communication.
  • processor 1510 may include multiple sets of I2S buses.
  • the processor 1510 may be coupled to the audio module 1540 through an I2S bus to implement communication between the processor 1510 and the audio module 1540 .
  • the PCM interface can also be used for audio communication, sampling, quantizing and encoding the analog signal.
  • the audio module 1540 and the wireless communication module in the communication module 1530 may be coupled through a PCM bus interface. Both the I2S interface and the PCM interface can be used for audio communication.
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface can be used to connect the processor 1510 with the audio module 1540 and so on.
  • the GPIO interface can also be configured as an I2S interface, etc.
  • the interface connection relationship between modules shown in the embodiment of the present application is only a schematic illustration, and does not constitute a structural limitation of the electronic device 1500 .
  • the electronic device 1500 may also adopt an interface connection manner different from that in the foregoing embodiment, or a combination of multiple interface connection manners.
  • the communication module 1530 may include a mobile communication module and/or a wireless communication module.
  • the wireless communication function of the electronic device 1500 may be realized by an antenna, a mobile communication module, a wireless communication module, a modem processor, a baseband processor, and the like.
  • the mobile communication module can provide wireless communication solutions including 2G/3G/4G/5G applied on the electronic device 1500 .
  • the mobile communication module may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA) and the like.
  • the mobile communication module can receive electromagnetic waves through the antenna, filter and amplify the received electromagnetic waves, and send them to the modem processor for demodulation.
  • The mobile communication module can also amplify the signal modulated by the modem processor, convert it into an electromagnetic wave, and radiate the electromagnetic wave through the antenna.
  • at least part of the functional modules of the mobile communication module may be set in the processor 1510 .
  • at least part of the functional modules of the mobile communication module and at least part of the modules of the processor 1510 may be set in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator sends the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low-frequency baseband signal is passed to the application processor after being processed by the baseband processor.
  • the application processor outputs a sound signal through an audio device (not limited to a speaker, etc.).
  • The wireless communication module can provide wireless communication solutions applied on the electronic device 1500, including wireless local area network (wireless local area network, WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (bluetooth, BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared (infrared, IR), and other wireless communication solutions.
  • the wireless communication module may be one or more devices integrating at least one communication processing module.
  • the wireless communication module receives electromagnetic waves through the antenna, frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 1510 .
  • the wireless communication module can also receive the signal to be sent from the processor 1510, frequency-modulate it, amplify it, and convert it into electromagnetic wave to radiate through the antenna.
  • the mobile phone transmits information such as execution instructions, historical intentions, and slots to the large-screen device through a Wi-Fi point-to-point connection.
  • the NPU is a neural-network (NN) computing processor.
  • By referring to the structure of biological neural networks, for example, the transfer mode between neurons in the human brain, the NPU can quickly process input information and continuously perform self-learning.
  • Applications such as intelligent cognition of the electronic device 1500, for example, image recognition, face recognition, speech recognition, and text understanding, can be realized through the NPU.
  • For example, the mobile phone recognizes the input user voice through the NPU to obtain the text information of the user voice, performs semantic understanding on the text information of the user voice, and extracts the slots and intention of the user voice.
  • Internal memory 1520 may be used to store computer-executable program code, which includes instructions.
  • the processor 1510 executes various functional applications and data processing of the electronic device 1500 by executing instructions stored in the internal memory 1520 .
  • the internal memory 1520 may include an area for storing programs and an area for storing data.
  • the storage program area can store an operating system, at least one application program required by a function (such as a sound playing function) and the like.
  • the storage data area can store data (such as audio data) created during the use of the electronic device 1500.
  • the internal memory 1520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (universal flash storage, UFS).
  • the internal memory 1520 stores a voice assistant application program or an application program integrated with a voice assistant function.
  • the electronic device 1500 may implement an audio function through an audio module 1540, a speaker 1541, a microphone 1542, and an application processor. Such as music playback, recording, etc.
  • the audio module 1540 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 1540 can also be used to encode and decode audio signals.
  • the audio module 1540 may be set in the processor 1510 , or some functional modules of the audio module 1540 may be set in the processor 1510 .
  • The speaker 1541, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal.
  • The electronic device 1500 can listen to music or answer a hands-free call through the speaker 1541.
  • The microphone 1542, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal.
  • the electronic device 1500 may be provided with at least one microphone 1542 .
  • the electronic device 1500 may be provided with two microphones 1542, which may also implement a noise reduction function in addition to collecting sound signals.
  • the electronic device 1500 can also be provided with three, four or more microphones 1542 to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions, etc.
  • The electronic device 1500 can collect the user's voice through the microphone 1542 and the audio module 1540, and output voice through the speaker 1541 and the audio module 1540, so as to realize human-machine dialogue.
  • the software system of the electronic device 1500 may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
  • the embodiment of the present application uses a layered architecture as an example to illustrate the software structure of the electronic device 1500 .
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate through software interfaces.
  • In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime (Android runtime) and system libraries, and the kernel layer.
  • the application layer can include some application packages.
  • the application package can include applications such as voice assistant, camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, and short message.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, a phone book, and the like.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on.
  • the view system can be used to build applications.
  • a display interface can consist of one or more views. For example, a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide a communication function of the electronic device 1500 . For example, the management of call status (including connected, hung up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify the download completion, message reminder, etc.
  • The notification manager can also present a notification in the top status bar of the system in the form of a chart or scroll-bar text, for example, a notification of an application running in the background, or present a notification on the screen in the form of a dialog window.
  • For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, or the indicator light flashes.
  • The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application program layer and the application program framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
  • the electronic device may include a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • When the processor executes the computer program, the method in any one of the foregoing method embodiments is implemented.
  • the embodiment of the present application also provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be realized.
  • An embodiment of the present application provides a computer program product.
  • When the computer program product runs on an electronic device, the electronic device implements the steps in the foregoing method embodiments.
  • The embodiment of the present application also provides a chip system. The chip system includes a processor coupled to a memory, and the processor executes a computer program stored in the memory to implement the method in the foregoing method embodiments.
  • the chip system may be a single chip, or a chip module composed of multiple chips.
  • references to "one embodiment” or “some embodiments” or the like in the specification of the present application means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases "in one embodiment", "in some embodiments", "in other embodiments", and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized.


Abstract

A cross-device dialogue service continuation method, system, electronic device and storage medium, used to implement cross-device continuation of a dialogue service. First, after collecting a first user voice, if a first electronic device (62) determines that the first user voice contains information indicating that an instruction is to be sent to a second electronic device (64), the first electronic device sends first information and a first execution instruction to the second electronic device (64), where the first information includes information describing the intention of the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice; then, after receiving the first execution instruction, the second electronic device (64) collects a second user voice and executes a second execution instruction corresponding to the second user voice, where the second execution instruction is an instruction generated according to the first information and the second user voice.

Description

Cross-device dialogue service continuation method and system, electronic device, and storage medium
The present application claims priority to Chinese Patent Application No. 202110681520.3, filed with the China National Intellectual Property Administration on June 18, 2021 and entitled "Cross-device dialogue service continuation method, system, electronic device and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of natural language processing (Natural Language Processing, NLP), and in particular, to a cross-device dialogue service continuation method and system, an electronic device, and a computer-readable storage medium.
Background
With the continuous development of artificial intelligence (Artificial Intelligence, AI) technologies, the application of NLP, a branch of AI, is becoming more and more widespread.
At present, an electronic device can perform human-machine interaction with a user based on an NLP dialogue system to implement a corresponding voice service. For example, after waking up the voice assistant of a mobile phone, the user inputs the voice "play song A" into the mobile phone; the mobile phone processes the user's input voice based on the dialogue system, obtains an execution instruction for playing song A, and automatically plays song A in response to the execution instruction.
At this stage, electronic devices cannot yet implement cross-device continuation of a dialogue service.
Summary
Embodiments of the present application provide a cross-device dialogue service continuation method and system, an electronic device, and a computer-readable storage medium, which can implement cross-device continuation of a dialogue service.
According to a first aspect, an embodiment of the present application provides a cross-device dialogue service continuation system, where the system includes a first electronic device and at least one second electronic device.
The first electronic device is configured to: collect a first user voice; and if it is determined that the first user voice contains information indicating that an instruction is to be sent to the second electronic device, send first information and a first execution instruction to the second electronic device, where the first information includes information describing the intention of the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice.
The second electronic device is configured to: collect a second user voice after receiving the first execution instruction; and execute a second execution instruction corresponding to the second user voice, where the second execution instruction is an instruction generated according to the first information and the second user voice.
Based on the foregoing technical solution, when sending the first execution instruction to the second electronic device, the first electronic device also sends the first information to the second electronic device, that is, it also sends the information describing the intention of the first user voice; in other words, the first information is passed to the second electronic device along with the service flow. In this way, the second electronic device can perform semantic understanding on the newly collected second user voice according to the information, from the first electronic device, describing the intention of the first user voice, so as to determine the intention of the second user voice, thereby implementing cross-device continuation of the dialogue service.
For example, the first electronic device is a mobile phone, the second electronic device is a large-screen device, the first user voice is "recommend a song to the large-screen device", the intention of this user voice is to recommend music, and the second user voice is "change one". The mobile phone sends the first execution instruction and the music recommendation intention to the large-screen device; when the large-screen device collects "change one", it recognizes, according to the music recommendation intention, that the intention of "change one" is to recommend music, and recommends another song to the user in response to "change one".
In a possible implementation of the first aspect, the information describing the intention of the first user voice includes a first text of the first user voice and/or a first intention of the first user voice.
For example, the first user voice is "recommend a song to the large-screen device", the first text is the text "recommend a song to the large-screen device", and the first intention is to recommend music.
In a possible implementation of the first aspect, the first information includes texts and/or intentions of N rounds of dialogue, where N is a positive integer greater than 1; the texts of the N rounds of dialogue include the first text of the first user voice, and the intentions of the N rounds of dialogue include the first intention of the first user voice, where the N rounds of dialogue are user dialogues collected by the first electronic device.
In this implementation, the first electronic device passes information such as the intentions of the N rounds of dialogue to the second electronic device, which allows the second electronic device to more accurately recognize the intention of a newly collected user voice, implementing more open cross-device dialogue service continuation. The N rounds of dialogue may include the first user voice.
For example, the first electronic device may pass the related information of the most recent N rounds of dialogue to the second electronic device, for example, N = 3.
In a possible implementation of the first aspect, the first execution instruction includes information representing the slots of the first user voice. In this way, the second electronic device can more accurately recognize a newly collected user voice, implementing more open cross-device dialogue service continuation.
For example, the first user voice is "recommend song A to the large-screen device". The first electronic device not only passes the intention information of this user voice to the second electronic device, but also passes the song slot extracted from this user voice to the second electronic device.
In a possible implementation of the first aspect, the first electronic device is specifically configured to: perform speech recognition on the first user voice to obtain the first text; perform semantic understanding on the first text to obtain the first intention and a first slot of the first user voice; if the first slot includes a target device slot and the entity of the target device slot is the second electronic device, determine that the first user voice contains the information indicating that an instruction is to be sent to the second electronic device; and generate, according to the first intention and the first slot, the first execution instruction corresponding to the first user voice.
In a possible implementation of the first aspect, the system further includes a third electronic device communicatively connected to the first electronic device. The first electronic device is specifically configured to: send the first user voice to the third electronic device; receive the first slot, the first intention, and the first execution instruction from the third electronic device, where the first slot and the first intention are extracted by the third electronic device from the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice generated by the third electronic device according to the first slot and the first intention; and if the first slot includes a target device slot and the entity of the target device slot is the second electronic device, determine that the first user voice contains the information indicating that an instruction is to be sent to the second electronic device.
In this implementation, the first electronic device can use the voice service capability of the third electronic device to parse and recognize the first user voice. In this way, the first electronic device may be a device that is not capable of deploying a voice service system, which broadens the application scope of cross-device dialogue service continuation.
For example, the first electronic device may also be a device such as a smart watch, smart earphones, or a smart speaker. Even if such devices are not capable of deploying modules such as speech recognition, semantic understanding, and dialogue management, cross-device dialogue service continuation can still be implemented.
In a possible implementation of the first aspect, the second electronic device is specifically configured to: perform speech recognition on the second user voice to obtain a second text; perform semantic understanding on the second text according to the first information to obtain semantic information of the second user voice; and generate, according to the semantic information of the second user voice, the second execution instruction corresponding to the second user voice.
In a possible implementation of the first aspect, the second electronic device is specifically configured to: use the first information as the latest context of a semantic understanding module, where the second electronic device includes the semantic understanding module; and input the second text into the semantic understanding module to obtain the semantic information of the second user voice output by the semantic understanding module, where the semantic understanding module performs semantic understanding on the second text by using the latest context.
In a possible implementation of the first aspect, the system further includes a fourth electronic device communicatively connected to the second electronic device. The second electronic device is specifically configured to: send the second user voice and the first information to the fourth electronic device; and receive the semantic information of the second user voice and the second execution instruction from the fourth electronic device;
where the semantic information of the second user voice is information obtained by the fourth electronic device by performing semantic understanding on the second user voice according to the first information, and the second execution instruction is the execution instruction corresponding to the second user voice generated by the fourth electronic device according to the semantic information of the second user voice.
In this implementation, the second electronic device can use the voice service capability of the fourth electronic device to parse and recognize the second user voice. In this way, the second electronic device may be a device that is not capable of deploying a voice service system, which broadens the application scope of cross-device dialogue service continuation.
In a possible implementation of the first aspect, the first electronic device is specifically configured to: determine whether the user account of the first electronic device and the user account of the second electronic device belong to the same user; and if so, send the first execution instruction and the first information to the second electronic device, and send second information to the second electronic device, where the second information includes any one or any combination of first user information, scene information, and first application status information;
where the first user information is information describing the user of the first electronic device, the first application status information is information representing a first target application on the first electronic device, and the scene information is information describing the user's scene;
the second electronic device is specifically configured to: generate the second execution instruction according to the first information, the second user voice, and the second information.
In this implementation, the first electronic device automatically identifies whether the users of the two devices are the same user. If they are the same user, in addition to passing the first execution instruction and the first information to the second electronic device, the first electronic device also sends the second information to the second electronic device. In this way, the second electronic device can provide the user with more personalized and precise services according to the second information, improving user experience in cross-device dialogue service continuation.
For example, the first user information is information about user A, from which it can be learned that the user's preferred song genre is pop. From the scene information, it can be learned that the user is walking, that is, in a sports scene. The first target application is Huawei Music installed on the mobile phone, and the first application status information includes the song playback history in Huawei Music.
In this case, the first user voice is "recommend a song to the large-screen device", and the second user voice is "change one". After the mobile phone sends this information to the large-screen device, when generating the second execution instruction, the large-screen device determines, based on the facts that the user is in a sports scene and that the user's preferred genre is pop, that a pop song for a sports scene should be recommended. Further, based on the song playback history, it selects the most frequently played pop song for the sports scene as the recommended song. In this way, the recommended song better matches the user's preferences.
In addition, the second information can also be used for semantic understanding by the second electronic device, so that the second electronic device can more accurately understand the intention of a newly collected user voice, implementing more open cross-device dialogue service continuation. For example, the first electronic device passes historically played song information, including song names, to the second electronic device. When the second electronic device collects "change to XXX", it can recognize, according to the song names, that "XXX" is a song name, and then recognize that the intention of the newly collected user voice is to play song XXX.
In a possible implementation of the first aspect, the first electronic device is specifically configured to: if the user account of the first electronic device and the user account of the second electronic device do not belong to the same user, send the first execution instruction and the first information to the second electronic device;
the second electronic device is specifically configured to: generate the second execution instruction according to the first information, the second user voice, and third information, where the third information includes second user information and/or second application status information; the second user information is information describing the user of the second electronic device, and the second application status information is information representing a second target application on the second electronic device.
In this implementation, the first electronic device automatically identifies whether the accounts of the two devices belong to the same user; if not, it does not need to send the user information on the first electronic device. In this case, the second electronic device can provide the user with more personalized and precise services according to the related information of the local device. For example, the second target application may be the Huawei Music application installed on the second electronic device.
In a possible implementation of the first aspect, if there are at least two second electronic devices and the at least two second electronic devices are connected to the first electronic device in different manners, the first electronic device is specifically configured to:
determine the types of the communication connections to the at least two second electronic devices; and
send, according to the types of the communication connections, the first information and the first execution instruction to the at least two second electronic devices through the different communication connections respectively.
In this implementation, if the first execution instruction and the first information need to be distributed to at least two second electronic devices, the first electronic device automatically identifies the communication connection types and sends the corresponding information to the corresponding second electronic devices according to the communication connection types.
Distributing the corresponding information to two second electronic devices simultaneously through a single voice command, so as to continue the dialogue service to the at least two second electronic devices, is more convenient and provides a better user experience.
In a possible implementation of the first aspect, the second electronic device is specifically configured to: collect the second user voice while executing the first execution instruction or while prompting the user whether to execute the first execution instruction. Prompting the user whether to execute the first execution instruction makes cross-device dialogue service continuation more user-friendly.
In a possible implementation of the first aspect, the second electronic device is further configured to: wake up a voice assistant after receiving the first execution instruction, where the second electronic device includes the voice assistant.
In this implementation, the second electronic device wakes up the voice assistant automatically, without requiring the user to wake up the voice assistant on the second electronic device with a specific wake-up word, making cross-device dialogue service continuation smoother and the user experience better.
In a possible implementation of the first aspect, the first execution instruction is an instruction for recommending music, and the second execution instruction is an instruction for recommending another song.
According to a second aspect, an embodiment of the present application provides a cross-device dialogue service continuation method, applied to a first electronic device. The method includes: collecting a first user voice; and after determining that the first user voice contains information indicating that an instruction is to be sent to a second electronic device, sending first information and a first execution instruction to the second electronic device, where the first information includes information describing the intention of the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice.
In a possible implementation of the second aspect, the information describing the intention of the first user voice includes a first text of the first user voice and/or a first intention of the first user voice.
In a possible implementation of the second aspect, the first information includes texts and/or intentions of N rounds of dialogue, where N is a positive integer greater than 1; the texts of the N rounds of dialogue include the first text of the first user voice, and the intentions of the N rounds of dialogue include the first intention of the first user voice, where the N rounds of dialogue are user dialogues collected by the first electronic device.
In a possible implementation of the second aspect, the first execution instruction includes information representing the slots of the first user voice.
In a possible implementation of the second aspect, the sending the first information and the first execution instruction to the second electronic device after determining that the first user voice contains information indicating the second electronic device includes:
performing speech recognition on the first user voice to obtain the first text;
performing semantic understanding on the first text to obtain the first intention and a first slot of the first user voice;
if the first slot includes a target device slot and the entity of the target device slot is the second electronic device, determining that the first user voice contains the information indicating that an instruction is to be sent to the second electronic device;
generating, according to the first intention and the first slot, the first execution instruction corresponding to the first user voice; and
sending the first information and the first execution instruction to the second electronic device, where the first information includes the first intention and/or the first text.
In a possible implementation of the second aspect, the sending the first information and the first execution instruction to the second electronic device after determining that the first user voice contains information indicating the second electronic device includes:
sending the first user voice to a third electronic device;
receiving a first slot, the first intention, and the first execution instruction from the third electronic device, where the first slot and the first intention are extracted by the third electronic device from the first user voice, and the first execution instruction is the execution instruction corresponding to the first user voice generated by the third electronic device according to the first slot and the first intention;
if the first slot includes a target device slot and the entity of the target device slot is the second electronic device, determining that the first user voice contains the information indicating that an instruction is to be sent to the second electronic device; and
sending the first information and the first execution instruction to the second electronic device, where the first information includes the first intention and/or the first text of the first user voice.
In a possible implementation of the second aspect, before the sending the first information and the first execution instruction to the second electronic device, the method further includes:
determining whether the user account of the first electronic device and the user account of the second electronic device belong to the same user; and
if so, proceeding to the step of sending the first execution instruction and the first information to the second electronic device, and sending second information to the second electronic device, where the second information includes any one or any combination of first user information, scene information, and first application status information;
where the first user information is information describing the user of the first electronic device, the scene information is information describing the user's scene, and the first application status information is information representing a first target application on the first electronic device.
In a possible implementation of the second aspect, if there are at least two second electronic devices, the sending the first information and the first execution instruction to the second electronic device includes:
determining the types of the communication connections to the at least two second electronic devices; and
sending, according to the types of the communication connections, the first information and the first execution instruction to the at least two second electronic devices through the different communication connections respectively.
In a possible implementation of the second aspect, the first execution instruction is an instruction for recommending music.
According to a third aspect, an embodiment of the present application provides a cross-device dialogue service continuation method, applied to a second electronic device. The method includes:
receiving a first execution instruction and first information from a first electronic device, where the first information includes information describing the intention of a first user voice, the first execution instruction is the execution instruction corresponding to the first user voice, and the first user voice is a voice that is collected by the first electronic device and contains information indicating that an instruction is to be sent to the second electronic device; and
collecting a second user voice, and executing a second execution instruction corresponding to the second user voice, where the second execution instruction is an instruction generated according to the first information and the second user voice.
In a possible implementation of the third aspect, the information describing the intention of the first user voice includes a first text of the first user voice and/or a first intention of the first user voice.
In a possible implementation of the third aspect, the first information includes texts and/or intentions of N rounds of dialogue, where N is a positive integer greater than 1; the texts of the N rounds of dialogue include the first text of the first user voice, and the intentions of the N rounds of dialogue include the first intention of the first user voice, where the N rounds of dialogue are user dialogues collected by the first electronic device.
In a possible implementation of the third aspect, the first execution instruction includes information representing the slots of the first user voice.
In a possible implementation of the third aspect, the executing the second execution instruction corresponding to the second user voice includes:
performing speech recognition on the second user voice to obtain a second text;
performing semantic understanding on the second text according to the first information to obtain semantic information of the second user voice;
generating, according to the semantic information of the second user voice, the second execution instruction corresponding to the second user voice; and
executing the second execution instruction.
In a possible implementation of the third aspect, the performing semantic understanding on the second text according to the first information to obtain the semantic information of the second user voice includes:
using the first information as the latest context of a semantic understanding module, where the second electronic device includes the semantic understanding module; and
inputting the second text into the semantic understanding module to obtain the semantic information of the second user voice output by the semantic understanding module, where the semantic understanding module performs semantic understanding on the second text by using the latest context.
In a possible implementation of the third aspect, if the user account of the first electronic device and the user account of the second electronic device belong to the same user, the method further includes:
receiving second information from the first electronic device, where the second information includes any one or any combination of first user information, scene information, and first application status information;
the generating, according to the semantic information of the second user voice, the second execution instruction corresponding to the second user voice includes:
generating the second execution instruction according to the semantic information and the second information;
where the first user information is information describing the user of the first electronic device, the scene information is information describing the user's scene, and the first application status information is information representing a first target application on the first electronic device.
In a possible implementation of the third aspect, if the user account of the first electronic device and the user account of the second electronic device do not belong to the same user, the generating, according to the semantic information of the second user voice, the second execution instruction corresponding to the second user voice includes:
generating the second execution instruction according to the semantic information and third information;
where the third information includes second user information and/or second application status information, the second user information is information describing the user of the second electronic device, and the second application status information is information representing a second target application on the second electronic device.
In a possible implementation of the third aspect, the executing the second execution instruction corresponding to the second user voice includes:
sending the second user voice and the first information to a fourth electronic device;
receiving the semantic information of the second user voice and the second execution instruction from the fourth electronic device;
where the semantic information of the second user voice is information obtained by the fourth electronic device by performing semantic understanding on the second user voice according to the first information, and the second execution instruction is the execution instruction corresponding to the second user voice generated by the fourth electronic device according to the semantic information of the second user voice; and
executing the second execution instruction.
In a possible implementation of the third aspect, the collecting the second user voice includes: collecting the second user voice while executing the first execution instruction or while prompting the user whether to execute the first execution instruction.
In a possible implementation of the third aspect, before the collecting the second user voice, the method further includes: waking up a voice assistant after receiving the first execution instruction, where the second electronic device includes the voice assistant.
In a possible implementation of the third aspect, the second execution instruction is an instruction for recommending another song.
According to a fourth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method according to any one of the second aspect or the third aspect.
According to a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the method according to any one of the second aspect or the third aspect.
According to a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor coupled to a memory, and the processor executes a computer program stored in the memory to implement the method according to any one of the second aspect or the third aspect. The chip system may be a single chip or a chip module composed of multiple chips.
According to a seventh aspect, an embodiment of the present application provides a computer program product, which, when run on an electronic device, causes the electronic device to perform the method according to any one of the second aspect or the third aspect.
It can be understood that, for the beneficial effects of the second to seventh aspects, reference may be made to the related description in the first aspect, and details are not repeated here.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a voice service system provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario in which a mobile phone is voice-controlled to play music provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a scenario in which a mobile phone recommends music to a large-screen device provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a cross-device dialogue service continuation system provided by an embodiment of the present application;
FIG. 5 is another schematic diagram of a cross-device dialogue service continuation system provided by an embodiment of the present application;
FIG. 6A and FIG. 6B are schematic diagrams of scenarios in which a mobile phone recommends music to a large-screen device provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of a mobile phone recommending music to a large-screen device provided by an embodiment of the present application;
FIG. 8A to FIG. 8C are schematic diagrams of navigation scenarios provided by an embodiment of the present application;
FIG. 8D is a schematic diagram of a video recommendation scenario provided by an embodiment of the present application;
FIG. 9 is another schematic diagram of a cross-device dialogue service continuation system provided by an embodiment of the present application;
FIG. 10 is another schematic flowchart of a cross-device dialogue service continuation method provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a scenario in which earphones transfer music to a speaker for playback provided by an embodiment of the present application;
FIG. 12 is still another schematic diagram of a cross-device dialogue service continuation system provided by an embodiment of the present application;
FIG. 13 is still another schematic flowchart of a cross-device dialogue service continuation method provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of a scenario in which a mobile phone recommends music to a smart speaker provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.
Detailed Description
The following gives an exemplary introduction to the related content that may be involved in the embodiments of the present application.
(1)语音业务系统,或称对话业务系统。参见图1,示出了本申请实施例提供的语音业务系统示意图。如图1所示,该语音系统可以包括语音识别(Automatic Speech Recognition,ASR)模块11、语义理解(Natural Language Understanding,NLU)模块12、对话管理(Dialogue Management,DM)模块13以及语音合成(Text To Speech,TTS)模块14。
其中,语音识别模块11用于将用户15输入的语音信息转化成文本信息。
语义理解模块12用于根据语音识别模块11输出的文本信息进行语义理解,得到语义信息,该语义信息通常包括意图和槽位值。
对话管理模块13用于根据语义理解模块12输出的语义信息以及对话状态,更新系统状态,并输出下一步的系统动作。
对话管理模块13中包括对话状态追踪(Dialog State Tracking,DST)子模块和对话决策(Dialog Policy,DP)子模块。对话状态追踪子模块用于维护和更新对话状态,对话决策子模块用于根据对话状态和语义信息等,产生系统行为,以决定下一步的动作。
电子设备可以根据对话管理模块13输出的指令,执行对应的操作。如果对话管理模块13输出的指令为用于指示输出语音的指令,语音合成模块14可以根据该指令,生成语音信息,得到输出语音。例如,用户15输入的语音信息为“播放一首歌曲”,对话管理模块13输出用于指示输出语音的指令,语音合成模块14根据该指令,生成输出语音“你要播放什么歌曲?”。
如果对话管理模块13输出的指令是其他类型的指令,电子设备则响应于该指令,执行对应的操作。示例性地,本申请实施例中,对话管理模块13的输出可以具体表现为执行指令,该执行指令用于指示下一步的动作。例如,用户15的输入语音信息为“播放歌曲A”,对话管理模块13输出播放歌曲A的执行指令,电子设备响应于该执行指令,自动播放歌曲A。
可以理解的是,在其他语音业务系统中,除了可以包括图1示出的模块之外,还可以包括自然语言生成(Natural Language Generation,NLG)模块。自然语言生成模块用于将对话管理模块13输出的系统动作进行文本化,得到自然语言文本。而自然语言生成模块输出的自然语言文本,可以作为语音合成模块14的输入;语音合成模块14将输入的自然语言文本转化为语音信息,得到输出语音。
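为便于理解上述各模块的串联关系,下面给出一段示意性的Python代码(仅为辅助理解的草图,并非本申请的实际实现,其中的类名、方法名均为假设):

from dataclasses import dataclass, field

@dataclass
class SemanticInfo:
    intent: str                                 # 意图,例如"播放音乐"
    slots: dict = field(default_factory=dict)   # 槽位名 -> 槽位值(实体)

@dataclass
class Action:
    kind: str        # "speak"表示用于指示输出语音的指令,其余为其他类型的执行指令
    payload: str

class VoiceServiceSystem:
    """对应图1的语音业务系统;asr/nlu/dm/tts为四个模块的占位对象。"""
    def __init__(self, asr, nlu, dm, tts):
        self.asr, self.nlu, self.dm, self.tts = asr, nlu, dm, tts

    def handle(self, audio: bytes):
        text = self.asr.recognize(audio)        # 语音识别:语音 -> 文本
        semantic = self.nlu.understand(text)    # 语义理解:文本 -> 意图和槽位
        action = self.dm.decide(semantic)       # 对话管理:语义信息 -> 下一步动作
        if action.kind == "speak":              # 指示输出语音的指令交给语音合成
            return self.tts.synthesize(action.payload)
        return action                           # 其他执行指令由电子设备执行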
(2)意图、槽位和槽位值。
意图可以是指用户语音中表达的用户目的。例如,用户语音为“深圳今天的天气怎么样”,该语音的意图为“查询天气”。又例如,用户语音为“播放一首歌曲”,该语音的意图为“播放音乐”。
每个意图下可以配置有一个或多个槽位。槽位是指系统需要从用户语音中收集的关键信息。例如,针对查询天气这一意图,配置的槽位可以包括地点槽位和时间槽位。地点槽位用于确定需要查询哪个地点的天气,时间槽位用于确定需要查询什么时候的天气。
槽位包括槽位值等属性,槽位值是指槽位的具体参数,又可称为槽位的实体。例如,用户语音为“今天深圳的天气怎么样”,从该语音中可以提取出地点槽位和时间槽位,地点槽位的实体为“深圳”,时间槽位的实体为“今天”。
具体应用中,可以预先设置意图类别,以及每个意图类别下所配置的槽位。示例性地,在本申请实施例中,推荐音乐意图下配置的槽位包括但不限于目标设备槽位,该目标设备槽位用于指示接续对话业务的目标设备。例如,手机需要将对话业务接续至大屏设备,此时,源设备为手机,目标设备为大屏设备。
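基于上述定义,“预先设置意图类别及其槽位配置,并从语义理解结果中收集槽位实体”的思路可以用如下Python草图示意(意图名称、槽位名称与数据结构均为举例假设):

# 预先设置的意图类别及每个意图类别下所配置的槽位(示例)
INTENT_SLOT_CONFIG = {
    "查询天气": ["地点", "时间"],
    "推荐音乐": ["目标设备", "歌曲名称"],
}

def fill_slots(intent: str, semantic_slots: dict) -> dict:
    """按意图配置收集槽位:只保留该意图下配置的槽位及其实体(槽位值)。"""
    expected = INTENT_SLOT_CONFIG.get(intent, [])
    return {name: semantic_slots[name] for name in expected if name in semantic_slots}

# 例如,对"今天深圳的天气怎么样"的语义理解结果:
print(fill_slots("查询天气", {"地点": "深圳", "时间": "今天"}))
# -> {'地点': '深圳', '时间': '今天'}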
电子设备可以基于上述语音业务系统,与用户进行一轮或多轮的人机对话,以实现相应的语音业务。
示例性地,参见图2,示出了本申请实施例提供的语音控制手机播放音乐的场景示意图。如图2所示,手机21的主界面22上包括应用商城、时钟、备忘录、图库以及音乐等应用程序。
用户通过唤醒词“小艺小艺”唤醒手机21的语音助手小艺之后,手机21采集到用户语音“小艺小艺,推荐一首歌曲”;然后,手机21的语音助手通过上述图1示出的语音业务系统,对用户语音进行语音识别、语义理解等过程,确定出用户意图为推荐音乐,并得到推荐音乐的执行指令。此时,由于从用户语音中提取不出歌曲名称槽位的实体,手机21可以根据预设推荐规则,确定出推荐歌曲。例如,手机21可以根据用户的历史播放记录,将最近7天内播放最多的歌曲作为推荐歌曲。
手机21响应于推荐音乐的执行指令,自动播放歌曲A,并显示语音助手界面23。语音助手界面23包括用户语音的文本24,语音助手针对用户语音的回答语句文本25,以及音乐控件26。此时,音乐控件26内显示正在播放的歌曲为歌曲A。
手机21在响应于用户语音“小艺小艺,推荐一首歌曲”,自动播放歌曲A之后,用户想换一首歌,则向手机21输入用户语音“换一个”。手机21采集到用户语音“换一个”之后,通过语音识别模块,将用户语音转化成文本信息,并在语音助手界面23内显示用户输入语音的文本信息27。
手机21将“换一个”的文本信息输入语义理解模块,语义理解模块根据历史意图和输入的文本信息等信息,确定出用户意图为换歌单,并得到播放另一首歌曲的执行指令。此时,历史意图是推荐音乐,其是根据用户语音“小艺小艺,推荐一首歌曲” 确定出来的意图。
手机21响应于播放另一首歌曲的执行指令,自动播放歌曲B,并显示语言助手界面28。语音助手界面28包括用户语音的文本29以及音乐控件26,此时,音乐控件26内显示正在播放的歌曲为歌曲B。
在图2示出的场景中,用户和手机21的语音助手小艺之间的整个对话交互过程可以如下:
用户:小艺小艺。
小艺:在的。
用户:小艺小艺,推荐一首歌曲。
小艺:好的。
用户:换一个。
小艺:好的。
在该对话过程中,用户语音“换一个”并没有明确用户意图,但手机21仍然可以根据历史意图和对话语料等上下文信息,准确地识别出用户意图。此时,对话语料可以包括“小艺小艺,推荐一首歌曲”。这是因为整个对话交互过程均发生在手机21一侧,手机21上存储有对话过程的相关信息。
但是,在一些情况下,例如,跨设备的对话业务接续,上述对话过程的一部分发生在第一电子设备,另一部分发生在第二电子设备,第二电子设备上没有整个对话过程的相关信息,可能导致第二电子设备不能识别用户意图,进而不能实现跨设备的对话业务接续。
示例性地,参见图3,示出了本申请实施例提供的手机给大屏设备推荐音乐的场景示意图。此时,第一电子设备为手机,第二电子设备为大屏设备。手机和大屏设备内均设置有语音助手,并部署有图1的语音业务系统。
如图3所示,用户31向手机32输入用户语音“推荐一首歌曲给大屏设备”。手机32采集到用户语音“推荐一首歌曲给大屏设备”之后,手机32内的语音助手使用图1示出的语音业务系统,确定用户语音“推荐一首歌曲给大屏设备”的意图为推荐音乐,并且能提取出目标设备槽位,目标设备槽位的实体为大屏设备;然后,手机32生成推荐音乐的执行指令,并显示语音助手界面33,输出针对用户语音的回答语音“好的”;最后,向大屏设备34发送推荐音乐的执行指令。该推荐音乐的执行指令包括但不限于歌曲名称信息,用于指示大屏设备34播放歌曲。
大屏设备34接收到来自手机32的执行指令之后,响应于该执行指令,弹出窗口35。窗口35内显示有提示信息,用于询问用户是否播放手机推荐的歌曲A。用户可以通过点击窗口35上的“播放”按钮,以让大屏设备34播放歌曲A;也可以通过点击窗口上的“取消”按钮,以让大屏设备34取消播放歌曲A。
用户也可以通过向大屏设备34输入语音“播放”或“取消”,以向大屏设备34表明按钮选择意图。当用户输入语音为“播放”时,大屏设备34则选择“播放”按钮,当用户输入语音为“取消”时,大屏设备34则选择“取消”按钮。
在大屏设备34显示窗口35之后,如果用户31想要换一首歌曲,则可以向大屏设备34输入用户语音“换一个”。大屏设备34内的语音助手采集到用户语音“换一个”之后,在进行意图识别时,将文本信息“换一个”输入至语义理解模块。此时,大屏设备34本地没有对话过程的历史意图“推荐音乐”、目标设备槽位的实体以及“换一个”的历史语料“推荐一首歌曲给大屏设备”等上下文信息,使得语义理解模块无法识别该用户语音的意图,进而使得大屏设备34无法响应于该用户语音,播放另一首歌曲。
由上可见,手机32转接业务给大屏设备34之后,大屏设备34只能选择播放或取消,无法识别与之前的对话关联的其他用户语音,无法实现跨设备的对话业务接续。
发明人在研究过程中发现,可以将用于描述用户语音意图的相关信息,随着业务流一并传输给目标设备,以实现跨设备的对话业务接续。
本申请实施例提供一种跨设备的对话业务接续方案。在一些实施例中,第一电子设备转接业务给第二电子设备时,可以将最近N轮对话的意图和槽位等上下文信息传输给第二电子设备。也即,第一电子设备将最近N轮对话的意图和槽位等上下文信息随着业务流传输至第二电子设备。这样,第二电子设备在进行意图识别时,可以根据接收到的意图和槽位等上下文信息,准确识别出用户语音的意图,以实现跨设备的对话业务接续。其中,N为大于或等于1的正整数。
下面将结合附图,对本申请实施例提供的跨设备的对话业务接续方案进行详细阐述。以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。
参见图4,示出了本申请实施例提供的跨设备的对话业务接续系统的一种示意图,该系统可以包括第一电子设备41和第二电子设备42。第一电子设备41和第二电子设备42可以通过通信连接进行信息交互,该通信连接可以示例性为蓝牙连接或Wi-Fi点对点连接等。
第一电子设备41和第二电子设备42可以包括语音助手应用程序,也可以包括集成有语音助手功能的应用程序。
本申请实施例中,跨设备的对话业务接续系统可以包括一个或至少两个第二电子设备42。也就是说,第一电子设备41可以同时将业务流转至一个或至少两个第二电子设备42,并在业务流转的时候,将意图和槽位等信息传输至第二电子设备42。
第一电子设备41和第二电子设备42可以是登录同一个用户账号的设备。例如,第一电子设备41为手机,第二电子设备42为大屏设备,手机和大屏设备均登录同一个华为用户账号。
当然,第一电子设备41和第二电子设备42也可以是登录不同用户账号的设备,例如,第一电子设备41登录的账号为用户A,第二电子设备42登录的账号为用户B。此时,第一电子设备41和第二电子设备42可以同属于一个群组,例如,同属于一个家庭群组。或者,第一电子设备41和第二电子设备42可以是已建立可信连接的设备,例如,第一电子设备41为手机,第二电子设备42为大屏设备,手机和大屏设备通过“一碰连”,建立可信连接。
在一些实施例中,第一电子设备41和第二电子设备42均可以包括图1对应的语音业务系统,或者该语音业务系统内的部分模块。
此时,第一电子设备41和第二电子设备42通常是富设备。富设备是指:资源丰富的设备,该资源丰富的设备可以是指存储空间充裕的电子设备,和/或处理性能充裕的电子设备等。一般情况下,处理性能充裕,内存充裕,以及存储空间充裕的电子设备,可以称为富设备或胖设备。例如,富设备可以包括手机、电脑、服务器以及平板等。
与富设备相对的瘦设备是指:资源受限的设备,该资源受限的设备可以是指存储空间受限的电子设备,和/或处理性能受限的电子设备等。一般情况下,处理性能低,内存少,存储空间少的电子设备,可以称为瘦设备。例如,瘦设备可以包括耳机、音箱以及手表等。
示例性地,参见图5,示出了本申请实施例提供的跨设备的对话业务接续系统的另一种示意图,第一电子设备51包括第一应用程序511、第一语音识别模块512、第一语义理解模块513、第一对话管理模块514以及第一指令交互服务515;第二电子设备52包括第二应用程序521、第二语音识别模块522、第二语义理解模块523、第二对话管理模块524以及第二指令交互服务525。
需要说明的是,第一电子设备51和第二电子设备52可以包括NLG模块和TTS模块,也可以不包括NLG模块和TTS模块。
第一应用程序511可以为语音助手,也可以为集成有语音助手功能的应用程序。同理,第二应用程序521可以为语音助手,也可以为集成有语音助手功能的应用程序。
第一指令交互服务515和第二指令交互服务525用于电子设备之间的指令交互。在其他实施例中,也可以通过其他方式实现设备间的指令交互。
如图5所示,第一电子设备51采集到用户语音之后,第一应用程序511将用户语音输入至第一语音识别模块512,得到第一语音识别模块512输出的文本信息;然后,将第一语音识别模块512输出的文本信息,输入至第一语义理解模块513,得到第一语义理解模块513输出的用户语音的语义信息,该语义信息包括从用户语音提取的意图以及该意图对应的槽位等;再将语义信息输入至第一对话管理模块514,得到执行指令。
第一应用程序511获取到第一对话管理模块514输出的执行指令之后,将该执行指令、历史意图和历史意图对应的槽位等信息通过第一指令交互服务515传输至第二电子设备52。历史意图可以包括最近发生的N轮对话对应的意图。
需要说明的是,第一电子设备51可以根据用户语音,确定是否需要向第二电子设备52发送执行指令、历史意图和对应槽位等信息,即确定是否需要转接业务,也即,确定是否需要向第二电子设备52发送指令,并在发送指令的时候,携带历史意图和对应槽位等上下文信息。
例如,第一电子设备51如果可以从用户语音中提取出目标设备槽位,且目标设备槽位的实体不是本设备,则可以确定需要转接业务,并将目标设备槽位的实体作为转接业务的目标设备。也即,第一电子设备51确定需要向第二电子设备52(即目标设备)发送执行指令、历史意图和对应槽位等信息。例如,在图3的场景中,手机32从用户语音“推荐一首歌曲给大屏设备”中可以提取出目标设备槽位,且目标设备槽位的实体为大屏设备,则确定需要转接业务,且转接业务的目标设备是大屏设备。
进一步地,如果第一电子设备51可以从用户语音中提取出至少两个目标设备槽位, 且这至少两个目标设备槽位的实体均不是本设备,第一电子设备51则可以确定需要向这至少两个目标设备槽位对应的目标设备发送执行指令、历史意图和对应槽位等信息。此时,第一电子设备51可以将执行指令、历史意图和对应槽位等信息,同时分发至这至少两个目标设备槽位对应的目标设备。
第一电子设备51在将执行指令、历史意图和对应槽位等信息分发给至少两个第二电子设备52(即目标设备)的过程中,可以先判断与至少两个第二电子设备52之间是否已建立通信连接。
如果已建立通信连接,第一电子设备51可以进一步判断与至少两个第二电子设备52之间的通信连接是否相同。具体地,第一电子设备51根据通信连接的类型,将发送给同一个第二电子设备52的执行指令、历史意图和对应槽位等信息进行关联后,再将关联后的执行指令、历史意图和对应槽位等信息发送至对应的第二电子设备52。
例如,第一电子设备51为手机,第二电子设备52包括大屏设备和平板。手机接收到的用户语音为“推荐一首歌曲给大屏设备和平板”,手机从该用户语音中可以提取出两个目标设备槽位,这两个目标设备槽位的实体分别为“大屏设备”和“平板”,即第二电子设备包括大屏设备和平板。手机则可以确定需要将业务转接至大屏设备和平板,即需要分别向大屏设备和平板发送执行指令、历史意图和对应槽位等信息。
此时,手机检测出与大屏设备和平板均已建立通信连接,且与大屏设备之间的通信连接的类型为Wi-Fi点对点连接,与平板的通信连接的类型为蓝牙连接。手机根据Wi-Fi点对点协议,将执行指令、历史意图和对应槽位等信息进行打包后,将数据包通过Wi-Fi点对点连接发送给大屏设备,以将执行指令、历史意图和对应槽位等信息发送至大屏设备。同理,手机根据蓝牙协议,将执行指令、历史意图和对应槽位等信息进行打包后,将数据包通过蓝牙连接发送给平板。
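上述“按通信连接的类型打包并分发同一份信息”的逻辑,可以用如下Python草图示意(payload的打包格式、发送函数的注入方式均为假设,实际取决于蓝牙或Wi-Fi点对点协议栈):

import json

def pack(payload: dict) -> bytes:
    """将执行指令、历史意图和对应槽位等信息打包成字节流(示意)。"""
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")

def dispatch(payload: dict, targets: list, senders: dict):
    """targets: 形如[{"name": "大屏设备", "conn": "wifi_p2p", "addr": ...}, ...];
    senders: 连接类型 -> 发送函数,由调用方按已建立的通信连接注入。"""
    data = pack(payload)
    for dev in targets:
        send = senders.get(dev["conn"])     # 例如"wifi_p2p"或"bluetooth"
        if send is None:
            raise RuntimeError(f"与{dev['name']}尚未建立通信连接")
        send(dev["addr"], data)             # 通过对应类型的通信连接发送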
第一电子设备51和第二电子设备52之间如果没有建立通信连接,第一电子设备51可以判断是否已经与第二电子设备52配对,如果已配对,则可以根据本地存储的第二电子设备52的相关信息,向第二电子设备52发送建立连接的请求,以建立与第二电子设备52之间的通信连接。建立通信连接之后,第一电子设备51再根据通信连接的类型,将执行指令、历史意图和对应槽位等信息发送至第二电子设备52。其中,第二电子设备52的相关信息可以示例性包括设备标识和IP地址等。
例如,手机检测到与大屏设备之间没有建立连接,但本地上存储有大屏设备的相关信息,则根据大屏设备的IP地址和设备标识等信息,向大屏设备发送用于建立Wi-Fi点对点连接的请求。大屏设备在接收到用于建立Wi-Fi点对点连接的请求之后,可以响应于该请求,与手机建立Wi-Fi点对点连接。
第一电子设备51和第二电子设备52之间如果没有建立通信连接,也没有配对,第一电子设备51可以提示用户找不到对应的设备。
例如,手机检测到与大屏设备之间没有建立连接,也没有配对,手机可以通过提示窗口或者提示语音,提示用户找不到大屏设备,请建立与大屏设备之间的连接。
当然,第一电子设备51即使没有和第二电子设备52配对,但可以获取到第二电子设备52的相关信息,也可以根据第二电子设备52的相关信息,向第二电子设备52发起用于建立通信连接的请求。
另外,第一电子设备51如果可以从用户语音提取出至少两个设备槽位,且这两个设备槽位中有一个槽位的实体是本设备,其它槽位的实体不是本设备,则确定需要向对应的第二电子设备52(即其它槽位对应的设备)发送执行指令、历史意图和对应槽位等信息。此时,第一电子设备51在得到执行指令之后,可以在本设备执行第一执行指令。
例如,手机采集到的用户语音为“播放歌曲A,并将该歌曲A推荐给大屏设备”。手机从该用户语音中可以提取出两个设备槽位,一个设备槽位的实体是“本设备”,另一个设备槽位的实体是“大屏设备”。此时,由于用户语音中包括目标设备槽位,且目标设备槽位的实体不是本设备,则确定需要将执行指令、历史意图和对应槽位等信息发送至大屏设备。
手机根据用户语音“播放歌曲A,并将该歌曲A推荐给大屏设备”可以获得播放歌曲A的执行指令和推荐歌曲A的执行指令。在获得播放歌曲A的执行指令之后,手机可以自动播放歌曲A。同时,手机还将推荐歌曲A的执行指令、历史意图和对应槽位等信息发送至大屏设备。
可以看出,本申请实施例中,第一电子设备51可以根据用户语音,确定是否需要向一个或多个第二电子设备52发送执行指令、历史意图和对应槽位等信息。
相较而言,第一电子设备通过一个用户语音,分别向至少两个第二电子设备发送执行指令、历史意图和对应槽位等信息,以将对话业务接续至多个第二电子设备,这样可以提高业务转接的便捷性,提高用户体验。
示例性地,用户A想将歌曲A同时推荐给用户B和用户C。用户A的设备为手机,用户B的设备为大屏设备A,用户C的设备为大屏设备B。手机、大屏设备A和大屏设备B均属于同一个群组(例如,家庭群组或好友群组)。
用户A向手机输入用户语音“向大屏设备A和大屏设备B推荐歌曲A”。手机从该用户语音中可以提取出两个目标设备槽位,且这两个目标设备槽位的实体分别为“大屏设备A”和“大屏设备B”,则可以确定需要分别向大屏设备A和大屏设备B发送执行指令、历史意图和对应槽位等信息。然后,手机则将用户语音“向大屏设备A和大屏设备B推荐歌曲A”对应的执行指令、历史意图和对应槽位等信息分别发送至大屏设备A和大屏设备B。此时,历史意图包括从用户语音“向大屏设备A和大屏设备B推荐歌曲A”提取出的推荐音乐意图,对应槽位包括从用户语音“向大屏设备A和大屏设备B推荐歌曲A”提取出的歌曲名称槽位,且歌曲名称槽位的实体为歌曲A。
大屏设备A接收到来自手机的执行指令、历史意图和对应槽位等信息,则向用户B推荐歌曲A。同理,大屏设备B向用户C推荐歌曲A。这样,相较于用户通过两个语音命令分别给两个用户推荐歌曲,用户A通过一个语音命令,即可将歌曲同时推荐给大屏设备A和大屏设备B,便捷性较高,用户体验较好。
如果第一电子设备51从用户语音中提取不到目标设备槽位,则可以确定不需要向第二电子设备发送执行指令、历史意图和对应槽位等信息。此时,第一电子设备51在获得第一对话管理模块514输出的执行指令之后,则执行该执行指令,得到对应的执行结果。
例如,在图2的场景中,手机21从用户语音“小艺小艺,推荐一首歌曲”提取不 出目标设备槽位,则确定不需要向第二电子设备52发送执行指令、历史意图和对应槽位等信息。因此,在获得播放歌曲的执行指令之后,手机21则响应于该执行指令,自动播放歌曲相应的歌曲。
当第一电子设备51确定需要向第二电子设备52发送执行指令、历史意图和对应槽位等信息时,第一电子设备51除了将执行指令传输至第二电子设备52(即目标设备)之外,还将历史意图和对应的槽位等上下文信息传输至第二电子设备52。通常情况下,第一电子设备51可以将最近N轮对话的意图和对应槽位传输至第二电子设备52。N的值可以根据实际需要进行设定,示例性地,N=1,或N=3,或其他。例如,当N=3时,第一电子设备51可以将最近3轮对话的意图和对应槽位传输至第二电子设备52。
其中,最近N轮对话可以是指对话发生时间距离当前时间点最近的N轮对话,即最近发生的N轮对话。
例如,用户和手机上的语音助手小艺之间的部分对话可以如下表1所示。
表1
[表1原文为图像,未能还原。根据上下文,表1记录了用户与语音助手小艺之间的多轮对话,每轮对话包含对话发生时间、用户语音及对应意图,其中包括“小艺小艺,播放一首歌曲A”、“播放歌曲B”以及“推荐歌曲B给大屏设备”等对话。]
假设当前时间点为2021年6月7日20时30分,手机采集到“推荐歌曲B给大屏设备”之后,从该用户语音中提取出目标设备槽位,且目标设备槽位的实体为“大屏设备”,则确定需要向大屏设备发送执行指令和最近N轮对话的意图等信息。
此时,如果N=1,则根据对话发生时间,确定出距离当前时间点最近的1轮对话为“推荐歌曲B给大屏设备”。即手机将推荐歌曲B的执行指令和“推荐歌曲B给大屏设备”的意图信息(即推荐音乐)传输至大屏设备。进一步地,还可以将从用户语音提取出的歌曲名称槽位一并传输至大屏设备。
如果N=2,根据对话发生时间,确定出距离当前时间点最近的2轮对话分别为“推荐歌曲B给大屏设备”和“播放歌曲B”。手机将这两轮对话的意图和推荐歌曲B的执行指令传输至大屏设备。此时,手机传输给大屏设备的历史意图可以包括播放音乐和推荐音乐。
如果N=3,根据对话发生时间,确定出距离当前时间点最近的3轮对话分别为“推荐歌曲B给大屏设备”、“播放歌曲B”以及“小艺小艺,播放一首歌曲A”。手机可以将这3轮对话的意图和推荐歌曲B的执行指令传输至大屏设备。此时,手机传输给大屏设备的历史意图包括播放音乐和推荐音乐。
同理,如果N=4或者其他值,手机可以根据对话发生时间,确定出距离当前时间点最近的N轮对话,将这N轮对话中每轮对话的意图和推荐歌曲B的执行指令传输至大屏设备。
可以理解的是,当历史对话的轮数小于N时,第一电子设备51可以将所有历史对话的意图和对应槽位均传输至第二电子设备52。例如,当N=3时,但历史对话的轮数只有2轮,即用户和设备之间只发生了两轮对话,则将这两轮对话的意图均传输至对端设备。
在另一些实施例中,第一电子设备51也可以将对话发生时间位于目标时间点之后的对话的意图均传输至第二电子设备52。目标时间点可以是当前时间点和预设时间阈值之间的差值。预设时间阈值可以根据需要设定,例如,预设时间阈值为24小时、12小时、6小时或者1小时等。
例如,以上表1示出的对话数据为例,预设时间阈值为24小时,当前时间点为2021年6月7日20时30分。根据预设时间阈值和当前时间点,则可以确定出目标时间点为2021年6月6日20时30分。
此时,由于表1中对话的发生时间均位于2021年6月6日20时30分之后,手机将推荐歌曲B的执行指令和表1中所有对话的意图均传输至大屏设备。
又例如,如果预设时间阈值为1小时,目标时间点则为2021年6月7日19时30分。此时,对话发生时间位于2021年6月7日19时30分之后的对话包括“推荐歌曲B给大屏设备”、“播放歌曲B”以及“小艺小艺,播放一首歌曲A”。手机可以将推荐歌曲B的执行指令和这3轮对话的意图均传输至大屏设备。
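“按对话发生时间选取最近N轮对话”以及“按预设时间阈值过滤”这两种上下文选取策略,可以用如下Python草图示意(history的数据结构为假设):

from datetime import datetime, timedelta

def select_context(history: list, n: int = 3, max_age_hours: float = None) -> list:
    """history: 按发生时间升序排列的对话轮列表,每轮形如
    {"time": datetime, "intent": ..., "slots": ..., "text": ...}。
    返回随执行指令一并传输的上下文;历史轮数不足N时返回全部。"""
    rounds = history
    if max_age_hours is not None:
        cutoff = datetime.now() - timedelta(hours=max_age_hours)  # 目标时间点
        rounds = [r for r in rounds if r["time"] > cutoff]
    return rounds[-n:]   # 距离当前时间点最近的N轮对话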
当然,在又一些实施例中,第一电子设备51可以将所有的历史对话均传输至大屏设备。但是,如果历史对话轮数较多,传输的时候可能需要占用较大的带宽,使得传输时延增大,影响用户体验。
相较而言,第一电子设备51根据对话发生时间,将距离当前时间点最近的N轮对话的意图等信息传输至第二电子设备52,不仅可以使得第二电子设备52可以根据传输过去的意图等信息,准确识别新采集的用户语音的意图,还可以使得传输时延位于合理区域,用户体验较高。
当然,在某些情况下,如果第一电子设备51和第二电子设备52之间的带宽很大,第一电子设备51可以将所有的历史对话的意图等信息传输给第二电子设备52。
在又一些实施例中,第一电子设备51也可以将最近一轮对话的意图以及关联意图一并传输至第二电子设备52。其中,关联意图是指与最近一轮对话的意图关联的意图。例如,最近一轮对话为“推荐一首歌曲给大屏设备”,其意图为推荐音乐。与推荐音乐关联的意图包括播放音乐、搜索音乐等。
进一步地,除了可以将关联意图传输给第二电子设备52,还可以将关联意图对应的相关信息一并传输给第二电子设备52。例如,关联意图为播放音乐,关联意图对应的相关信息可以包括播放音乐这一意图对应的歌曲名称和歌手信息等。
需要说明的是,第一电子设备51除了将执行指令传输给第二电子设备52之外,还将用于描述用户语音意图的信息传输给第二电子设备52。
上述用于描述用户语音意图的信息可以具体为意图。第一电子设备51从用户语音中提取出意图,并将执行指令和意图一并传输给第二电子设备52。具体应用中,第一电子设备51可以将最近N轮对话的意图传输给第二电子设备52。
此时,第二电子设备52在采集到用户语音之后,第二语义理解模块523可以根据第一电子设备51发送的意图,以及第二语音识别模块输出的用户语音的文本信息,识别出用户语音的意图。
上述用于描述用户语音意图的信息可以为语料。第一电子设备51不将从用户语音中提取出的意图传输给第二电子设备52,而是将对话语料传输给第二电子设备52。其中,对话语料是指用户语音的文本。第一电子设备51在采集到用户语音之后,通过语音识别模块将用户语音转化成文本,得到对话语料。具体应用中,第一电子设备51可以将最近N轮对话的对话语料传输给第二电子设备52。例如,以表1的对话数据为例,N=3,第一电子设备51传输给第二电子设备52的对话语料包括:“推荐歌曲B给大屏设备”的文本、“播放歌曲B”的文本以及“小艺小艺,播放一首歌曲A”的文本。
此时,第二电子设备52在采集到用户语音之后,第二语义理解模块523可以根据第一电子设备51发送的语料,以及第二语音识别模块输出的用户语音的文本信息,识别出用户语音的意图。
例如,第一电子设备51发送给第二电子设备52的对话语料包括“推荐一首歌曲给大屏设备”,第二电子设备52新采集到的用户语音为“换一个”。第二语义理解模块523根据输入的对话语料“推荐一首歌曲给大屏设备”,以及用户语音的文本“换一个”,输出用户语音“换一个”的意图。用户语音“换一个”的意图为推荐音乐,该意图包括换歌单的槽位。
或者,第二电子设备52的第二语义理解模块523也可以先从第一电子设备51发送的语料中提取出意图,然后再根据新采集的用户语音和从对话语料中提取出的意图,识别出用户语音的意图。
例如,第一电子设备51发送给第二电子设备52的对话语料为“推荐一首歌曲给大屏设备”,第二电子设备52新采集到的用户语音为“换一个”。第二语义理解模块523可以先从“推荐一首歌曲给大屏设备”这一对话语料中,提取出“推荐音乐”的意图。然后第二语义理解模块523根据输入的意图“推荐音乐”,以及用户语音的文本“换一个”,输出用户语音“换一个”的意图。
上述用于描述用户语音意图的信息也可以同时包括意图和对话语料,即第一电子设备51在传输执行指令的时候,将最近N轮对话的对话意图和对话语料一并传输给 第二电子设备52。
此时,第二电子设备52的第二语义理解模块523可以根据需要,选择第一电子设备51发送的意图或者对话语料进行意图识别。
进一步地,为了提高第二电子设备52意图识别的准确性,实现更开放的跨设备对话接续,第一电子设备51向第二电子设备52发送执行指令的时候,除了将用于描述用户语音意图的信息一并传输给第二电子设备52之外,还可以将目标信息一并传输给第二电子设备52。
该目标信息可以根据需要设定。示例性地,该目标信息可以包括槽位。该槽位可以包括意图对应的槽位。第一电子设备51可以从用户语音中提取出意图以及意图对应的槽位。例如,从用户语音“推荐歌曲A给大屏设备”中可以提取出意图“推荐音乐”以及歌曲名称槽位,歌曲名称槽位的实体为歌曲A。
该槽位除了可以包括意图对应的槽位,还可以包括目标设备槽位。也即,第一电子设备51除了可以将意图对应的槽位传输给第二电子设备52之外,还可以将目标设备槽位一并传输给第二电子设备52。例如,针对用户语音“推荐歌曲A给大屏设备”,其包括“推荐音乐”对应的槽位为歌曲名称槽位,还包括目标设备槽位,此时将歌曲名称槽位和目标设备槽位一并传输至第二电子设备52。
可以理解的是,如果第一电子设备51可以从用户语音中提取出至少两个目标设备槽位,在传输目标设备槽位时,第一电子设备51可以将对应的目标设备槽位传输给对应的第二电子设备52。
例如,手机从用户语音“推荐一首歌曲给大屏设备和平板”中提取出两个目标设备槽位,手机可以将大屏设备对应的目标设备槽位传输给大屏设备,将平板对应的目标设备槽位传输给平板。
当然,第一电子设备51如果将对话语料一并传输给第二电子设备52,第二电子设备52可以直接从对话语料中提取出意图对应的槽位和/或目标设备槽位,这样第一电子设备51可以不用将槽位传输至第二电子设备52。
该目标信息可以包括用户画像信息、用户实时位置信息等用户信息。用户画像信息示例性包括用户性别、用户喜好信息以及用户职业等信息。第一电子设备51可以根据收集的用户信息,生成用户画像。
第二电子设备52可以根据第一电子设备51发送的用户信息,确定出用户喜好、职业等个人相关的信息,并据此给用户提供更加个性化的服务。例如,如果第一电子设备51和第二电子设备52登录的是同一个用户账号,第二电子设备52可以根据第一电子设备51传输的用户信息,给用户推荐符合用户偏好、符合用户职业的歌曲。
该目标信息也可以包括场景信息,该场景信息用于描述用户当前所处场景(例如,当前所处位置、当前所处状态等)。第二电子设备52可以根据场景信息,给用户提供更加个性化、更加准确的服务,以实现更加开放的跨设备的对话业务接续。
例如,第一电子设备51通过自身集成的加速度传感器采集到的加速度信息,确定出用户处于行走状态,将表征用户当前处于行走状态的信息传输至第二电子设备52,第二电子设备52可以得知用户当前正在行走,即确定出用户处于运动场景;然后,在给用户推荐歌曲的时候,可以给用户推荐运动场景下的歌曲。
又例如,第一电子设备51和第二电子设备52如果登录的是同一个用户账号,在推荐歌曲场景下,第二电子设备52根据用户实时位置信息,确定出用户当前处于家中,然后给用户推荐符合用户喜好的歌曲。
该目标信息也可以包括应用状态信息。应用状态信息是指目标应用的相关信息,该目标应用通常是前台运行的应用程序。例如,手机正在播放音乐,目标应用则为音乐应用程序。当然,目标应用也可以不是前台运行的应用程序。
目标应用的相关信息可以根据实际应用场景设定。例如,在播放音乐场景,目标应用为音乐应用程序,目标应用的相关信息则可以包括用户播放记录,该用户播放记录包括歌曲名称和播放时间等信息。第一电子设备51通过将歌曲名称等信息传输给第二电子设备52,第二电子设备52在采集到用户语音之后,可以根据歌曲名称等信息,识别出用户语音中的歌曲名称。又例如,在导航场景,目标应用包括日程应用程序和导航应用程序,目标应用的相关信息则可以包括用户日程信息以及用户导航历史记录信息等。第一电子设备51通过将用户日程和导航历史记录等信息传输给第二电子设备52,第二电子设备52在采集到用户语音之后,可以根据这些信息,识别出用户语音中的地点信息。
由上可见,第二电子设备52可以根据应用状态信息,更准确地识别新采集的用户语音,实现更加开放的跨设备对话业务接续。此外,在一些情况下,第二电子设备52还可以根据应用状态信息,给用户提供更个性化、更精准的服务。例如,应用状态信息包括歌曲的历史播放信息,第二电子设备52可以根据歌曲的历史播放信息,给用户推荐更符合用户偏好的歌曲。
第一电子设备51传输给第二电子设备52的应用状态信息可以是与最近一轮对话的意图关联的信息。例如,最近一轮对话的意图为推荐音乐,手机上安装有至少两个音乐应用程序,与推荐音乐关联的应用状态信息包括这至少两个音乐应用程序的相关信息。这至少两个音乐应用程序可以同时处于运行状态,也可以不处于运行状态。
目标应用的相关信息还可以包括前台应用或正在运行的应用程序标识。第二电子设备52根据该应用程序标识,判断本地是否有相同的应用程序,如果有,则使用相同的应用程序执行第一执行指令;如果没有,则使用类似的应用程序执行第一执行指令。
示例性地,在推荐音乐场景,手机和大屏设备均包括多个播放音乐的音乐应用程序,此时,手机上安装的音乐应用程序包括华为音乐、应用1和应用2。
某个时刻,手机正在使用华为音乐播放歌曲。手机采集到用户语音“推荐一首歌曲给大屏设备”,则生成该用户语音对应的推荐音乐的执行指令,并将推荐音乐的执行指令、历史意图、对应槽位和应用状态信息传输给大屏设备。应用状态信息包括华为音乐的应用程序标识,还包括华为音乐、应用1和应用2这3个音乐应用程序的相关信息,例如,播放记录、歌曲名称和用户喜好歌曲等。
大屏设备在接收到应用状态信息之后,先根据华为音乐的应用程序标识,确定本地是否安装有华为音乐。如果有,则大屏设备的语音助手则响应于推荐音乐的执行指令,使用本地的华为音乐播放相应的歌曲。如果本地没有安装有华为音乐,大屏设备则可以使用其他的音乐应用程序播放相应的歌曲。
相较而言,第一电子设备51将应用状态信息传输至第二电子设备52,第二电子设备52可以根据应用状态信息,优先使用与目标应用相同的应用程序执行第一执行指令,使得业务流转更加自然,不突兀,用户体验更佳。
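上述“优先使用相同应用、本地没有时退回同类应用”的选择逻辑,可以用如下Python草图示意(应用对象的字段名均为假设):

def choose_app(local_apps: dict, remote_app_id: str, category: str):
    """local_apps: 应用程序标识 -> 本地应用对象;category: 目标应用类别,如"音乐"。"""
    if remote_app_id in local_apps:          # 本地装有与源设备目标应用相同的应用
        return local_apps[remote_app_id]
    for app in local_apps.values():          # 否则使用类似(同类别)的应用程序
        if app.get("category") == category:
            return app
    return None                              # 没有可用应用时交由上层提示用户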
需要说明的是,目标信息可以包括槽位、用户信息、场景信息和应用状态信息中的任意一个或任意组合。
也就是说,在一些实施例中,第一电子设备51在将业务转接给第二电子设备52时,为了让第二电子设备52可以准确识别用户语音,实现跨设备的对话业务接续,可以将执行指令和用于描述用户语音意图的信息传输给第二电子设备52。在另一些实施例中,进一步地,为了让第二电子设备52的意图识别准确率更高,实现更加开放的跨设备对话业务接续,第一电子设备51可以将执行指令、用于描述用户语音意图的信息、以及目标信息一并传输给第二电子设备52。图5中示例性示出了第一电子设备51向第二电子设备52发送的信息包括执行指令、历史意图和对应槽位等。
具体应用中,第一电子设备51在向第二电子设备52发送执行指令的时候,携带的信息越多,所占用的带宽可能就越高。而第一电子设备51和第二电子设备52之间的带宽不可能无限制地大,所以如果携带的信息过多,可能会增加传输时延,影响用户体验。基于此,可以根据需要选择发送执行指令时所携带的信息。
第二电子设备52通过第二指令交互服务525接收来自第一电子设备51的信息,该信息示例性包括执行指令、历史意图和对应槽位等信息;然后,第二电子设备52将该执行指令传递至第二应用程序521,第二应用程序521可以响应于该执行指令。另外,第二电子设备52还将接收到的历史意图和对应槽位等信息存储在本地。
在一些实施例中,第二电子设备52在接收到来自第一电子设备51的执行指令之后,可以自动唤醒本设备的语音助手,这样,用户可以不用通过唤醒词唤醒第二电子设备上的语音助手,用户可以直接向第二电子设备52输入对应的用户语音,从而使得跨设备的对话业务接续更加流畅,用户体验更高。
当然,在另一些实施例中,第二电子设备52在接收到第一电子设备51的执行指令之后,也可以不自动唤醒本设备的语音助手,而是采集到用户输入的特定唤醒词后再唤醒本设备的语音助手。
需要说明的是,第二电子设备52在接收到来自第一电子设备的历史意图和对应槽位等上下文信息之后,第二语义理解模块523可以将第一电子设备51发送的上下文信息放到自身的历史上下文中,并将第一电子设备51发送的上下文信息作为最新的上下文。这样,第二电子设备52在采集到新的用户语音时,可以根据最新的上下文信息,识别新采集的用户语音的意图。
示例性地,上下文信息可以包括历史意图、对应槽位和对话语料,可以包括历史意图和对应槽位,也可以包括历史意图和对话语料。
或者,第二电子设备52在接收到来自第一电子设备51的上下文信息之后,也可以根据接收到的上下文信息,创建一个新的会话。该新的会话内包括第一电子设备51发送的历史意图、对应槽位和对话语料等信息。这样,第二电子设备52在采集到新的用户语音时,第二语义理解模块523可以根据会话创建时间,使用最新创建的会话包含的信息,准确识别新采集的用户语音的意图。
或者,第二电子设备52也可以将接收到的上下文信息的优先级设置为最高优先级。这样,第二电子设备52在采集到新的用户语音时,第二语义理解模块523可以根据最高优先级的上下文信息,准确识别新采集的用户语音的意图。
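上述“将来自源设备的上下文作为语义理解模块的最新上下文(或最高优先级上下文)”的处理方式,可以用如下Python草图示意(接口均为假设):

class NluContextStore:
    def __init__(self):
        self._stack = []                 # 历史上下文,越靠后越新

    def merge_remote(self, remote_ctx: dict):
        """接收到来自第一电子设备的历史意图、槽位、语料后,追加为最新上下文。"""
        self._stack.append(remote_ctx)

    def latest(self) -> dict:
        return self._stack[-1] if self._stack else {}

# 语义理解时优先使用最新上下文(示意调用):
# semantic = nlu.understand(text, context=store.latest())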
还需要说明的是,第二电子设备52在接收第一电子设备51发送的执行指令之后,可以先判断当前是否有正在进行的任务,如果当前没有正在进行的任务,第二电子设备52可以执行第一电子设备51发送的执行指令;如果当前有正在进行的任务,可以等待当前任务执行完成后,再执行第一电子设备51发送的执行指令,也可以进一步判断当前任务的剩余时间,如果剩余时间小于一定阈值,则可以等待当前任务执行完毕后再执行第一电子设备51发送的执行指令,反之,可以中断当前任务,执行第一电子设备51发送的执行指令。这样,可以让对话业务接续更及时,用户体验更佳。
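上述“结合当前任务及其剩余时间,决定立即执行、等待或中断”的判断过程,可以用如下Python草图示意(任务对象的接口与阈值取值均为假设):

def on_remote_instruction(instruction, current_task, remain_threshold_s: float = 30.0):
    if current_task is None:                 # 当前没有正在进行的任务,直接执行
        return instruction.execute()
    if current_task.remaining_seconds() < remain_threshold_s:
        current_task.wait_until_done()       # 剩余时间较短:等当前任务执行完毕
    else:
        current_task.interrupt()             # 否则中断当前任务
    return instruction.execute()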
第二应用程序521响应于该执行指令,得到对应的执行结果之后,对话业务则从第一电子设备51流转至第二电子设备52。之后,第二电子设备52采集到用户语音之后,可以根据第一电子设备51传输的历史意图和槽位等信息,准确识别用户语音的意图。
第二电子设备52采集到用户语音之后,第二应用程序521将该用户语音输入至第二语音识别模块522,获得第二语音识别模块522输出的文本信息;然后,将第二语音识别模块522输出的文本信息输入至第二语义理解模块523,第二语义理解模块523根据第二语音识别模块522输出的文本信息、第一电子设备51传输的历史意图和槽位等信息,提取出用户语音的意图和槽位;最后,将第二语义理解模块523输出的语义信息输入至第二对话管理模块524,获得执行指令。第二应用程序521获得第二对话管理模块524输出的执行指令之后,响应于该执行指令,得到该执行指令对应的执行结果。其中,第二对话管理模块524可以根据需要,选择所需要的信息生成对应的执行指令。
例如,在推荐音乐场景下,第二对话管理模块524输出推荐歌曲的执行指令。此时,第二对话管理模块524的输入可以包括语义信息等。音乐应用程序接收到推荐歌曲的执行指令之后,可以根据来自第一电子设备51的用户信息和应用状态信息等,确定出推荐歌曲为歌曲A。
也就是说,第二对话管理模块524输出的执行指令不包括推荐歌曲的信息,而是由音乐应用程序确定推荐歌曲的。
在另一些实施例中,第二对话管理模块524可以输出推荐歌曲A的执行指令。此时,第二对话管理模块524的输入可以包括语义信息和推荐歌曲的名称。该推荐歌曲的名称可以是系统根据来自第一电子设备51的用户信息和应用状态信息等确定出来的。音乐应用程序在接收到推荐歌曲A的执行指令之后,不用执行推荐操作,自动推荐歌曲A。
也就是说,第二对话管理模块524输出的执行指令包括推荐歌曲的信息。
同理,第一电子设备51一侧的第一对话管理模块514输出的执行指令,可以包括推荐歌曲的信息,也可以不包括推荐歌曲的信息。
如果第一电子设备51传输给第二电子设备52的信息中包括目标信息,第二电子设备52可以根据目标信息给用户提供更加个性化、更精准的服务,实现更加开放的跨设备对话业务接续,提高用户体验。
例如,在推荐歌曲场景,第二电子设备52根据用户信息中的用户职业信息和用户实时位置信息,给用户推荐更符合用户身份和当前位置的歌曲。具体地,第二电子设备52根据用户职业信息,确定出用户职业为教师;根据用户实时位置信息,确定用户当前所在位置为学校。基于用户职业和用户当前所处位置,第二电子设备52则给用户推荐儿童歌曲。此时,第一电子设备51和第二电子设备52的用户账号不是同一个用户。
或者,如果第二电子设备52根据用户实时位置信息,确定用户当前所在位置为家中。基于用户当前所处位置,第二电子设备52则给用户推荐符合用户喜好的歌曲。此时,第一电子设备51和第二电子设备52的用户账号是同一个用户。
由上可见,当业务需要从第一电子设备51转接至第二电子设备52时,第一电子设备51将执行指令、历史意图和槽位等信息传递给第二电子设备52,以便于第二电子设备52可以根据第一电子设备51传递的历史意图和槽位等信息,识别用户语音的意图,实现跨设备的对话业务接续。
示例性地,下面结合图6A、图6B和图7,对手机给大屏设备推荐音乐的场景进行介绍说明。图6A和图6B为本申请实施例提供的手机给大屏设备推荐音乐的场景示意图,图7为本申请实施例提供的手机给大屏设备推荐音乐的流程示意图。
如图6A和图6B所示,第一电子设备为手机62,第二电子设备为大屏设备64。手机62和大屏设备64上均安装有语音助手,并部署有图1对应的语音业务系统。
如图7所示,该流程可以包括以下步骤:
步骤S701、手机62采集用户61的第一用户语音。
此时,第一用户语音具体为“推荐一首歌曲给大屏设备”。
示例性地,用户61通过特定的唤醒词,唤醒手机62的语音助手,然后,用户61跟手机62的语音助手说“推荐一首歌曲给大屏设备”,手机62通过麦克风等声音采集装置,采集到语音数据“推荐一首歌曲给大屏设备”。
步骤S702、手机62将第一用户语音转化成第一文本。
可以理解的是,手机62内部署有图1对应的语音业务系统,或者该语音业务系统的部分模块。
以图5为例,手机62包括第一语音识别模块512、第一语义理解模块513和第一对话管理模块514。手机62采集到第一用户语音“推荐一首歌曲给大屏设备”之后,手机62的语音助手将该第一用户语音输入至第一语音识别模块512,第一语音识别模块512将第一用户语音转化成第一文本“推荐一首歌曲给大屏设备”。
步骤S703、手机62从第一文本提取出第一意图和第一槽位。
其中,第一槽位为第一意图所配置的槽位。
示例性地,第一语音识别模块512获得第一文本“推荐一首歌曲给大屏设备”之后,将第一文本“推荐一首歌曲给大屏设备”输入至第一语义理解模块513。第一语义理解模块513对第一文本进行语义理解,输出第一意图和第一槽位。此时,第一意图为推荐音乐,第一槽位包括目标设备槽位,此时,目标设备槽位的实体为大屏设备64。当然,第一槽位除了可以包括目标设备槽位,还可以包括其他槽位,例如,如果用户语音中包含歌曲名称,则第一槽位还包括歌曲名称槽位。
如图6A所示,手机62在识别出第一用户语音的意图之后,可以显示语音助手界面63。语音助手界面63上显示有第一文本“推荐一首歌曲给大屏设备”,以及针对第一用户语音的回答文本“好的”。
当第一槽位包括目标设备槽位,且目标设备槽位的实体不是手机62,手机62则确定需要向大屏设备64发送执行指令、历史意图和对应槽位等信息。
步骤S704、手机62根据第一意图和第一槽位,生成第一用户语音对应的第一执行指令。
示例性地,第一语义理解模块513获得第一意图和第一槽位之后,将第一意图和第一槽位等信息输入至第一对话管理模块514。第一对话管理模块514根据第一意图和第一槽位等信息,输出第一执行指令。此时,第一执行指令为用于推荐音乐的指令。
步骤S705、手机62向大屏设备64发送第一执行指令、历史意图和对应槽位等信息。该历史意图包括第一意图,对应槽位是指历史意图对应的槽位,其包括第一槽位。第一槽位可以包括目标设备槽位,也可以不包括目标设备槽位。
需要说明的是,如果除了第一用户语音之外,还包括其他的历史对话,则该历史意图则包括其他的历史对话对应的意图。例如,历史意图包括最近3轮对话的意图,每轮对话均有其对应的意图。此时,手机62和用户61之间只进行了一轮对话,则历史意图可以只包括第一意图。相对应地,对应槽位则包括第一意图对应的第一槽位。
除了第一执行指令、历史意图和对应槽位之外,手机62还可以将对话语料、用户信息、场景信息和应用状态信息等一并传递给大屏设备64。示例性地,手机62传递给大屏设备64的信息可以如表2所示。
表2
[表2原文为图像,未能还原。根据下文描述,表2列出了手机62传递给大屏设备64的信息项,包括:第一执行指令、历史意图、历史意图对应的槽位、对话语料、用户信息、场景信息以及应用状态信息。]
更具体地,手机62给大屏设备64传递的信息可以如下:
[原文此处为图像形式的信息内容,未能还原;各字段的含义见下文说明。]
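由于原图像未能还原,根据下文对nluResult、intentNumber、intentName、slots和orgAsrText等字段的说明,该信息大致为如下JSON结构(仅为根据上下文重构的示意,字段层次、intentNumber的取值等细节为推测):

{
  "nluResult": [
    {
      "intentNumber": 1,
      "intentName": "推荐音乐",
      "slots": [
        { "name": "设备名称", "value": "设备B" }
      ]
    }
  ],
  "orgAsrText": "推荐一首歌给大屏设备"
}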
其中,nluResult是指意图识别结果,此时具体为手机62一侧的意图识别结果,该意图识别结果包括意图和槽位。intentNumber是指意图序号,intentName是指意图名称,此时,意图名称为推荐音乐。slots是指槽位,此时,槽位的名称为设备名称,槽位的具体参数是设备B,设备B此时具体为大屏设备64。
orgAsrText是指语音识别模块输出的文本信息,此处具体为手机62一侧的语音识别结果,该文本信息具体为“推荐一首歌给大屏设备”。
也就是说,手机62发送给大屏设备64的信息可以包括第一执行指令、历史意图、历史意图对应的槽位、对话语料信息、用户信息、场景信息和应用状态信息等。此处,对话语料包括语料“推荐一首歌曲给大屏设备”。手机62此时前台运行的应用程序为音乐应用程序,应用状态信息则可以包括用户播放记录等信息,用户信息可以包括用户画像和用户实时位置等信息,场景信息可以包括用于表征用户行走的信息。
手机62向大屏设备64转接业务的时候,将历史意图和对应槽位等信息同步至大屏设备64,这样大屏设备64在后续的对话中,可以根据手机62同步的历史意图和槽位等信息,识别出新输入的用户语音的意图,实现跨设备的对话接续。
步骤S706、大屏设备64执行第一执行指令。
示例性地,大屏设备64在接收到第一执行指令之后,将该第一执行指令传递至大屏设备64的语音助手,大屏设备64的语音助手响应于该第一执行指令,得到对应的执行结果。如图6A所示,大屏设备64执行第一执行指令,在界面上显示窗口65,窗口65上显示有提示信息,用于提示用户是否播放手机推荐的歌曲A。并且,窗口65上还显示有“播放”和“取消”两个选项按钮。
步骤S707、大屏设备64采集用户61的第二用户语音。
如图6B所示,大屏设备64在显示窗口65之后,用户61向大屏设备64输入第二用户语音。此时,第二用户语音具体为“换一个”。
此时,大屏设备64在接收到第一执行指令之后,可以自动唤醒本设备的语音助手,这样,用户可以直接向大屏设备输入第二用户语音,不用通过特定唤醒词唤醒大屏设备64的语音助手。
步骤S708、大屏设备64将第二用户语音转化成第二文本。
可以理解的是,大屏设备64内部署有图1对应的语音业务系统,或者该语音业务系统的部分模块。
以图5为例,大屏设备64包括第二语音识别模块522、第二语义理解模块523和第二对话管理模块524。大屏设备64采集到第二用户语音“换一个”之后,大屏设备64的语音助手将该第二用户语音输入至第二语音识别模块522,第二语音识别模块522将第二用户语音转化成第二文本“换一个”。
步骤S709、大屏设备64根据历史意图和对应槽位等信息,从第二文本中提取第二意图和第二槽位。
示例性地,第二语音识别模块522获得第二文本“换一个”之后,将第二文本“换一个”输入至第二语义理解模块523。第二语义理解模块523根据第二文本“换一个”和历史意图“推荐音乐”等信息进行语义识别,得到第二意图和第二槽位。此时,第二意图为推荐音乐,第二槽位可以包括设备槽位和换歌单槽位,此时,设备槽位的实体为大屏设备64。其中,由于“换一个”中不包含目标设备,设备槽位的实体默认为本设备。或者,第二槽位也可以不包括设备槽位,此时则默认在本设备执行第二执行指令。
大屏设备64根据来自手机62的历史意图“推荐音乐”等信息,可以识别出“换一个”的意图为“推荐音乐”,即“换一个”的意图继承了历史意图“推荐音乐”。
需要说明的是,如果第二用户语音中包括了明确的意图,大屏设备64可以不用根据来自手机的历史意图等信息,识别第二用户语音的意图。例如,第二用户语音为“播放歌曲A”,该用户语音中明确了“播放音乐”的意图,大屏设备64不用根据来自手机62的历史意图“推荐音乐”,即可识别出“播放歌曲A”的意图为“播放音乐”。
如果第二用户语音中没有包括明确的意图,并且,大屏设备64根据来自手机62的历史意图“推荐音乐”仍然不能识别该用户语音的意图,大屏设备64则可以向用户发出交互语音“我不能理解您的意思,请给我多一些时间学习您的习惯”。
其中,大屏设备64在对第二文本进行语义理解时,可以根据需要选择对应的信息,例如,大屏设备64可以只根据历史意图对第二文本进行语义理解,也可以根据历史意图和对应槽位,对第二文本进行语义理解。当然,如果手机62还将用户信息、应用状态信息、对话语料和场景信息中的一种或多种一并传输给大屏设备64,大屏设备64也可以根据这些信息进行语义理解。例如,大屏设备64可以根据历史意图、对话语料、对应槽位和应用状态信息中的歌曲名称信息,对第二文本进行语义理解。
步骤S710、大屏设备64根据第二意图和第二槽位,生成第二用户语音对应的第二执行指令。
示例性地,第二语义理解模块523得到第二意图和第二槽位之后,将第二意图和第二槽位输入至第二对话管理模块524。第二对话管理模块524根据第二意图和第二槽位,生成第二执行指令。此时,第二执行指令为推荐音乐的指令。
步骤S711、大屏设备64执行第二执行指令。
示例性地,大屏设备64的语音助手获取到第二对话管理模块524输出的第二执行指令之后,响应于该第二执行指令,得到相应的执行结果。
如图6B所示,大屏设备64执行第二执行指令,显示窗口66,窗口66中显示有提示信息,用于询问用户是否播放歌曲B。歌曲B是大屏设备64根据推荐规则,确定出的推荐歌曲。
当然,在另一些实施例中,大屏设备也可以在播放歌曲B之前,通过语音提示用户即将要播放歌曲B。示例性地,语音提示信息为“好的,即将为您播放歌曲B”。
或者,大屏设备在确定出需要播放歌曲B之后,也可以直接播放歌曲B,不用提示用户。
相较而言,本申请实施例给出执行第二执行指令的提示信息(例如,提示窗口、提示语音等),可以让跨设备的对话业务接续不突兀,更具人性化,用户体验更佳。
需要说明的是,手机62传递给大屏设备64的信息除了历史意图和对应槽位之外,还可以包括其他信息,根据这些信息,大屏设备64可以更准确地识别出用户语音的意图,实现更开放的跨设备对话接续,还可以根据用户提供更加个性化、更精确的服务,以提高用户体验。
例如,用户61不喜欢大屏设备64推荐的歌曲B,则再次向大屏设备64输入第三用户语音“换xxxx”,“xxxx”是歌曲名称。大屏设备64在接收第三用户语音之后,可以根据手机62发送的音乐应用程序的历史播放记录,识别出“xxxx”是歌曲名称,进而准确识别出第三用户语音的意图为推荐音乐。历史播放记录包括歌曲名称等信息。
需要说明的是,用户61想要播放的歌曲可能是新出的歌曲,大屏设备64上还没有该歌曲的信息。如果不将手机62上的音乐应用程序的历史播放记录传递给大屏设备64,大屏设备64可能识别不出“xxxx”是歌曲名称,进而无法识别第三用户语音的意图。
如果第一电子设备和第二电子设备登录的账号是同一个用户账号,第二电子设备可以根据第一电子设备传输过来的用户信息和应用状态信息等,给用户提供服务。
例如,在图6A~图6B的场景中,手机62和大屏设备64登录的是同一个华为用户账号。手机62传输给大屏设备64的信息包括第一执行指令、历史意图、对应槽位、用户信息、场景信息和应用状态信息。其中,用户信息包括用户画像信息和用户实时位置信息。应用状态信息包括手机62上音乐应用程序的相关信息。场景信息包括用于表征用户行走的信息。
大屏设备64在接收到用户语音“换一个”之后,根据手机62发送的历史意图等信息,识别用户语音“换一个”的意图为“推荐音乐”,并基于“推荐音乐”的意图、用户信息、场景信息和应用状态信息等,生成用于推荐音乐的第二执行指令。
大屏设备64根据手机62发送的用户信息和应用状态信息,确定出推荐歌曲为歌曲B,生成用于推荐歌曲B的第二执行指令。
具体地,大屏设备64根据用户信息中的用户画像信息,可以得知手机62和大屏设备64的用户是一名教师,用户偏好的歌曲类型为流行音乐;根据用户信息中的实时位置信息,可以得知该用户当前所在位置为家里;另外,还根据场景信息可以确定出用户当前处于行走状态,即用户当前所处场景为运动场景。此时,由于用户处于运动场景,则给用户推荐运动场景的歌曲;并且,由于用户正在家里,则给用户推荐符合用户偏好的歌曲。也即,给用户推荐歌曲时,需要推荐既适合运动场景、又符合用户偏好的歌曲。进一步地,大屏设备64还根据应用状态信息中的用户历史播放记录等信息,确定出用户最近7天播放次数大于预设次数阈值的待选歌曲集合。最后,大屏设备64从待选歌曲集合中,筛选出一首或多首运动场景的流行歌曲作为推荐歌曲。此时,确定出歌曲B作为推荐歌曲。
可选的,如果第一电子设备和第二电子设备登录的账号不是同一个用户账号,第一电子设备可以不用发送用户信息和场景信息给第二电子设备。例如,第一电子设备在向第二电子设备分发第一执行指令、历史意图和对应槽位等信息时,判断第二电子设备与第一电子设备不是同一用户的电子设备,则不发送本设备记录的用户信息和场景信息等个人相关的信息。此时,第二电子设备可以根据本设备记录的用户信息和场景信息,给用户提供服务。
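上述结合用户画像、场景信息与历史播放记录筛选推荐歌曲的过程,可以用如下Python草图示意(字段名与播放次数阈值均为假设):

from collections import Counter

def pick_songs(candidates, profile, scene, recent_plays, min_plays=3):
    """candidates: 歌曲列表,每首形如{"name":..., "genre":..., "tags":[...]};
    profile: {"favorite_genres": [...]};scene: 如"运动";
    recent_plays: 最近7天播放记录中的歌曲名列表。"""
    play_count = Counter(recent_plays)
    pool = [s for s in candidates if play_count[s["name"]] > min_plays]  # 待选歌曲集合
    return [s for s in pool
            if scene in s["tags"] and s["genre"] in profile["favorite_genres"]]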
例如,还以图6A~图6B的场景为例,手机62和大屏设备64登录的不是同一个华为用户账号,此时,手机62登录的账号是用户A的账号,大屏设备64登录的账号为用户B的账号。
手机62传输给大屏设备64的信息包括第一执行指令、历史意图以及对应槽位。大屏设备64在接收到用户语音“换一个”之后,根据手机62发送的历史意图等信息,识别用户语音“换一个”的意图为“推荐音乐”,并基于“推荐音乐”的意图、本设备的应用状态信息和用户信息等,生成用于推荐音乐的第二执行指令。
大屏设备64响应于用于推荐歌曲E的第二执行指令,根据本设备的用户信息和本设备的场景信息,确定出推荐歌曲为歌曲E。
具体地,大屏设备64根据本设备场景信息中的定位信息,确定出当前所处位置为学校。并且,根据本设备的用户信息,确定大屏设备64的用户是一名教师,或者大屏设备64是学校的设备。基于此,大屏设备64在推荐歌曲的时候,推荐更加符合学生身份和学生偏好的歌曲,例如,推荐儿童歌曲。
当然,第一电子设备51在判断出第二电子设备登录的用户账号与本设备不是同一个用户时,也可以将本设备的应用状态信息和用户信息传输至第二电子设备。第二电子设备52在给用户提供服务的时候,可以只将来自第一电子设备51的其中一部分的应用状态信息或用户信息,作为提供服务的依据信息。
通过对比可知,手机62通过将应用信息和用户信息等一并传输给大屏设备64,使得大屏设备64可以基于手机62发送的应用信息和用户信息,给用户推荐最佳的歌曲(即在家里时推荐歌曲B,在学校时推荐歌曲E),以给用户提供更加个性化、精确化的服务,提高用户体验。
由上可见,手机62将历史意图和对应槽位等信息同步至大屏设备64,大屏设备64根据手机62同步的信息,准确识别出新采集的用户语音的意图,实现跨设备的对话业务接续。
需要说明的是,在图6A、图6B和图7示出的场景中,当用户语音中缺少一些关键信息,手机62可以与用户61进行一轮或多轮对话,以收集所需的信息。例如,大屏设备有两个,一个是客厅的大屏设备,另一个是卧室的大屏设备。此时,手机62在采集到第一用户语音“推荐一首歌曲给大屏设备”之后,手机62不确定用户说的大屏设备是哪一个大屏设备,因此可以输出语音“是推荐给客厅的大屏设备,还是卧室的大屏设备”。用户61可以针对手机62的输出语音,向手机62输入对应的语音。比如,用户61针对手机62输出的语音“是推荐给客厅的大屏设备,还是卧室的大屏设备呢?”,向手机62输入语音“客厅的大屏设备”。这样,手机62明确了用户是想向客厅的大屏设备推荐歌曲。
当然,手机62也可以在界面上显示文字提示信息,例如,该文字提示信息包括“客厅的大屏设备”和“卧室的大屏设备”两个选项,用户61可以根据需要选择其中一个选项。
本申请实施例提供的跨设备的对话业务接续方案除了可以应用于上文示出的音乐推荐场景,还可以应用于其他场景。
例如,参见图8A~图8C,示出了本申请实施例提供的导航场景示意图。如图8A所示,在车辆行驶过程中,用户通过手机81进行导航,通过车机82播放音乐。此时,手机81显示导航页面811,车机82显示音乐播放界面821。此时,第一电子设备具体为手机81,第二电子设备为车机82,手机81和车机82上均部署有图1对应的语音业务系统,并且设置有语音助手。
具体应用中,用户可以通过唤醒词“小艺小艺”,唤醒手机81的语音助手,然后,向手机81输入用户语音“导航到地点A”。手机81基于图1对应的语音业务系统,确定出用户语音的意图为导航,目标地点槽位的实体为地点A,并且生成对应的执行指令;手机81响应于该执行指令,打开导航应用,得出一条从用户当前位置至地点A的路线,并显示导航界面811。
在某个时刻(例如,手机81的电量快用完了),用户想要将导航任务从手机81接续至车机82,即使用车机82进行导航。用户可以向手机81输入用户语音“将当前导航任务接续至车机”。手机81基于图1对应的语音业务系统,确定用户语音“将当前导航任务接续至车机”的意图为导航任务接续,且确定目标设备槽位的实体是车机,生成对应的执行指令。另外,由于从用户语音“将当前导航任务接续至车机”中可以提取出目标设备槽位,且目标设备槽位的实体为车机,则可以确认需要转接业务。
手机81将用户语音“将当前导航任务接续至车机”对应的执行指令、历史意图和对应槽位等信息传输至车机82,此时,历史意图包括用户语音“导航到地点A”对应的意图,以及用户语音“将当前导航任务接续至车机”的意图。相对应地,历史意图的对应槽位包括从用户语音“导航到地点A”提取出的槽位,以及从用户语音“将当前导航任务接续至车机”中提取出的槽位。此时,执行指令包括导航路线信息。
当然,手机81还可以将对话语料,以及导航应用状态信息和日程应用状态信息等一并传输至车机82,此时,对话语料包括语料“导航到地点A”以及语料“将当前导航任务接续至车机”。导航应用状态信息包括用户的历史导航目标地点以及历史导航路线等,日程应用状态信息包括用户日程事项信息。
车机82接收到来自手机81的执行指令、历史意图和对应槽位等信息之后,响应于该执行指令,显示如图8B所示的导航界面822,并将历史意图和对应槽位等信息存储在本地,并将历史意图、对应槽位和对话语料等信息作为本地的语义理解模块的最新上下文。此时,手机81将导航任务接续至车机82之后,可以退出导航界面,显示如图8B所示的主界面812,或者也可以处于息屏状态。
用户使用车机82进行导航的过程中,可能由于某种原因(例如,当前导航路线堵车了),需要重新规划路线。此时,用户可以向车机82输入用户语音“重新规划”。车机82采集到用户语音“重新规划”之后,先通过语音识别模块将该语音转换成文本,再将文本输入至语义理解模块。语义理解模块根据最新的上下文信息,即用户语音“导航到地点A”的意图、语料“导航到地点A”,以及用户语音“重新规划”的文本,确定出用户语音“重新规划”的意图为规划当前位置至地点A的导航路线,起点槽位的实体为当前位置,终点槽位的实体为地点A。语义理解模块将识别出的意图和槽位等信息输入至对话管理模块,对话管理模块根据意图和槽位等信息,输出执行指令。
车机82响应于用户语音“重新规划”的执行指令,重新规划出一条从当前位置至地点A的导航路线,并显示如图8C所示的导航界面823。
由上可见,车机导航场景下,用户可以通过语音,控制手机81将导航任务转接至车机82,并且在转接业务的时候,将最近多轮对话的意图、语料和槽位等信息,一并传输至车机82。这样,车机82可以根据最近多轮对话的意图、语料和槽位等信息,准确识别出用户语音的意图,实现了手机81和车机82之间的跨设备对话业务接续。
又例如,参见图8D示出的本申请实施例提供的视频推荐场景示意图,如图8D所示,手机84显示视频播放界面841,视频播放界面841上显示当前正在播放的是视频1。此时,用户83想要将视频1推荐给大屏设备85,即想要将视频播放任务接续至大屏设备85,使用大屏设备85播放视频1,故对手机84说“推荐视频1给大屏设备”。
手机84采集到用户语音“推荐视频1给大屏设备”之后,基于图1对应的语音业务系统对用户语音“推荐视频1给大屏设备”进行处理,确定出用户语音“推荐视频1给大屏设备”的意图和槽位,并生成对应的执行指令。此时,用户语音“推荐视频1给大屏设备”的意图为推荐视频,目标设备槽位为大屏设备85,视频名称槽位的实体为视频1。对应的执行指令为播放视频的指令。
由于从用户语音中可以提取出目标设备槽位,且目标设备槽位的实体不是本设备,手机84确定需要将视频播放任务流转至大屏设备85,则向大屏设备85发送用户语音“推荐视频1给大屏设备”的意图、槽位、对应的执行指令和语料“推荐视频1给大屏设备”。
此时,大屏设备85显示界面851,界面851上显示大屏设备85正在播放另一个视频。大屏设备85接收到来自手机84的信息之后,执行用户语音“推荐视频1给大屏设备”对应的执行指令,在界面851上弹出窗口852。窗口852上显示有提示信息,用于询问用户是否播放来自手机的视频1。用户可以通过选择窗口852上的“播放”选项,以让大屏设备85播放视频1,也可以选择窗口852上的“取消”选项,以让大屏设备85不播放视频1。
另外,大屏设备85还将接收到的用户语音“推荐视频1给大屏设备”的意图、槽位和语料等信息作为本地语义理解模块的最新上下文。
用户83不想播放视频1,则对大屏设备85说“别理它”。大屏设备85采集到用户语音“别理它”之后,通过语音识别模块将该用户语音转换成文本,再通过语义理解模块,根据文本“别理它”、以及最新的上下文,即用户语音“推荐视频1给大屏设备”的意图和槽位等信息,确定出用户语音“别理它”的意图为取消播放,设备槽位的实体为大屏设备,并通过对话管理模块生成对应的执行指令。
大屏设备85获取到用户语音“别理它”的执行指令之后,响应于该执行指令,不播放视频1,在界面851上去除窗口852。
由上可见,本实施例中,用户可以通过语音“推荐视频1给大屏设备”,将视频播放任务从手机转接至大屏设备,并在转接业务的时候,手机将最近多轮对话的意图和槽位等信息,随着执行指令一并传输至大屏设备,这样大屏设备即可根据手机发送的历史意图和槽位等信息,准确识别用户语音的意图,实现手机和大屏设备之间的跨设备对话业务接续。
在另一些实施例中,第一电子设备41和第二电子设备42也可以不包括图1对应的语音业务系统,或者该语音业务系统的部分模块,而是将语音业务系统部署在第一电子设备41和第二电子设备42之外的设备上。
此时,第一电子设备41和第二电子设备42通常是瘦设备,即第一电子设备41和第二电子设备42由于处理资源和内存资源等十分有限,无法部署语音业务系统中的语音识别引擎、语义理解引擎以及对话管理引擎等。
当然,第一电子设备41和第二电子设备42也可以是富设备,此时,虽然第一电子设备41和第二电子设备42上具备部署语音业务系统的条件,但实际上是将语音业务系统部署在第一电子设备41和第二电子设备42之外的设备。
示例性地,参见图9示出的本申请实施例提供的跨设备的对话业务接续系统的另一种示意图,第一电子设备91包括第一应用程序911和第一指令交互服务912,第二电子设备92包括第二应用程序921和第二指令交互服务922,第三电子设备93包括第一语音识别模块931、第一语义理解模块932以及第一对话管理模块933,第四电子设备94包括第二语音识别模块941、第二语义理解模块942以及第二对话管理模块943。
第一电子设备91和第三电子设备93通信连接,第二电子设备92和第四电子设备94通信连接。
第三电子设备93和第四电子设备94可以是云端服务器,其还可以包括NLG模块和TTS模块。当然,第三电子设备93和第四电子设备94也可以是手机、电脑等终端设备。
需要说明的是,图5和图9的相似或相同之处,可以参见上文图5的介绍,在此不再赘述。下面结合图10对图9的系统流程进行介绍说明。
另外,第一电子设备91本地上存储有用户信息和/或应用状态信息,在向第二电子设备92发送第一执行指令、历史意图和对应槽位等信息时,可以一并将用户信息和/或应用状态信息发送至第二电子设备92。关于用户信息和应用状态信息等相关描述,可以参见上文对应内容,在此不再赘述。
参见图10示出的本申请实施例提供的跨设备的对话业务接续方法的另一种流程示意图,该方法可以包括以下步骤:
步骤S1001、第一电子设备91采集第一用户语音。
示例性地,参见图11示出的本申请实施例提供的耳机转接音乐至音箱播放的场景示意图,用户111可以向智能耳机112输入第一用户语音“将音乐转接至音箱播放”。具体地,用户111在回家途中使用智能耳机112播放音乐,且播放的音乐是存储在智能耳机112本地的音乐;回到家中之后,用户111想要将智能耳机112正在播放的音乐转接至智能音箱113播放,故对智能耳机112说“将音乐转接至音箱播放”。智能耳机112通过声音采集装置采集到第一用户语音“将音乐转接至音箱播放”。
其中,在该场景下,第一电子设备91为智能耳机112,第二电子设备92为智能音箱113,第三电子设备93为云端服务器114,第四电子设备94为云端服务器115。
智能耳机112包括处理器和存储器等,本地存储有多首歌曲,可以连接无线Wi-Fi,安装有语音助手应用程序。智能音箱113包括处理器和存储器等,可以连接无线Wi-Fi,安装有语音助手应用程序。
步骤S1002、第一电子设备91向第三电子设备93发送第一用户语音。
以图11的场景为例,智能耳机112在回到家中后,自动与家里的无线路由器连接,并在采集到第一用户语音“将音乐转接至音箱播放”之后,智能耳机112内的语音助手将第一用户语音“将音乐转接至音箱播放”通过无线Wi-Fi上传至云端服务器114。
步骤S1003、第三电子设备93将第一用户语音转化成第一文本。
示例性地,如图9所示,第三电子设备93接收到第一用户语音之后,将第一用户语音输入至第一语音识别模块931,第一语音识别模块931将第一用户语音转化成第一文本。
步骤S1004、第三电子设备93从第一文本中提取出第一意图和第一槽位。
如图9所示,第一语音识别模块931将得到的第一文本输入至第一语义理解模块932。第一语义理解模块932对第一文本进行意图识别,得到第一意图和第一槽位。
以图11的场景为例,第一文本为“将音乐转接至音箱播放”,从该第一文本中可以确定出第一意图为音乐转接播放,第一槽位包括目标设备槽位,且目标设备槽位的实体为音箱。
步骤S1005、第三电子设备93根据第一意图和第一槽位,生成第一用户语音对应的第一执行指令。
如图9所示,第一语义理解模块932得到第一意图和第一槽位之后,将第一意图和第一槽位输入至第一对话管理模块933。第一对话管理模块933则根据第一意图和第一槽位,生成第一执行指令。
以图11的场景为例,第一意图为音乐转接播放,第一槽位包括目标设备槽位,第一执行指令则为播放音乐的执行指令。
步骤S1006、第三电子设备93向第一电子设备91发送第一执行指令、第一意图和第一槽位。
需要说明的是,第三电子设备93还可以将第一文本传输至第一电子设备91。例如,图11的场景中,云端服务器114可以将第一用户语音“将音乐转接至音箱播放”的文本传输给智能耳机112。
步骤S1007、第一电子设备91向第二电子设备92发送第一执行指令、历史意图和对应槽位等信息,历史意图包括第一意图,对应槽位包括第一槽位。
需要说明的是,历史意图可以包括最近N轮对话的意图。除了可以将第一执行指令、历史意图、对应槽位传输至第二电子设备92之外,还可以将对话语料等上下文信息和应用状态信息传输至第二电子设备92。
最近N轮对话的意图、槽位以及语料等相关信息可以存储在第一电子设备91本地,也可以存储在第三电子设备93,此时,当第三电子设备93确定出需要向第二电子设备92发送执行指令、意图和槽位等信息之后,则将最近N轮对话的意图、槽位以及语料等相关信息一并传输给第一电子设备91。
在图11的场景中,智能耳机112在接收到云端服务器114发送的第一执行指令、第一意图和第一槽位之后,可以将当前正在播放的歌曲的相关信息、第一执行指令、历史意图和对应槽位等信息传输至智能音箱113。当前正在播放的歌曲的相关信息可以包括歌曲名称信息、歌手信息以及播放进度信息等。
智能耳机112和智能音箱113之间的连接可以是通过蓝牙连接、WiFi点对点连接,或者智能耳机112和智能音箱113连接至同一个无线路由器。
步骤S1008、第二电子设备92执行第一执行指令。
在图11的场景中,智能音箱113在接收到智能耳机112传输的信息之后,可以根据当前正在播放的歌曲的相关信息,得知歌曲名称和播放进度等信息,此时,歌曲名称为歌曲A;智能音箱113响应于第一执行指令,向用户111发出提示语音“来自耳机的歌曲A,是否播放?”。用户111不想播放歌曲A,则对智能音箱113说“换一个”。
当然,智能音箱113在响应于第一执行指令时,也可以直接播放对应的歌曲,不发出对应的提示语音。此时,用户111在智能音箱113播放歌曲A之后,可以对智能音箱113说“换一个”,以换一首歌曲播放。
步骤S1009、第二电子设备92采集第二用户语音。
在图11的场景中,第二用户语音为“换一个”。
步骤S1010、第二电子设备92向第四电子设备94发送第二用户语音、历史意图和对应槽位等信息。
步骤S1011、第四电子设备94将第二用户语音转化成第二文本。
如图9所示,第四电子设备94接收到第二用户语音、历史意图和对应槽位等信息之后,第四电子设备94可以将第二用户语音输入至第二语音识别模块941。第二语音识别模块941将第二用户语音转化成第二文本。
在图11的场景中,云端服务器115将第二用户语音“换一个”转化成第二文本“换一个”。
步骤S1012、第四电子设备94根据历史意图和对应槽位等信息,从第二文本中提取出第二意图和第二槽位。
如图9所示,第二语音识别模块941将第二文本输入至第二语义理解模块942。第二语义理解模块942根据第二文本、历史意图和对应槽位等信息,确定出第二用户语音的第二意图和第二槽位。
在图11的场景中,历史意图为音乐转接播放,目标设备槽位的实体为智能音箱113,第二文本为“换一个”,云端服务器115上的语义理解模块则可以确定出第二意图为播放音乐,第二槽位包括设备槽位,设备槽位的实体为智能音箱113。
步骤S1013、第四电子设备94根据第二意图和第二槽位,生成第二用户语音对应的第二执行指令。
如图9所示,第二语义理解模块942将得到的第二意图和第二槽位等信息输入至第二对话管理模块943,第二对话管理模块943输出第二执行指令。
步骤S1014、第四电子设备94向第二电子设备92发送第二执行指令。
步骤S1015、第二电子设备92执行第二执行指令。
以图11的场景为例,智能音箱113接收到云端服务器115发送的第二执行指令后,则响应于第二执行指令,自动播放歌曲B。
在又一些实施例中,第一电子设备41和第二电子设备42中的其中一个不包括图1对应的语音业务系统,或该语音业务系统的部分模块,而另一个则部署有图1对应的语音业务系统或者该语音业务系统的部分模块。
示例性地,参见图12示出的本申请实施例提供的跨设备的对话业务接续系统的又一种示意图,该系统可以包括第一电子设备121、第二电子设备122以及第三电子设备123。
第一电子设备121包括第一应用程序1211、第一语音识别模块1212、第一语义理解模块1213、第一对话管理模块1214以及第一指令交互服务1215。
第二电子设备122包括第二应用程序1221和第二指令交互服务1222。
第三电子设备123包括第二语音识别模块1231、第二语义理解模块1232以及第二对话管理模块1233。
另外,第一电子设备121本地上存储有用户信息和/或应用状态信息,在向第二电子设备122发送第一执行指令、历史意图和对应槽位等信息时,可以一并将用户信息和/或应用状态信息发送至第二电子设备122。关于用户信息和应用状态信息等相关描述,可以参见上文对应内容,在此不再赘述。
参见图13示出的本申请实施例提供的跨设备的对话业务接续方法的又一种流程示意图,该方法可以包括以下步骤:
步骤S1301、第一电子设备121采集第一用户语音。
步骤S1302、第一电子设备121将第一用户语音转化成第一文本。
如图12所示,第一电子设备121采集到第一用户语音之后,第一应用程序1211将第一用户语音输入至第一语音识别模块1212。第一语音识别模块1212将第一用户语音转化成第一文本。
以图14的场景为例,第一电子设备121为手机142,第二电子设备122为智能音箱143,第三电子设备123为云端服务器144。图14为本申请实施例提供的手机给智能音箱推荐音乐的场景示意图。
如图14所示,手机142显示播放界面1421,播放界面1421上显示手机142正在播放歌曲C。此时,用户141对手机142说“推荐一首歌曲给音箱”,手机142采集到第一用户语音“推荐一首歌曲给音箱”,并通过语音识别模块将第一用户语音“推荐一首歌曲给音箱”转化成第一文本“推荐一首歌曲给音箱”。
步骤S1303、第一电子设备121从第一文本中提取出第一意图和第一槽位。
如图12所示,第一语音识别模块1212将第一用户语音转化成第一文本之后,将第一文本输入至第一语义理解模块1213,以通过第一语义理解模块1213提取出第一意图和第一槽位。
在图14的场景中,第一文本为“推荐一首歌曲给音箱”,第一意图为推荐歌曲,第一槽位包括目标设备槽位,目标设备槽位的实体为音箱。
步骤S1304、第一电子设备121根据第一意图和第一槽位,生成第一用户语音对应的第一执行指令。
如图12所示,第一语义理解模块1213得到第一意图和第一槽位之后,将第一意图和第一槽位输入至第一对话管理模块1214,获得第一对话管理模块1214输出的第一执行指令。
在图14的场景中,第一执行指令为播放歌曲的指令。
如图14所示,手机142在采集到第一用户语音“推荐一首歌曲给音箱”之后,基于语音业务系统对第一用户语音进行处理,并显示界面1422,在界面1422上显示有第一文本,以及手机142针对第一用户语音的回答文本“好的”。
步骤S1305、第一电子设备121向第二电子设备122发送第一执行指令、历史意图和对应槽位等信息,历史意图包括第一意图,对应槽位包括第一槽位。
步骤S1306、第二电子设备122执行第一执行指令。
在图14的场景中,智能音箱143响应于第一执行指令,向用户141发出提示语音“来自手机的歌曲A,是否播放?”,以询问用户是否播放。当然,智能音箱143也可以响应于第一执行指令,自动播放歌曲A,不用询问用户是否播放。
步骤S1307、第二电子设备122采集第二用户语音。
如图14所示,智能音箱143在发出提示语音“来自手机的歌曲A,是否播放?”之后,用户141对智能音箱143说“换一个”,智能音箱143则采集到第二用户语音“换一个”。
步骤S1308、第二电子设备122向第三电子设备123发送第二用户语音、历史意图和对应槽位等信息。
步骤S1309、第三电子设备123将第二用户语音转化成第二文本。
如图12所示,第二电子设备122的第二应用程序1221将第二用户语音、历史意图和对应槽位等信息传输至第三电子设备123。第三电子设备123将第二用户语音输入至第二语音识别模块1231,获得第二语音识别模块1231输出的第二文本。
步骤S1310、第三电子设备123根据历史意图和对应槽位等信息,从第二文本中提取出第二意图和第二槽位。
如图12所示,第三电子设备123获得第二语音识别模块1231输出的第二文本之后,将第二文本输入至第二语义理解模块1232。第二语义理解模块1232根据第二文本、历史意图和对应槽位等信息,输出第二意图和第二槽位。
在图14的场景中,第二文本为“换一个”,历史意图包括推荐歌曲,对应槽位包括目标设备槽位,该目标设备槽位的实体为智能音箱143,第二意图为播放歌曲,第二槽位包括设备槽位,该设备槽位的实体为智能音箱143。
步骤S1311、第三电子设备123根据第二意图和第二槽位,生成第二用户语音对应的第二执行指令。
如图12所示,第二语义理解模块1232将第二意图和第二槽位输入至第二对话管理模块1233,获得第二对话管理模块1233输出的第二执行指令。
步骤S1312、第三电子设备123向第二电子设备122发送第二执行指令。
步骤S1313、第二电子设备122执行第二执行指令。
在图14的场景中,智能音箱143接收到第二执行指令之后,响应于第二执行指令,自动播放歌曲B。
本申请实施例涉及的电子设备的类型可以是任意的。示例性地,第一电子设备可以为但不限于手机、平板电脑、智能音箱、智慧大屏(也可称为智能电视)或者可穿戴式设备等。同理,第二电子设备也可以为但不限于手机、平板电脑、智能音箱、智慧大屏(也可称为智能电视)或者可穿戴式设备等。
作为示例而非限定,第一电子设备或第二电子设备的具体结构可以如图15所示。图15为本申请实施例提供的电子设备硬件结构示意图。
如图15所示,电子设备1500可以包括处理器1510、内部存储器1520、通信模块1530、音频模块1540、扬声器1541、麦克风1542以及天线。
可以理解的是,本申请实施例示意的结构并不构成对电子设备1500的具体限定。在本申请另一些实施例中,电子设备1500可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
例如,当电子设备1500为手机时,电子设备1500还可以包括外部存储器接口,通用串行总线(universal serial bus,USB)接口,充电管理模块,电源管理模块,电池,受话器,耳机接口,传感器模块,按键,马达,指示器,摄像头,显示屏,以及用户标识模块(subscriber identification module,SIM)卡接口等。其中传感器模块可以包括压力传感器,陀螺仪传感器,气压传感器,磁传感器,加速度传感器,距离传感器,接近光传感器,指纹传感器,温度传感器,触摸传感器,环境光传感器,骨传导传感器等。
其中,处理器1510可以包括一个或多个处理单元,例如:处理器1510可以包括应用处理器(application processor,AP),调制解调处理器,控制器,存储器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备1500的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。处理器1510中还可以设置存储器,用于存储指令和数据。
在一些实施例中,处理器1510可以包括一个或多个接口。接口可以包括集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用输入输出(general-purpose input/output,GPIO)接口等。
I2S接口可以用于音频通信。在一些实施例中,处理器1510可以包含多组I2S总线。处理器1510可以通过I2S总线与音频模块1540耦合,实现处理器1510与音频模块1540之间的通信。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在一些实施例中,音频模块1540与通信模块1530中的无线通信模块可以通过PCM总线接口耦合。I2S接口和PCM接口都可以用于音频通信。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器1510与音频模块1540等。GPIO接口还可以被配置为I2S接口等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备1500的结构限定。在本申请另一些实施例中,电子设备1500也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
通信模块1530可以包括移动通信模块和/或无线通信模块。
电子设备1500的无线通信功能可以通过天线、移动通信模块、无线通信模块、调制解调处理器以及基带处理器等实现。
天线用于发射和接收电磁波信号。移动通信模块可以提供应用在电子设备1500上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块可以由天线接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块还可以对经调制解调处理器调制后的信号放大,经天线转为电磁波辐射出去。在一些实施例中,移动通信模块的至少部分功能模块可以被设置于处理器1510中。在一些实施例中,移动通信模块的至少部分功能模块可以与处理器1510的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器等)输出声音信号。
无线通信模块可以提供应用在电子设备1500上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块经由天线接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器1510。无线通信模块还可以从处理器1510接收待发送的信号,对其进行调频,放大,经天线转为电磁波辐射出去。
例如,手机通过Wi-Fi点对点连接,将执行指令、历史意图和槽位等信息传输至大屏设备。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断地自学习。通过NPU可以实现电子设备1500的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。例如,手机通过NPU实现对输入的用户语音进行识别,得到用户语音的文本信息;以及实现对用户语音的文本信息进行语义理解,提取出用户语音的槽位和意图等。
内部存储器1520可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器1510通过运行存储在内部存储器1520的指令,从而执行电子设备1500的各种功能应用以及数据处理。内部存储器1520可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能)等。存储数据区可存储电子设备1500使用过程中所创建的数据(比如音频数据等)等。此外,内部存储器1520可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。
本申请实施例中,内部存储器1520中存储有语音助手应用程序或者集成有语音助手功能的应用程序。
电子设备1500可以通过音频模块1540,扬声器1541,麦克风1542以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块1540用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块1540还可以用于对音频信号编码和解码。在一些实施例中,音频模块1540可以设置于处理器1510中,或将音频模块1540的部分功能模块设置于处理器1510中。
扬声器1541,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备1500可以通过扬声器1541收听音乐,或收听免提通话。
麦克风1542,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当用户和电子设备的语音助手对话时,用户可以通过人嘴靠近麦克风1542发声,将声音信号输入到麦克风1542。电子设备1500可以设置至少一个麦克风1542。在另一些实施例中,电子设备1500可以设置两个麦克风1542,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备1500还可以设置三个,四个或更多麦克风1542,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
本申请实施例中,电子设备1500可以通过麦克风1542以及音频模块1540采集用户语音,通过扬声器1541以及音频模块1540输出语音,以实现人机对话。
在介绍完电子设备的硬件架构之后,下面将对该电子设备的软件系统架构进行介绍。
电子设备1500的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本申请实施例以分层架构为例,示例性说明电子设备1500的软件结构。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一些应用程序包,例如,应用程序包可以包括语音助手、相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。示例性地,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。电话管理器用于提供电子设备1500的通信功能。例如通话状态的管理(包括接通,挂断等)。 资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
本申请实施例提供的电子设备,可以包括存储器、处理器以及存储在存储器中并可在处理器上运行的计算机程序,处理器执行计算机程序时实现如上述方法实施例中任一项的方法。
本申请实施例还提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,计算机程序被处理器执行时实现可实现上述各个方法实施例中的步骤。
本申请实施例提供了一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备执行时实现可实现上述各个方法实施例中的步骤。
本申请实施例还提供一种芯片系统,所述芯片系统包括处理器,所述处理器与存储器耦合,所述处理器执行存储器中存储的计算机程序,以实现如上述各个方法实施例所述的方法。所述芯片系统可以为单个芯片,或者多个芯片组成的芯片模组。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。此外,在本申请说明书和所附权利要求书的描述中,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。在本申请说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。
最后应说明的是:以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (38)

  1. 一种跨设备的对话业务接续系统,其特征在于,包括第一电子设备和至少一个第二电子设备:
    所述第一电子设备用于采集第一用户语音;若确定所述第一用户语音包含用于指示向所述第二电子设备发送指令的信息,则向所述第二电子设备发送第一信息和第一执行指令,所述第一信息包括用于描述所述第一用户语音的意图的信息,所述第一执行指令为所述第一用户语音对应的执行指令;
    所述第二电子设备用于在接收到所述第一执行指令之后,采集第二用户语音;执行所述第二用户语音对应的第二执行指令,所述第二执行指令为根据所述第一信息和所述第二用户语音生成的指令。
  2. 根据权利要求1所述的系统,其特征在于,所述用于描述所述第一用户语音的意图的信息包括所述第一用户语音的第一文本和/或所述第一用户语音的第一意图。
  3. 根据权利要求2所述的系统,其特征在于,所述第一信息包括N轮对话的文本和/或意图,N为大于1的正整数;
    所述N轮对话的文本包括所述第一用户语音的第一文本,所述N轮对话的意图包括所述第一用户语音的第一意图;
    其中,所述N轮对话为所述第一电子设备采集的用户对话。
  4. 根据权利要求1所述的系统,其特征在于,所述第一执行指令包括用于表征所述第一用户语音的槽位的信息。
  5. 根据权利要求1至4任一项所述的系统,其特征在于,所述第一电子设备具体用于:
    对所述第一用户语音进行语音识别,得到第一文本;
    对所述第一文本进行语义理解,得到所述第一用户语音的第一意图和第一槽位;
    若所述第一槽位包括目标设备槽位,且所述目标设备槽位的实体为所述第二电子设备,则确定所述第一用户语音包含用于指示向所述第二电子设备发送指令的信息;
    根据所述第一意图和所述第一槽位,生成所述第一用户语音对应的所述第一执行指令。
  6. 根据权利要求1至4任一项所述的系统,其特征在于,所述系统还包括与所述第一电子设备通信连接的第三电子设备;所述第一电子设备具体用于:
    向所述第三电子设备发送所述第一用户语音;
    接收来自所述第三电子设备的第一槽位、第一意图和第一执行指令,所述第一槽位和所述第一意图为所述第三电子设备从所述第一用户语音中提取的,所述第一执行指令为所述第三电子设备根据所述第一槽位和所述第一意图生成的所述第一用户语音对应的执行指令;
    若所述第一槽位包括目标设备槽位,且所述目标设备槽位的实体为所述第二电子设备,则确定所述第一用户语音包含用于指示向所述第二电子设备发送指令的信息。
  7. 根据权利要求1至6任一项所述的系统,其特征在于,所述第二电子设备具体用于:
    对所述第二用户语音进行语音识别,得到第二文本;
    根据所述第一信息,对所述第二文本进行语义理解,得到所述第二用户语音的语义信息;
    根据所述第二用户语音的语义信息,生成所述第二用户语音对应的第二执行指令。
  8. 根据权利要求7所述的系统,其特征在于,所述第二电子设备具体用于:
    将所述第一信息作为语义理解模块的最新上下文,所述第二电子设备包括所述语义理解模块;
    将所述第二文本输入所述语义理解模块,获得所述语义理解模块输出的所述第二用户语音的语义信息,其中,所述语义理解模块采用所述最新上下文对所述第二文本进行语义理解。
  9. 根据权利要求1至6任一项所述的系统,其特征在于,所述系统还包括与所述第二电子设备通信连接的第四电子设备;
    所述第二电子设备具体用于:
    向所述第四电子设备发送所述第二用户语音和所述第一信息;
    接收来自所述第四电子设备的第二用户语音的语义信息和第二执行指令;
    其中,所述第二用户语音的语义信息为所述第四电子设备根据所述第一信息对所述第二用户语音进行语义理解得到的信息,所述第二执行指令为所述第四电子设备根据所述第二用户语音的语义信息生成的所述第二用户语音对应的执行指令。
  10. 根据权利要求7所述的系统,其特征在于,所述第一电子设备具体用于:
    确定所述第一电子设备的用户账号和所述第二电子设备的用户账号是否为同一个用户;
    若是,向所述第二电子设备发送所述第一执行指令和所述第一信息,并向所述第二电子设备发送第二信息,所述第二信息包括第一用户信息、场景信息和第一应用状态信息中的任意一种或任意组合;
    其中,所述第一用户信息为用于描述所述第一电子设备的用户的信息,所述第一应用状态信息为用于表征所述第一电子设备上的第一目标应用的信息,所述场景信息为用于描述用户场景的信息;
    所述第二电子设备具体用于:
    根据第一信息、所述第二用户语音和所述第二信息,生成所述第二执行指令。
  11. 根据权利要求10所述的系统,其特征在于,所述第一电子设备具体用于:
    若所述第一电子设备的用户账号和所述第二电子设备的用户账号不是同一个用户,向所述第二电子设备发送所述第一执行指令和所述第一信息;
    所述第二电子设备具体用于:
    根据所述第一信息、所述第二用户语音和第三信息,生成所述第二执行指令,所述第三信息包括第二用户信息和/或第二应用状态信息;
    其中,所述第二用户信息为用于描述所述第二电子设备的用户的信息,所述第二应用状态信息为用于表征所述第二电子设备上的第二目标应用的信息。
  12. 根据权利要求1所述的系统,其特征在于,若存在至少两个所述第二电子设备,且所述至少两个第二电子设备与所述第一电子设备的连接方式不同,所述第一电子设备具体用于:
    确定与所述至少两个第二电子设备之间的通信连接的类型;
    根据所述通信连接的类型,通过不同的所述通信连接分别向所述至少两个第二电子设备发送所述第一信息和所述第一执行指令。
  13. 根据权利要求1所述的系统,其特征在于,所述第二电子设备具体用于:
    在执行所述第一执行指令时,或提示用户是否执行所述第一执行指令时,采集所述第二用户语音。
  14. 根据权利要求1所述的系统,其特征在于,所述第二电子设备还用于:
    在接收所述第一执行指令后,唤醒语音助手,所述第二电子设备包括所述语音助手。
  15. 根据权利要求1所述的系统,其特征在于,所述第一执行指令为推荐音乐的指令,第二执行指令为用于推荐另一首歌曲的指令。
  16. 一种跨设备的对话业务接续方法,其特征在于,应用于第一电子设备,所述方法包括:
    采集第一用户语音;
    确定所述第一用户语音包含用于指示向第二电子设备发送指令的信息后,向所述第二电子设备发送第一信息和第一执行指令;
    其中,所述第一信息包括用于描述所述第一用户语音的意图的信息,所述第一执行指令为所述第一用户语音对应的执行指令。
  17. 根据权利要求16所述的方法,其特征在于,所述用于描述所述第一用户语音的意图的信息包括所述第一用户语音的第一文本和/或所述第一用户语音的第一意图。
  18. 根据权利要求17所述的方法,其特征在于,所述第一信息包括N轮对话的文本和/或意图,N为大于1的正整数;
    所述N轮对话的文本包括所述第一用户语音的第一文本,所述N轮对话的意图包括所述第一用户语音的第一意图;
    其中,所述N轮对话为所述第一电子设备采集的用户对话。
  19. 根据权利要求16所述的方法,其特征在于,所述第一执行指令包括用于表征所述第一用户语音的槽位的信息。
  20. 根据权利要求16至19任一项所述的方法,其特征在于,确定所述第一用户语音包含用于指示所述第二电子设备的信息后,向所述第二电子设备发送第一信息和第一执行指令,包括:
    对所述第一用户语音进行语音识别,得到第一文本;
    对所述第一文本进行语义理解,得到所述第一用户语音的第一意图和第一槽位;
    若所述第一槽位包括目标设备槽位,且所述目标设备槽位的实体为所述第二电子设备,则确定所述第一用户语音包含用于指示向所述第二电子设备发送指令的信息;
    根据所述第一意图和所述第一槽位,生成所述第一用户语音对应的所述第一执行指令;
    向所述第二电子设备发送所述第一信息和所述第一执行指令,所述第一信息包括所述第一意图和/或所述第一文本。
  21. 根据权利要求16至19任一项所述的方法,其特征在于,确定所述第一用户语音包含用于指示所述第二电子设备的信息后,向所述第二电子设备发送第一信息和第一执行指令,包括:
    向第三电子设备发送所述第一用户语音;
    接收来自所述第三电子设备的第一槽位、第一意图和第一执行指令,所述第一槽位和所述第一意图为所述第三电子设备从所述第一用户语音中提取的,所述第一执行指令为所述第三电子设备根据所述第一槽位和所述第一意图生成的所述第一用户语音对应的执行指令;
    若所述第一槽位包括目标设备槽位,且所述目标设备槽位的实体为所述第二电子设备,则确定所述第一用户语音包含用于指示向所述第二电子设备发送指令的信息;
    向所述第二电子设备发送所述第一信息和所述第一执行指令,所述第一信息包括所述第一意图和/或所述第一用户语音的第一文本。
  22. 根据权利要求16所述的方法,其特征在于,在向所述第二电子设备发送所述第一信息和所述第一执行指令之前,所述方法还包括:
    确定所述第一电子设备的用户账号和所述第二电子设备的用户账号是否为同一个用户;
    若是,进入向所述第二电子设备发送所述第一执行指令和所述第一信息的步骤,并向所述第二电子设备发送第二信息,所述第二信息包括第一用户信息、场景信息和第一应用状态信息中的任意一种或任意组合;
    其中,所述第一用户信息为用于描述所述第一电子设备的用户的信息,所述场景信息为用于描述用户场景的信息,所述第一应用状态信息为用于表征所述第一电子设备上的第一目标应用的信息。
  23. 根据权利要求16所述的方法,其特征在于,若存在至少两个所述第二电子设备,向所述第二电子设备发送所述第一信息和所述第一执行指令,包括:
    确定与所述至少两个第二电子设备之间的通信连接的类型;
    根据所述通信连接的类型,通过不同的所述通信连接分别向所述至少两个第二电子设备发送所述第一信息和所述第一执行指令。
  24. 根据权利要求16所述的方法,其特征在于,所述第一执行指令为推荐音乐的指令。
  25. 一种跨设备的对话业务接续方法,其特征在于,应用于第二电子设备,所述方法包括:
    接收来自第一电子设备的第一执行指令和第一信息,所述第一信息包括用于描述第一用户语音的意图的信息,所述第一执行指令为所述第一用户语音对应的执行指令,所述第一用户语音为所述第一电子设备采集的,且包含用于指示向所述第二电子设备发送指令的信息的语音;
    采集第二用户语音;
    执行所述第二用户语音对应的第二执行指令,所述第二执行指令为根据所述第一信息和所述第二用户语音生成的指令。
  26. 根据权利要求25所述的方法,其特征在于,所述用于描述所述第一用户语音的意图的信息包括所述第一用户语音的第一文本和/或所述第一用户语音的第一意图。
  27. 根据权利要求26所述的方法,其特征在于,所述第一信息包括N轮对话的文本和/或意图,N为大于1的正整数;
    所述N轮对话的文本包括所述第一用户语音的第一文本,所述N轮对话的意图包括所述第一用户语音的第一意图;
    其中,所述N轮对话为所述第一电子设备采集的用户对话。
  28. 根据权利要求25所述的方法,其特征在于,所述第一执行指令包括用于表征所述第一用户语音的槽位的信息。
  29. 根据权利要求25至28任一项所述的方法,其特征在于,执行所述第二用户语音对应的第二执行指令,包括:
    对所述第二用户语音进行语音识别,得到第二文本;
    根据所述第一信息,对所述第二文本进行语义理解,得到所述第二用户语音的语义信息;
    根据所述第二用户语音的语义信息,生成所述第二用户语音对应的第二执行指令;
    执行所述第二执行指令。
  30. 根据权利要求29所述的方法,其特征在于,根据所述第一信息,对所述第二文本进行语义理解,得到所述第二用户语音的语义信息,包括:
    将所述第一信息作为语义理解模块的最新上下文,所述第二电子设备包括所述语义理解模块;
    将所述第二文本输入所述语义理解模块,获得所述语义理解模块输出的所述第二用户语音的语义信息,其中,所述语义理解模块采用所述最新上下文对所述第二文本进行语义理解。
  31. 根据权利要求29所述的方法,其特征在于,若所述第一电子设备的用户账号和所述第二电子设备的用户账号是同一个用户,所述方法还包括:
    接收来自所述第一电子设备的第二信息,所述第二信息包括第一用户信息、场景信息和第一应用状态信息中的任意一种或任意组合;
    根据所述第二用户语音的语义信息,生成所述第二用户语音对应的第二执行指令,包括:
    根据所述语义信息和所述第二信息,生成所述第二执行指令;
    其中,所述第一用户信息为用于描述所述第一电子设备的用户的信息,所述场景信息为用于描述用户场景的信息,所述第一应用状态信息为用于表征所述第一电子设备上的第一目标应用的信息。
  32. 根据权利要求29所述的方法,其特征在于,若所述第一电子设备的用户账号和所述第二电子设备的用户账号不是同一个用户,根据所述第二用户语音的语义信息,生成所述第二用户语音对应的第二执行指令,包括:
    根据所述语义信息和第三信息,生成所述第二执行指令;
    所述第三信息包括第二用户信息和/或第二应用状态信息,所述第二用户信息为用于描述所述第二电子设备的用户的信息,所述第二应用状态信息为用于表征所述第二电子设备上的第二目标应用的信息。
  33. 根据权利要求25至28任一项所述的方法,其特征在于,执行所述第二用户语音对应的第二执行指令,包括:
    向第四电子设备发送所述第二用户语音和所述第一信息;
    接收来自所述第四电子设备的第二用户语音的语义信息和所述第二执行指令;
    其中,所述第二用户语音的语义信息为所述第四电子设备根据所述第一信息对所述第二用户语音进行语义理解得到的信息,所述第二执行指令为所述第四电子设备根据所述第二用户语音的语义信息生成的所述第二用户语音对应的执行指令;
    执行所述第二执行指令。
  34. 根据权利要求25所述的方法,其特征在于,采集第二用户语音,包括:
    在执行所述第一执行指令时,或提示用户是否执行所述第一执行指令时,采集所述第二用户语音。
  35. 根据权利要求25所述的方法,其特征在于,在采集所述第二用户语音之前,所述方法还包括:
    在接收所述第一执行指令后,唤醒语音助手,所述第二电子设备包括所述语音助手。
  36. 根据权利要求25所述的方法,其特征在于,所述第二执行指令为用于推荐另一首歌曲的指令。
  37. 一种电子设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求16至24或25至36任一项所述的方法。
  38. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求16至24或25至36任一项所述的方法。
PCT/CN2022/084544 2021-06-18 2022-03-31 跨设备的对话业务接续方法、系统、电子设备和存储介质 WO2022262366A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22823857.2A EP4343756A1 (en) 2021-06-18 2022-03-31 Cross-device dialogue service connection method, system, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110681520.3A CN115497470A (zh) 2021-06-18 2021-06-18 跨设备的对话业务接续方法、系统、电子设备和存储介质
CN202110681520.3 2021-06-18

Publications (1)

Publication Number Publication Date
WO2022262366A1 true WO2022262366A1 (zh) 2022-12-22

Family

ID=84465250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/084544 WO2022262366A1 (zh) 2021-06-18 2022-03-31 跨设备的对话业务接续方法、系统、电子设备和存储介质

Country Status (3)

Country Link
EP (1) EP4343756A1 (zh)
CN (1) CN115497470A (zh)
WO (1) WO2022262366A1 (zh)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105632496A (zh) * 2016-03-21 2016-06-01 珠海市杰理科技有限公司 语音识别控制装置和智能家具系统
CN106228985A (zh) * 2016-07-18 2016-12-14 广东志高空调有限公司 一种语音控制系统、控制器和家用电器设备
CN110741431A (zh) * 2017-05-16 2020-01-31 谷歌有限责任公司 跨设备切换
CN107371052A (zh) * 2017-06-13 2017-11-21 北京小米移动软件有限公司 设备控制方法及装置
CN108737933A (zh) * 2018-05-30 2018-11-02 上海与德科技有限公司 一种基于智能音箱的对话方法、装置及电子设备
CN109377992A (zh) * 2018-10-10 2019-02-22 四川长虹电器股份有限公司 基于无线通信的全空间语音交互物联网控制系统及方法
CN112017652A (zh) * 2019-05-31 2020-12-01 华为技术有限公司 一种交互方法和终端设备
US20200411001A1 (en) * 2019-06-25 2020-12-31 Miele & Cie. Kg Method for controlling the operation of an appliance by a user through voice control

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133284A (zh) * 2023-04-06 2023-11-28 荣耀终端有限公司 一种语音交互方法及电子设备

Also Published As

Publication number Publication date
CN115497470A (zh) 2022-12-20
EP4343756A1 (en) 2024-03-27

Similar Documents

Publication Publication Date Title
CN111345010B (zh) 一种多媒体内容同步方法、电子设备及存储介质
JP7222112B2 (ja) 歌の録音方法、音声補正方法、および電子デバイス
EP4030422B1 (en) Voice interaction method and device
WO2023029967A1 (zh) 一种播放音频的方法及电子设备
WO2022052776A1 (zh) 一种人机交互的方法、电子设备及系统
CN111666119A (zh) Ui组件显示的方法及电子设备
WO2020249091A1 (zh) 一种语音交互方法、装置及系统
CN111724775A (zh) 一种语音交互方法及电子设备
JP2022518656A (ja) 着信があるときに電子デバイス上に映像を提示するための方法、および電子デバイス
CN109819306B (zh) 一种媒体文件裁剪的方法、电子设备和服务器
WO2020239001A1 (zh) 一种哼唱识别方法及相关设备
WO2022135527A1 (zh) 一种视频录制方法及电子设备
WO2023273321A1 (zh) 一种语音控制方法及电子设备
WO2022143258A1 (zh) 一种语音交互处理方法及相关装置
WO2022262366A1 (zh) 跨设备的对话业务接续方法、系统、电子设备和存储介质
WO2022135485A1 (zh) 电子设备及其主题设置方法和介质
WO2022135157A1 (zh) 页面显示的方法、装置、电子设备以及可读存储介质
WO2022088964A1 (zh) 一种电子设备的控制方法和装置
CN113449068A (zh) 一种语音交互方法及电子设备
CN114844984B (zh) 通知消息的提醒方法及电子设备
WO2023071940A1 (zh) 跨设备的导航任务的同步方法、装置、设备及存储介质
WO2023273904A1 (zh) 音频数据的存储方法及其相关设备
WO2022053062A1 (zh) 一种IoT设备的管理方法及终端
CN114168160A (zh) 应用模块启动方法和电子设备
CN109348353B (zh) 智能音箱的服务处理方法、装置和智能音箱

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22823857

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022823857

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022823857

Country of ref document: EP

Effective date: 20231213

NENP Non-entry into the national phase

Ref country code: DE