WO2024078419A1 - Voice interaction method, voice interaction apparatus and electronic device - Google Patents

Voice interaction method, voice interaction apparatus and electronic device

Info

Publication number
WO2024078419A1
WO2024078419A1 (application PCT/CN2023/123414)
Authority
WO
WIPO (PCT)
Prior art keywords
input
user
electronic device
time
voice
Prior art date
Application number
PCT/CN2023/123414
Other languages
English (en)
Chinese (zh)
Inventor
陈家胜 (Chen Jiasheng)
李亚楠 (Li Yanan)
梅文胜 (Mei Wensheng)
曹猛猛 (Cao Mengmeng)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024078419A1

Classifications

    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 2015/223: Execution procedure of a spoken command
    • H04M 1/72433: User interfaces specially adapted for cordless or mobile telephones, with means for local support of applications that increase the functionality, with interactive means for internal management of messages, for voice messaging, e.g. dictaphones
    • H04M 1/72436: User interfaces specially adapted for cordless or mobile telephones, with means for local support of applications that increase the functionality, with interactive means for internal management of messages, for text messaging, e.g. short messaging services [SMS] or e-mails

Definitions

  • Embodiments of the present application relate to the field of electronic devices, and more specifically, to a voice interaction method, a voice interaction device, and an electronic device.
  • Multi-round dialogue is a typical application scenario of human-computer interaction. Through multiple rounds of dialogue with users, the voice interaction system can determine the user's true intentions and perform corresponding actions.
  • the embodiments of the present application provide a voice interaction method, a voice interaction device, and an electronic device, which can reduce the time spent by users on voice interaction, quickly determine the user's true intention, and improve the user experience.
  • a voice interaction method is provided, applied to an electronic device, including: receiving a first voice input from a user, the first voice input including a first slot; when the first slot includes at least two candidate items, extending a reception countdown duration from a first time to a second time, the reception countdown duration being the time during which the electronic device remains in a reception state after receiving the first voice input from the user; displaying a first card on a first interface, the first card being used to prompt the user to determine a target candidate item for the first slot, the first card including the at least two candidate items; and determining the target candidate item based on the at least two candidate items, or based on a second input of the user during a first reception period, the first reception period being the time period during which the electronic device remains in the reception state after receiving the first voice input from the user.
  • when the electronic device determines that the first slot of the user's first voice input includes at least two candidate items, it can extend the reception countdown duration and display a first card prompting the user to determine the target candidate item, allowing the user to do so within the extended reception time (the first reception period). In this way, there is no need to start a next round of dialogue and announce a prompt to the user; instead, the user can determine the target candidate item in the current round of dialogue, which reduces the total duration of the voice interaction and improves the user experience.
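The countdown-extension step described above can be sketched as follows. This is an illustrative reconstruction, not code from the patent: the names `ReceptionSession`, `FIRST_TIME`, and `SECOND_TIME`, and the concrete timing values, are all assumptions.

```python
FIRST_TIME = 2.0   # default reception countdown in seconds (assumed value)
SECOND_TIME = 6.0  # extended countdown when the slot is ambiguous (assumed value)

class ReceptionSession:
    """Illustrative model of one round of voice reception."""

    def __init__(self):
        self.countdown = FIRST_TIME
        self.card = None  # the "first card" shown when candidates are ambiguous

    def on_first_voice_input(self, slot_candidates):
        """When the first slot resolves to two or more candidate items,
        extend the countdown and build a card prompting the user to
        pick the target candidate item."""
        if len(slot_candidates) >= 2:
            self.countdown = SECOND_TIME
            self.card = {
                "prompt": "determine the target candidate item",
                "items": list(slot_candidates),
            }
        return self.countdown
```

A session with an unambiguous slot keeps the default countdown and shows no card; an ambiguous one gets the extended countdown and a card listing the candidates.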
  • the at least two candidate items include a default candidate item.
  • the electronic device provides the user with at least two candidate items including a default candidate item.
  • the default candidate item may be determined based on the popularity of the at least two candidate items.
  • the default candidate item may be the target candidate item corresponding to the user's actual intention, reducing the complexity of the user's operation.
  • determining the target candidate item based on the at least two candidate items includes: when the electronic device does not receive user input during the first reception period, determining the target candidate item based on the default candidate item.
  • the first card that the electronic device presents to the user carries a default candidate item for the first slot. If the user makes no input within the extended reception time, it can be considered that the user accepts the default candidate item provided by the electronic device as the candidate item the user actually intends. The electronic device can then use the default candidate item as the target candidate item and provide the user with the corresponding service.
  • the second input is used to select the target candidate item from the at least two candidate items; or, the second input is used to input the target candidate item, and the target candidate item does not belong to the at least two candidate items.
  • the electronic device can determine the target candidate item according to the user's second input.
  • when the at least two candidate items provided by the electronic device include the target candidate item that the user actually intends, the user can determine that candidate item as the target candidate item for the first slot by, for example, saying "item x" by voice, speaking the candidate item directly, tapping the candidate item on the screen, or entering it as text.
  • the at least two candidate items provided by the electronic device do not include the target candidate item that the user actually intends, and the user can directly input the target candidate item by voice input, text input, etc.
  • the electronic device can provide the user with corresponding services based on the target candidate item. In this way, the user can determine the target candidate item in a variety of ways, and the existence of the default candidate item reduces the complexity of the user's operation.
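The candidate-resolution rules in the bullets above (timeout falls back to the default, the input may match a shown candidate, reference one by position, or supply a new one) can be sketched as a single resolver. This is a hypothetical illustration; the function name and the "item x" parsing convention are assumptions, not the patent's specification.

```python
def determine_target(candidates, default, second_input=None):
    """Resolve the target candidate item for the first slot (sketch):
    - no input during the reception period -> default candidate item
    - input equal to one of the shown candidates -> that candidate
    - input like "item 2" -> candidate selected by position
    - anything else -> treated as a directly supplied target candidate
      that does not belong to the shown candidates
    """
    if second_input is None:
        return default
    text = second_input.strip()
    if text in candidates:
        return text
    if text.lower().startswith("item "):
        suffix = text.split()[1]
        if suffix.isdigit():
            idx = int(suffix) - 1
            if 0 <= idx < len(candidates):
                return candidates[idx]
    return text
```

The default candidate item only matters on the silent path, which is what lets the round complete without a second dialogue turn.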
  • the method further includes: after the electronic device detects that the time of the first pause is greater than a preset threshold, determining that the first slot includes the at least two candidate items.
  • while the user is still speaking, the first slot of the first voice input is queried to determine whether it includes at least two candidate items. In this way, there is no need to wait until the user's voice input is complete to query whether the first slot includes multiple candidate items, and the reception countdown duration can be adjusted in time. This prevents the electronic device from having to start a second round of dialogue just so that the user can supplement information for the first slot, reduces the voice interaction time between the electronic device and the user, and improves the user experience.
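The pause-triggered slot query above depends on detecting the "first pause" in the utterance. A minimal sketch, assuming word-level timestamps from the recognizer; the threshold value is illustrative, since the patent only says "preset threshold":

```python
PAUSE_THRESHOLD = 0.5  # seconds; assumed value for illustration

def detect_first_pause(word_end_times, threshold=PAUSE_THRESHOLD):
    """Return True as soon as the gap between two consecutive recognized
    words exceeds the threshold, i.e. the first pause that would trigger
    querying the first slot for candidate items."""
    for prev, cur in zip(word_end_times, word_end_times[1:]):
        if cur - prev > threshold:
            return True
    return False
```

Triggering the query on this pause, rather than on end of speech, is what buys the time to extend the countdown within the same round.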
  • the method further includes: displaying prompt information on the first interface to prompt the user of a default action corresponding to the default candidate item.
  • the method further includes: receiving a fourth input from the user during the first reception period, the fourth input corresponding to a second instruction, the second instruction indicating that the default candidate item is not the target candidate item the user actually intends.
  • the electronic device will also display a prompt message prompting a default action on the first interface.
  • the default action corresponds to the default candidate item.
  • the user can intuitively see the default operation performed by the electronic device after the sound reception ends.
  • the user can immediately supplement the input.
  • the user can trigger a cancellation command by inputting "cancel", "do not execute" and other voice commands, so that the electronic device knows that the default candidate item is not the target candidate item that the user actually intends, and thus does not execute the default action.
  • the user can also supplement the input of the target candidate item that the user actually intends, so that the electronic device will provide the user with corresponding services based on the target candidate item.
  • the method further includes: displaying a control on the first interface, wherein the control is used to prompt the user of the remaining value of the reception countdown duration.
  • the electronic device displays a control corresponding to the remaining value of the reception countdown duration on the interface, so that when the user believes the remaining duration is insufficient for a supplementary input, the duration can be extended through an instruction. This avoids having to make the supplementary input in a next round of dialogue because the reception period timed out, reducing the complexity of the user's operations and the total interaction time.
  • the method also includes: receiving a third input from the user during the first reception period, the third input corresponding to a first instruction, the first instruction being used to extend the reception countdown duration; and, based on the third input, extending the reception countdown duration from the second time to a third time.
  • when the remaining value of the reception countdown is insufficient, the user can trigger a wait instruction to extend the countdown by using a sentence containing words such as "wait" or "hold on", which is convenient to operate.
  • the electronic device can allow the user to edit the words corresponding to the wait instruction themselves, so as to match the user's habits and enhance the user experience.
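The wait-instruction behavior, including the user-editable word list, can be sketched as follows. All names and the concrete third-time value are illustrative assumptions, not the patent's API:

```python
THIRD_TIME = 12.0  # further-extended countdown in seconds (assumed value)

class WaitHandler:
    """Sketch: utterances containing a wait word trigger the first
    instruction and extend the reception countdown; the word list is
    editable by the user."""

    def __init__(self):
        # default wait words are illustrative placeholders
        self.wait_words = {"wait", "hold on"}

    def add_wait_word(self, word):
        """Let the user register a custom wait word."""
        self.wait_words.add(word.lower())

    def maybe_extend(self, utterance, current_countdown):
        """Return the new countdown: extended to the third time if the
        utterance contains any wait word, unchanged otherwise."""
        if any(w in utterance.lower() for w in self.wait_words):
            return THIRD_TIME
        return current_countdown
```

Matching on user-defined substrings keeps the mechanism aligned with the user's own phrasing habits, as the bullet above suggests.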
  • the method further includes: sending a first request according to the target candidate item, the first request being used to request the provision of a service corresponding to the target candidate item.
  • the electronic device can send a first request to the application corresponding to the intention of the first voice input according to the target candidate item, so that that application can provide the corresponding service to the user.
  • the method further includes: receiving a fifth input from the user during the first sound recording period, the fifth input corresponding to a third instruction, and the third instruction is used to end execution of an action corresponding to the first voice input.
  • the electronic device when it receives the fifth input from the user, it can stop executing the action corresponding to the first voice input and stop receiving the sound.
  • the fifth input can include words such as "end", corresponding to the end instruction, indicating that the user no longer needs the service corresponding to the first voice input.
  • if the user needs the service later, it can be requested again through, for example, a second voice input.
  • the second input, the third input, the fourth input, and the fifth input are any of the following: voice input, click input, and text input.
  • the technical solution provided by the embodiments of the present application allows the user to select a suitable input method according to actual conditions, thereby increasing the applicable scenarios of the embodiments of the present application.
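Since the second through fifth inputs may each arrive as voice, click, or text, a dispatcher that normalizes the three modalities into one payload illustrates the idea. The event shape and field names here are assumptions for the sketch:

```python
def normalize_input(event):
    """Unify the three supported input modalities (voice, click, text)
    into a single text payload that the slot resolver can consume.
    The dict-based event format is hypothetical."""
    kind = event["type"]
    if kind == "voice":
        return event["transcript"]
    if kind == "click":
        return event["item"]  # the candidate item tapped on the first card
    if kind == "text":
        return event["text"]
    raise ValueError(f"unsupported input type: {kind}")
```

Downstream logic then only ever sees a string, regardless of which modality the user chose for the current scenario.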
  • an electronic device including a sound receiving component for receiving a first voice input from a user; a voice analysis component for determining a first slot according to the first voice input; and an instant response component for, when it is determined that the first slot includes at least two candidate items, extending a sound reception countdown duration from a first time to a second time, wherein the sound reception countdown duration is the time during which the electronic device remains in a reception state after receiving the first voice input from the user.
  • the instant response component is further used to call the display component of the electronic device to display a first card on the first interface, and the first card is used to prompt the user to determine the target candidate item for the first slot, and the first card includes the at least two candidates; the instant response component is also used to determine the target candidate item according to the at least two candidates or determine the target candidate item according to the second input of the user during the first reception period, and the first reception period is the time period during which the electronic device is in the reception state after receiving the first voice input from the user.
  • the at least two candidate items include a default candidate item.
  • the instant response component is specifically used to determine the target candidate item based on the default candidate item when the electronic device does not receive the user's input during the first sound reception period.
  • the second input is used to select the target candidate item from the at least two candidate items; or, the second input is used to input the target candidate item, and the target candidate item does not belong to the at least two candidate items.
  • the immediate response component is further used to: after the electronic device detects that the time of the first pause is greater than a preset threshold, determine that the first slot includes the at least two candidate items.
  • the instant response component is also used to call the display component of the electronic device to display prompt information on the first interface, and the prompt information is used to prompt the user of the default execution action corresponding to the default candidate item.
  • the instant response component is also used to call the display component to display a control on the first interface, and the control is used to prompt the user of the remaining value of the reception countdown duration.
  • the sound receiving component is further used to receive a third input from the user during the first reception period, the third input corresponding to a first instruction, and the first instruction is used to extend the reception countdown duration; the instant response component is further used to call the sound receiving component to extend the reception countdown duration from the second time to a third time based on the third input.
  • the second input is any one of the following: voice input, click input, and text input.
  • the electronic device also includes a dialogue management component, and the dialogue management component is used to send a first request based on the target candidate item, and the first request is used to request provision of a service corresponding to the target candidate item.
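The first request that the dialogue management component sends can be sketched as a simple payload builder. The field names are assumptions chosen for illustration; the patent does not specify a request format:

```python
def build_first_request(intent, target_candidate):
    """Sketch of the first request sent to the application that serves
    the user's intent, carrying the resolved slot value so the service
    corresponding to the target candidate item can be provided."""
    return {
        "intent": intent,
        "slot_value": target_candidate,
        "action": "provide_service",
    }
```

In practice this payload would go to whichever application matches the intent of the first voice input, completing the round without further dialogue.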
  • a voice interaction device comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, so that the device performs the method described in the first aspect or any one of the implementations of the first aspect.
  • a computer-readable medium stores a program code, and when the program code is executed on a computer, the computer executes the method described in the first aspect or any one of the implementations of the first aspect.
  • a computer program product comprises: a computer program code, which, when the computer program product runs on a computer, enables the computer to execute the method described in the first aspect or any one of the implementations of the first aspect.
  • a chip comprising a processor and a data interface, wherein the processor reads instructions stored in a memory through the data interface to execute the method in the first aspect and any possible implementation manner of the first aspect.
  • the chip may further include a memory, wherein the memory stores instructions.
  • the processor is used to execute instructions stored in the memory. When the instructions are executed, the processor is used to execute the method in the above-mentioned first aspect and any possible implementation manner of the first aspect.
  • the above chip may specifically be a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • FIG. 1 is a schematic diagram of the hardware structure of an electronic device applicable to an embodiment of the present application.
  • FIG. 2 is a software structure block diagram of the electronic device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a multi-round dialogue scenario.
  • FIG. 4 is a schematic flowchart of a voice interaction method provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an interface of a voice interaction method provided in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a voice interaction method provided in an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a voice interaction method provided in an embodiment of the present application.
  • FIG. 8 is a structural diagram of a voice interaction system provided in an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a voice interaction device provided in an embodiment of the present application.
  • FIG. 10 shows a schematic diagram of an electronic device provided in an embodiment of the present application.
  • A and/or B can represent: A exists alone, A and B exist at the same time, or B exists alone, where A and B can be singular or plural.
  • the character "/" generally indicates that the objects associated before and after it are in an "or" relationship.
  • references to "one embodiment” or “some embodiments” etc. described in this specification mean that a particular feature, structure or characteristic described in conjunction with the embodiment is included in one or more embodiments of the present application.
  • the phrases "in one embodiment", "in some embodiments", "in some other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized.
  • the terms “including”, “comprising”, “having” and their variations all mean “including but not limited to”, unless otherwise specifically emphasized in other ways.
  • the electronic device can be a portable electronic device that also includes other functions such as a personal digital assistant and/or a music player function, such as a mobile phone, a tablet computer, a wearable electronic device with wireless communication function (such as a smart watch), etc.
  • portable electronic devices include, but are not limited to, portable electronic devices equipped with various operating systems.
  • the portable electronic device may also be other portable electronic devices, such as a laptop computer, etc. It should also be understood that in some other embodiments, the electronic device may not be a portable electronic device, but a desktop computer.
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a compass 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or arrange the components differently.
  • the components shown in the figure may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent components or integrated into one or more processors.
  • the electronic device 100 may also include one or more processors 110.
  • the controller may generate an operation control signal according to the instruction opcode and the timing signal to complete the control of fetching and executing instructions.
  • a memory may be further provided in the processor 110 for storing instructions and data.
  • the memory in the processor 110 may be a cache memory.
  • the memory may store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the electronic device 100 in processing data or executing instructions.
  • processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM card interface, and/or a USB interface, etc.
  • the USB interface 130 is an interface that complies with the USB standard specification, and specifically can be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and a peripheral device.
  • the USB interface 130 can also be used to connect headphones to play audio through the headphones.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music and videos can be stored in the external memory card.
  • the internal memory 121 can be used to store one or more computer programs, which include instructions.
  • the processor 110 can run the above instructions stored in the internal memory 121.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the program storage area can store an operating system; the program storage area can also store one or more applications (such as a gallery, contacts, etc.).
  • the data storage area can store data (such as photos, contacts, etc.) created during the use of the electronic device 100.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more disk storage components, a flash memory component, a universal flash storage (UFS), an embedded multimedia card (eMMC), etc.
  • the processor 110 can enable the electronic device 100 to execute the method provided in the embodiment of the present application, as well as other applications and data processing by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor 110.
  • the electronic device 100 can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
  • the electronic device 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converts it into an image visible to the naked eye.
  • the ISP can also perform algorithm optimization on the noise, brightness, and skin color of the image. The ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP can be set in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and projects it onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to be converted into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard RGB, YUV or other format.
  • the electronic device 100 may include one or more cameras 193.
  • the digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital videos.
  • the electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record videos in a variety of coding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural network (NN) computing processor.
  • through the NPU, intelligent-cognition applications of the electronic device 100 can be implemented, such as image recognition, face recognition, speech recognition, text understanding, and 3D model reconstruction.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini LED, a micro LED, a micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include one or more display screens 194.
  • the display screen 194 in FIG. 1 can be bent.
  • the display screen 194 can be bent, which means that the display screen can be bent to any angle at any position and can be kept at the angle.
  • the display screen 194 can be folded in half from the middle to the left or right. It can also be folded in half from the middle to the top or bottom.
  • the display screen 194 of the electronic device 100 may be a flexible screen.
  • flexible screens have attracted much attention for their unique characteristics and great potential.
  • flexible screens have the characteristics of strong flexibility and bendability, which can provide users with a new interaction method based on the bendable characteristics and meet users' more needs for electronic devices.
  • the foldable display screen on the electronic device can be switched between a small screen in a folded state and a large screen in an unfolded state at any time.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
  • FIG. 2 is a software structure diagram of the electronic device 100 of an embodiment of the present application.
  • the layered architecture divides the software into several layers, each layer has a clear role and division of labor.
  • the layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, authentication module and execution module.
  • the authentication module is used to authenticate users, for example, by voiceprint, fingerprint, iris, etc.
  • the execution module is used to launch the application in the lock screen state and execute the user's input (such as voice commands, gesture operations, etc.).
  • the application framework layer provides application programming interface (API) and programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a controlled module, and the like.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • the data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying images, etc.
  • the view system can be used to build applications.
  • a display interface can be composed of one or more views.
  • a display interface including a text notification icon can include a view for displaying text and a view for displaying images.
  • the phone manager is used to provide communication functions of the electronic device 100, such as management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for applications, such as localized strings, icons, images, layout files, video files, and so on.
  • the notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages and can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also be a notification that appears in the system top status bar in the form of a chart or scroll bar text, such as notifications of applications running in the background, or a notification that appears on the screen in the form of a dialog window. For example, a text message is prompted in the status bar, a prompt sound is emitted, an electronic device vibrates, an indicator light flashes, etc.
  • the controlled module is used to manage the permissions of applications running in the lock screen state so that the applications can only use the registered permissions.
  • the system library can include multiple functional modules, such as surface manager, media library, 3D graphics processing library (such as OpenGL ES), 2D graphics engine (such as SGL), etc.
  • the surface manager is used to manage the display subsystem and provide the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG and PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis and layer processing.
  • a 2D graphics engine is a drawing engine for 2D drawings.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
  • the voice assistant application in the application package is a type of human-computer interaction application.
  • the voice assistant application may also be called a smart assistant application.
  • Human-computer interaction applications can also be called human-computer interaction robots, human-computer dialogue robots or chat robots (ChatBOT), etc.
  • Human-computer interaction applications are currently widely used in various electronic devices such as mobile phones, tablets, smart speakers, etc., providing users with intelligent voice interaction methods.
  • Multi-round dialogue is a typical application scenario of human-computer interaction: through multiple rounds of dialogue with the user, the user's true intention can be determined. The multiple rounds of dialogue all relate to the processing of one event. For example, when a user needs to order a meal or buy a ticket, if the user's voice input is inaccurate (for example, a slot is missing or needs to be supplemented), the voice interaction system needs to conduct multiple rounds of dialogue with the user, and the user needs to answer over multiple rounds before the user's true intention can be determined.
  • Figure 3 shows a multi-round dialogue scenario, where the user wants to use an APP such as a voice assistant for navigation.
  • Figure 3 (a) to (d) show the process of voice interaction between the user and the voice interaction system of the electronic device.
  • the user inputs a voice command "Navigate to Xinjiekou", and the electronic device displays an interface 310.
  • the interface 310 may include a dialog box 301 input by the user, and the dialog box 301 displays the text "Navigate to Xinjiekou”.
  • the electronic device can extract from the voice command that the user's intention is "navigation” and the slot information is "Xinjiekou”, and can determine that the third-party application corresponding to the intention is a map software, and then use the intention "navigation” and the slot "Xinjiekou" to send a service request to the map software.
  • If the server corresponding to the map software finds that the slot "Xinjiekou" corresponds to multiple specific addresses, the instruction entered by the user is inaccurate.
  • the server of the map software can send the multiple specific addresses to the voice interaction system.
  • the voice interaction system will then announce to the user "Multiple destinations found, which one do you want to go to", and the interface 310 displayed on the electronic device will display a dialog box 302 and a card 303.
  • the dialog box 302 displays the voice announcement, and the multiple specific addresses are displayed in the card 303, which is used to prompt the user to enter the specific address in the next round of dialogue.
  • the user can input “the first one” by voice in the next round of conversation, and a dialog box 304 will appear on the interface 310 of the electronic device at the same time, displaying the content of the user's voice input in this round.
  • the voice interaction system will announce to the user “Starting to navigate you to Xinjiekou subway station”, and at the same time a dialog box 305 appears on the interface 310 of the electronic device, and the dialog box 305 includes the content announced by the voice interaction system.
  • the voice interaction system will broadcast prompt information that needs to be supplemented or updated by the user in each round of dialogue, such as the content of dialog box 302, so that the voice interaction system can determine the user's true intention. This greatly increases the time spent executing user instructions, and the user experience is poor.
  • the embodiment of the present application provides a voice interaction method, which can reduce the time required for voice interaction and improve the user experience.
  • the voice interaction method includes:
  • Before S410, the user can open an application such as a voice assistant by clicking an application icon or using a voice command, and can then perform voice interaction with the voice interaction system of the electronic device.
  • the electronic device may display a voice interaction interface 510 (first interface).
  • the interface 510 may include controls 511, 512, and 513.
  • Control 511 may be a "recommendation" control. When the user clicks on control 511, different cards may appear on the interface 510, prompting the user with voice commands that can be issued, such as "What's the weather like today?", "Tell a joke," etc.
  • Control 513 may be an "account" control. After the user clicks on this control, the user may set the voice used in the voice interaction system, or browse the user's voice interaction records, etc.
  • Control 512 may be a voice input control. After the user clicks on the control, the electronic device may receive the sound.
  • the user can click on the control 512, and the electronic device starts to receive sound, thereby collecting the user's first voice input.
  • the control 512 can change from the state shown in FIG. 5(a) to the state shown in FIG. 5(b), indicating that the electronic device is receiving sound.
  • the user can input "Navigate to Xinjiekou" by voice, and a dialog box 513 can appear on the interface 510, and the dialog box 513 includes the user's first voice input.
  • the first voice input is a command input by the user through voice.
  • the voice interaction system of the electronic device can parse the first voice input to obtain the intent and slot.
  • the intent can be understood as an intent classification, which can correspond to a specific application, and the slot is a keyword related to the intent.
  • Each intent can correspond to one or more slots. For example, three slots are defined under the intent of "booking a flight", namely "departure time", "origin" and "destination". To fully cover the content that users need to enter to book a flight, more slots can be included, such as the number of passengers, the airline, the departure airport, the landing airport, etc.
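As an illustrative sketch only (not part of the disclosed embodiment), the intent-slot structure described above could be represented as follows; the class names `Intent` and `Slot` are assumptions made for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Slot:
    name: str                       # e.g. "destination"
    value: Optional[str] = None     # filled in once the user resolves it
    candidates: List[str] = field(default_factory=list)

@dataclass
class Intent:
    name: str                       # e.g. "navigation", "book_flight"
    slots: Dict[str, Slot] = field(default_factory=dict)

# The "booking a flight" intent with its three defined slots:
flight = Intent("book_flight", {
    "departure_time": Slot("departure_time"),
    "origin": Slot("origin"),
    "destination": Slot("destination"),
})
assert set(flight.slots) == {"departure_time", "origin", "destination"}
```

More slots (number of passengers, airline, airports) would simply be further entries in the `slots` dictionary.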
  • the user's first voice input includes a first slot, and the first voice input may also include other slots, which is not limited in the present application.
  • the electronic device further determines that the first slot includes at least two candidate items.
  • the at least two candidate items may be lower-level concepts of the first slot, or multiple complete concepts of the first slot, or different meanings of the first slot.
  • the first voice input is "Navigate to Xinjiekou”
  • the first slot is Xinjiekou
  • the place "Xinjiekou” is not a real-time location.
  • the candidate items may include "Xinjiekou Subway Station", “Xinjiekou Bus Station”, “Xinjiekou Pedestrian Street”, etc.
  • the first voice input is "Call Zhang Yu" and the user's phone book includes contacts "Zhang Yuyi” and "Zhang Yuer”, then the candidate items include "Zhang Yuyi” and "Zhang Yuer”.
  • the first voice input may be "Navigate to Gulou District", and there are Gulou Districts in many cities across the country, such as Nanjing Gulou District and Xuzhou Gulou District, then the candidate items include the multiple specific Gulou Districts.
  • the electronic device can determine that the first slot includes at least two candidate items after detecting that the time of the first pause is greater than a preset threshold.
  • For example, the user's first voice input is "navigate to Xinjiekou", and after the voice input ends, the electronic device can detect the user's pause (the first pause).
  • the time of the first pause can be greater than the preset threshold.
  • the preset threshold can be 50 milliseconds, 100 milliseconds, 200 milliseconds, etc.
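The pause check described above could be sketched as follows, assuming the device tracks the timestamp of the last detected speech frame; the function name and parameters are illustrative, and the 100 ms default mirrors one of the example thresholds:

```python
def first_pause_exceeds(last_voice_ts_ms: int, now_ms: int,
                        threshold_ms: int = 100) -> bool:
    """Return True when the silence since the last detected speech
    frame (the "first pause") is longer than the preset threshold."""
    return (now_ms - last_voice_ts_ms) > threshold_ms

# With a 100 ms threshold, a 150 ms pause triggers the check,
# while a 50 ms pause does not:
assert first_pause_exceeds(1000, 1150) is True
assert first_pause_exceeds(1000, 1050) is False
```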
  • the electronic device can parse the first voice input to obtain the user's intention and the corresponding slots (including the first slot), and pass the intention and the first slot corresponding to the first voice input to the server of the target application related to the intention; the server of the target application can then return the corresponding query result.
  • After the server of the target application determines that the first slot includes at least two candidates, in order to provide services to the user, the server of the target application will pass the at least two candidates to the voice interaction system of the electronic device, so that the voice interaction system can determine, based on the result returned by the application, that the first slot includes at least two candidates.
  • the electronic device will extend the countdown for receiving sound to a second time, and the second time can be, for example, 2 seconds to 5 seconds, such as 2 seconds, 3 seconds, 4 seconds or 5 seconds.
  • control 515 corresponds to the remaining value of the sound reception countdown duration, and the electronic device extends the sound reception countdown duration to 2 seconds (the second time) for the user to make supplementary input.
  • If the first slot does not include at least two candidates, the target application or the server of the target application will convey the true intention corresponding to the first voice input to the voice interaction system of the electronic device, so that the electronic device stops receiving sound within the first time; the first time can be, for example, 500 milliseconds.
  • When the first slot includes at least two candidates, the electronic device will not make the user determine the target candidate through a next round of dialogue; instead, it will extend the sound reception countdown to the second time, making it convenient for the user to supplement content and reducing the time the user needs to wait. Specifically, the user can be prompted to determine the target candidate through the subsequent step S430.
  • the starting time point for calculating the sound reception countdown duration in the embodiment of the present application may be the moment when the electronic device detects that the time of the first pause is greater than a preset threshold. Thereafter, if the electronic device determines in the first sound receiving period that the first slot does not include at least two candidate items (the first slot is clear), sound reception may be stopped after a first time has passed after the first pause. If the electronic device determines in the first sound receiving period that the first slot includes at least two candidate items, the first time may be extended to a second time, and the starting calculation time of the second time is still the moment when the electronic device detects that the time of the first pause is greater than the preset threshold.
  • the starting time point for calculating the sound reception countdown duration in the embodiment of the present application may also be after the electronic device (or the voice interaction system of the electronic device) determines whether the first slot includes at least two candidates. If the electronic device determines that the first slot includes at least two candidates, the electronic device may extend the first time to the second time, and the starting time point for calculating the second time is consistent with that of the first time. If the electronic device determines that the first slot is clear (does not include at least two candidates), the sound reception process may be stopped after the first time.
  • the sound reception countdown duration in the embodiment of the present application is not the total reception time.
  • the electronic device is also in the reception process while receiving the user's first voice input; this overall duration can be called the total reception time. Since the time for the user to input the first voice will change with the length of the first voice input, the total reception time will also change accordingly.
  • the sound reception countdown duration of the present application only considers the time during which the electronic device remains in the reception state after the first voice input has been entered.
  • Extending the sound reception countdown duration means that the total reception time is also extended, and the extension of the total reception time is consistent with the extension of the countdown duration. Therefore, extending the countdown duration from the first time to the second time is equivalent to extending the total reception time by a first extension amount, which is the difference between the second time and the first time.
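The relationship above can be checked with a small numeric sketch; the figures below assume, as in the examples, a first time of 0.5 s and a second time of 2 s (the function name is illustrative):

```python
def extend_total_time(total_ms: int, first_ms: int, second_ms: int) -> int:
    # Extending the countdown from first_ms to second_ms lengthens the
    # total reception time by exactly the same difference.
    return total_ms + (second_ms - first_ms)

# Suppose speaking the first voice input took 2.5 s, so the total
# reception time was 2.5 s + 0.5 s countdown = 3 s. Extending the
# countdown to 2 s makes the total 4.5 s:
assert extend_total_time(3000, 500, 2000) == 4500
```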
  • S430: displaying a first card on the first interface, where the first card is used to prompt the user to determine a target candidate item for the first slot, and the first card includes the at least two candidate items.
  • the voice interaction system may display a first card on the first interface, wherein the first card 514 includes the at least two candidate items.
  • FIG5(c) shows that the first card 514 includes the candidate items: "Xinjiekou Subway Station", "Xinjiekou Bus Station” and "Xinjiekou Pedestrian Street”.
  • the first card 514 may also include at least one candidate statement, such as "the way...road” and “don't take the...road” shown in (c) of Figure 5.
  • the user can determine the target candidate according to the prompt of the first card 514, and can also supplement the corresponding statement, so that the provided service can better match the user's needs and enhance the user's experience.
  • S440: determining a target candidate item based on the at least two candidate items, or determining a target candidate item based on a second input of the user during a first sound receiving period, where the first sound receiving period is a period of time during which the electronic device is continuously in a sound receiving state after receiving the first voice input of the user.
  • the second input is used to select the target candidate from the at least two candidates; or the second input is used to input the target candidate, and the target candidate does not belong to the at least two candidates.
  • the user can determine the target candidate according to the prompt of the first card 514.
  • the user can determine the target candidate from the at least two candidates in the first sound receiving period by means of click input (clicking any one of the at least two candidates on the screen), voice input ("item x" or directly speaking the candidate), text input, etc.
  • the user can directly input the target candidate through the second input (for example, input "Xinjiekou International Cinema").
  • the electronic device can perform corresponding actions according to the target candidate and provide the user with services corresponding to the first voice input.
  • the at least two candidate items may include a default candidate item.
  • the default candidate item may be distinguished from other candidate items by a different font, background color, position or number. For example, for the first card 514 shown in FIG. 5(c), the first candidate item "Xinjiekou Subway Station" among the at least two candidate items is the default candidate item.
  • the default candidate may be a candidate with higher popularity among at least two candidates.
  • the server of the target application may determine the default candidate based on the services provided to different users, and transmit the default candidate to the voice interaction system of the electronic device. In this way, the default candidate provided is more likely to be the target candidate that the user actually intends, thereby reducing the possibility of unnecessary operations by the user and improving the user experience.
  • when the electronic device does not receive any input from the user during the first sound receiving period, the electronic device may determine the target candidate item based on the default candidate item.
  • the electronic device can determine the target candidate item according to the user's second input; if the user's input is not received in the first receiving period, the electronic device can determine the default candidate item as the target candidate item.
  • the default candidate item can be determined as the target candidate item in the following manner: the user can click on "Xinjiekou Subway Station” on the first card 514 on the interface 510 to determine the target candidate item.
  • the user can determine the target candidate item as "Xinjiekou Subway Station” by voice input or text input of "Xinjiekou Subway Station” or "first item” in the second time;
  • the user can make no input in the first sound receiving period, and the default candidate item "Xinjiekou Subway Station" can be directly determined as the target candidate item by the voice interaction system;
  • the user's second input can include instruction words such as "determine” and "confirm", which can correspond to the third instruction, and the instruction can determine that the default candidate item is the target candidate item.
  • the target candidate is not the default candidate
  • the user can still determine the target candidate by clicking on the corresponding option, text input, voice input, etc.
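The fallback logic above (an explicit user input always wins; otherwise the default candidate is used) can be sketched as follows; this is a simplified illustration, and the function and variable names are assumptions:

```python
from typing import List, Optional

def resolve_target(user_choice: Optional[str],
                   candidates: List[str],
                   default: str) -> str:
    """Pick the target candidate for the first slot.

    user_choice may be an item from the candidate list, a brand-new
    value spoken or typed by the user, or None if no input arrived
    during the first sound receiving period."""
    if user_choice is None:
        return default        # no input: fall back to the default candidate
    return user_choice        # explicit input always wins

cands = ["Xinjiekou Subway Station", "Xinjiekou Bus Station"]
# No input: the default candidate becomes the target.
assert resolve_target(None, cands, cands[0]) == "Xinjiekou Subway Station"
# A new value outside the list is also accepted as the target.
assert resolve_target("Xinjiekou Pedestrian Street", cands, cands[0]) \
       == "Xinjiekou Pedestrian Street"
```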
  • the electronic device can determine the target candidate through the user's voice input or text input.
  • the user input dialog box 513 displayed on the first interface 510 can be updated according to changes in the currently determined target candidate. For example, in FIG. 5(c), when the user has not yet determined the target candidate, the "Navigate to Xinjiekou" in the dialog box 513 can be updated to "Navigate to Xinjiekou Subway Station" according to the default candidate. If the target candidate subsequently determined by the user is different from the default candidate, the dialog box can continue to be updated. For example, if the target candidate selected by the user is "Xinjiekou Pedestrian Street", the dialog box 513 can be updated to "Navigate to Xinjiekou Pedestrian Street".
  • the voice interaction system may receive a fourth input from the user, which may correspond to a cancel instruction, and determine, based on the fourth input, that the default candidate is not the target candidate.
  • the fourth input may include a keyword such as "cancel", which may correspond to a cancel instruction for canceling the selected candidate (or default candidate).
  • the electronic device can display a control on the first interface to prompt the user with the remaining value of the sound reception countdown, so that the user can judge whether the remaining reception time is sufficient to make a second input to determine the target candidate, and can input an instruction in time to extend the countdown when the time is insufficient.
  • a control 515 is displayed on the first interface 510, which shows the remaining value of the sound reception countdown.
  • the control need not be text showing the remaining value of the countdown as shown in FIG. 5(c); it may also be an animation, such as an animation whose shape changes with the remaining value of the countdown.
  • the user can also make a third input during the first sound receiving period, and the third input corresponds to a first instruction (a "wait instruction"), which is used to extend the sound reception countdown.
  • the third input may include words such as "wait", "hold on" or "wait a moment", and these words may correspond to the wait instruction.
  • the electronic device may trigger the wait instruction, thereby extending the sound reception countdown accordingly, from the second time to a third time.
  • the extension of the sound reception countdown described in the present application does not extend the remaining reception time to a certain value, but extends the first sound receiving period to a certain value. If the first time is 0.5 seconds, the second time is 2 seconds, and the third time is 4 seconds, then when the countdown is extended from 2 seconds to 4 seconds, the remaining value of the countdown may be 1 second, and the first interface will add 2 seconds to the remaining value and display 3 seconds; in fact, however, the period from the electronic device receiving the first voice input to the end of reception is still 4 seconds.
  • the sound reception countdown can be extended by inputting a voice corresponding to the first instruction.
  • the electronic device can extend the sound reception countdown by 2 seconds.
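Following the numeric example above (first time 0.5 s, second time 2 s, third time 4 s), the remaining value shown on the interface after a wait instruction could be computed as sketched below; the helper name is an assumption:

```python
def new_remaining_ms(remaining_ms: int, old_period_ms: int,
                     new_period_ms: int) -> int:
    # The countdown *period* is extended, not the remainder itself:
    # the displayed remainder grows by the difference between the
    # new and old countdown periods.
    return remaining_ms + (new_period_ms - old_period_ms)

# 1 s left on a 2 s countdown, period extended to 4 s: display 3 s.
assert new_remaining_ms(1000, 2000, 4000) == 3000
```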
  • the electronic device may also display a prompt message on the first interface to prompt the user of the default action corresponding to the default candidate.
  • the electronic device will determine the default action based on the default candidate.
  • the default action can be displayed on the second card 516 as shown in (c) of Figure 5 to prompt the user of the query request that is about to be executed.
  • the second card 516 will display "Navigation to Xinjiekou subway station will be executed soon". If the user does not agree with the default action (does not agree with the default candidate), the user can enter the command "Cancel" and enter a new target candidate.
  • the first interface 510 will display the action to be executed as the content of the second card 516. In other words, what is displayed on the first interface is the action to be executed, which can be the default action; or the default action can be displayed first and then become the target action (corresponding to the target candidate) according to the user's input.
  • the electronic device can also receive a fifth input from the user, which can correspond to a third instruction.
  • the third instruction can be an end instruction for ending the execution of the action corresponding to the first voice input. After receiving the fifth input, the electronic device can end the execution of the current action and stop receiving the sound.
  • the second input, the third input, the fourth input and the fifth input mentioned above can be one of voice input, selection input (click input) or text input, so that the user can choose a suitable input method.
  • the above takes the first slot including at least two candidate items as an example to introduce the voice interaction method provided by the embodiment of the present application.
  • the candidate items of the second slot and the third slot can also be displayed to the user according to the technical solution of the embodiment of the present application, and the sound reception countdown time is extended to prompt the user to determine the second target slot and the third target slot, thereby reducing the interaction time with the user and improving the user experience.
  • At least two candidate items, candidate statements, and default actions in the above embodiments may be displayed on one card or on multiple cards.
  • the sound reception countdown can be extended, and the user can be prompted to select the application they want to use through cards or options on the display interface, thereby reducing the interaction time with the user and improving the user experience.
  • the user's intention is unclear, for example, it can be interpreted as two or more candidate intentions, the user can be prompted to select the true intention by including cards with multiple candidate intentions on the display interface, thereby reducing the interaction time with the user and improving the user experience.
  • the electronic device will also send a first request corresponding to the target candidate item based on the target candidate item to provide the user with a service result corresponding to the first voice input.
  • FIG. 6 shows a flow chart of a voice interaction method provided in an embodiment of the present application.
  • S602 The user performs voice input.
  • the sound receiving component receives the user's input
  • the speech analysis component parses the intent and slot.
  • the immediate response component determines whether the instruction is complete and accurate.
  • a complete and accurate instruction can be parsed into a unique execution action.
  • the first slot input by the user may include at least two candidate items, and the immediate response component may instruct the display component to display an entity completion list, which may include candidate items and candidate statements.
  • If the instruction is complete and accurate, the intended operation can be performed directly according to the instruction input by the user.
  • If the user makes no supplementary input, the default action corresponding to the default candidate item in the entity completion list can be executed.
  • the immediate response component may determine whether the user's additional input includes or corresponds to a special instruction based on the result of the analysis of the user's additional input by the voice analysis component.
  • If the supplementary input corresponds to a wait instruction, the reception time can be extended, and the process returns to step S610 to determine whether the user provides supplementary input within the extended reception time.
  • If the supplementary input corresponds to a cancel instruction, the default candidate item or the selected candidate item can be canceled, and the process returns to step S610 to determine whether the user continues to supplement the input of the target candidate item corresponding to the first slot.
  • If the supplementary input corresponds to a selection instruction, the first slot can be filled with the candidate item selected by the user.
  • For example, the supplementary input can be "first item"; the target candidate item in the entity completion list is determined through this supplementary input, and the intended operation is performed according to the target candidate item.
  • If the supplementary input corresponds to the end instruction, reception is ended.
  • If the supplementary input corresponds to a confirmation instruction, the slot that the user has selected or the default slot is confirmed, and the intended operation is performed according to that slot.
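The branching in the steps above (wait / cancel / select / end / confirm) amounts to a small dispatcher. The sketch below assumes the instruction keyword has already been extracted by the voice analysis component; the state keys, the 2-second extension, and the returned action names are all illustrative assumptions, not part of the disclosure:

```python
def dispatch(instruction: str, state: dict) -> str:
    """Return the next action for the immediate response component."""
    if instruction == "wait":
        state["countdown_ms"] += 2000       # extend reception, back to S610
        return "await_more_input"
    if instruction == "cancel":
        state["selected"] = None            # drop the default/selected candidate
        return "await_more_input"
    if instruction == "select":
        state["slot"] = state["selected"]   # fill slot with the chosen candidate
        return "execute_intent"
    if instruction == "end":
        return "stop_reception"             # end reception without executing
    if instruction == "confirm":
        state["slot"] = state["selected"] or state["default"]
        return "execute_intent"
    return "await_more_input"               # not a special instruction

s = {"countdown_ms": 2000, "selected": "first item", "default": "d"}
assert dispatch("wait", s) == "await_more_input" and s["countdown_ms"] == 4000
assert dispatch("end", s) == "stop_reception"
```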
  • FIG. 7 shows a schematic flow chart of a voice interaction method provided in an embodiment of the present application, the method comprising:
  • the audio receiving component receives a first voice input from a user.
  • the first voice input is “navigate to Xinjiekou”.
  • the first voice input received by the sound receiving component is uploaded to the voice analysis component.
  • the speech analysis component determines the intent and slot of the first speech input.
  • the number of slots may be multiple, and the slots include the first slot.
  • the speech analysis component sends the intent and slot to the immediate response component.
  • the slots may include a first slot.
  • the intent is "navigation".
  • the first slot is "Xinjiekou".
  • the immediate response component determines that the first slot includes at least two candidate items.
  • the immediate response component may send the intent and the first slot to an application corresponding to the intent for query, so that the application returns at least two candidates for the first slot.
  • the at least two candidates include a default candidate (Xinjiekou subway station).
  • the application also returns one or more candidate statements.
  • the immediate response component determines a "slow stop" and sends an instruction to the sound receiving component to extend the sound reception time.
  • a slow stop means that, after the immediate response component determines that the first slot includes at least two candidates, it instructs the sound receiving component not to stop immediately: it sends an instruction to extend the reception time, allowing the user to supplement the input within the extended reception time.
  • the immediate response component sends the intent and default candidates to the dialog management component.
  • the dialog management component returns a default action to the immediate response component based on the intent and the default candidate.
  • the dialog management component returns multiple candidate statements.
  • the immediate response component sends at least two candidate items (optional, including default candidate items), candidate statements, and default actions to the display component, which displays them.
  • the display component can also display a reception countdown (the remaining value of the sound reception time).
  • the third input may include words such as "wait" or "wait a minute"; the third input is sent to the immediate response component after being parsed by the voice analysis component.
  • the third input may indicate that the user believes the default candidate (Xinjiekou subway station) is not the slot the user really needs, and that the remaining reception time is not enough to think of or input the real target candidate; this triggers a wait instruction, which extends the reception time.
  • the immediate response component analyzes the third input and instructs the sound receiving component to extend the sound receiving time.
  • the second input may include the target candidate item (Xinjiekou Pedestrian Street), or may also be the serial number of the target candidate item on the display interface (for example, the third item), etc.
  • the immediate response component determines that the target candidate is accurate, triggers a "quick stop", and instructs the sound receiving component to stop receiving sound.
  • the sound receiving component may stop receiving the sound within 500 milliseconds, for example.
  • the immediate response component sends the intent and target candidates to the dialog management component.
  • the DM component calls a third-party application based on the intent and target candidates to execute the user's operation.
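The slow stop / quick stop behavior in the flow above can be sketched as follows. The function names and the extension mechanism are illustrative assumptions; only the roughly 500-millisecond quick-stop figure comes from the description.

```python
def decide_stop_mode(candidates, user_confirmed_target):
    """Choose how the sound receiving component should stop recording.

    - "quick_stop": the target candidate is settled; stop within ~500 ms.
    - "slow_stop": the first slot is ambiguous; keep listening so the
      user can supplement the input with the real target candidate.
    """
    if user_confirmed_target or len(candidates) == 1:
        return ("quick_stop", 0.5)   # stop within 500 milliseconds
    return ("slow_stop", None)       # extend the reception countdown instead

def extend_reception(deadline, extension_seconds):
    """Push the reception deadline back when a slow stop or wait instruction occurs."""
    return deadline + extension_seconds
```

With the running example, two "Xinjiekou" candidates and no confirmation yield a slow stop, while a confirmed target yields a quick stop.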
  • FIG8 shows a voice interaction system for executing an embodiment of the present application.
  • the voice interaction system includes a sound receiving component 801, a voice analysis component 802, an immediate response component 803, and a dialogue manager (DM) component 804, wherein the voice analysis component 802 is composed of a voice activity detection (VAD) subcomponent 8021, an automatic speech recognition (ASR) subcomponent 8022, and a natural language understanding (NLU) subcomponent 8023.
  • the sound receiving component 801 is used to receive the user's voice input (first voice input), and the voice input received by the sound receiving component 801 will be analyzed and processed by the voice analysis component.
  • the VAD subcomponent 8021 can detect whether human speech is present and detect pauses in the speech input, i.e., it analyzes whether the input speech is voiced, silent, or continuing.
  • the main function of the ASR subcomponent 8022 is to recognize the user's voice as voice text, thereby converting the user's voice input into text to facilitate the NLU module to understand the text.
  • the main function of the NLU subcomponent 8023 is to understand the user's intent based on the voice text, perform slot analysis, and convert the voice text into structured information that the machine can understand. In other words, the voice text is converted into executable intents and slots. The intents and slots will be used to complete the user's demands through appropriate applications.
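The NLU step above can be sketched with a toy parser that turns recognized voice text into structured intent and slot information. This handles only the "navigate to <place>" pattern from the running example via a hypothetical regular expression; a real NLU subcomponent would use trained language-understanding models.

```python
import re

def parse_utterance(text):
    """Toy NLU: map recognized voice text to an executable intent and slots."""
    m = re.match(r"navigate to (?P<place>.+)", text.strip(), re.IGNORECASE)
    if m:
        # the place name fills the first slot of the "navigation" intent
        return {"intent": "navigation", "slots": {"first_slot": m.group("place")}}
    return {"intent": "unknown", "slots": {}}
```

For the first voice input "navigate to Xinjiekou", this yields the intent "navigation" with "Xinjiekou" in the first slot, matching the example in the flow of FIG. 7.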
  • the immediate response component 803 is used to query the intent and slot parsed by the NLU subcomponent 8023 when the VAD subcomponent 8021 detects a pause longer than a preset threshold (for example, 50 milliseconds); the target application related to the intent performs the query and returns the query result according to the query request of the immediate response component 803.
  • the immediate response component 803 can determine that the first slot includes at least two candidates according to the query result, and send an instruction to the sound receiving component 801 to extend the sound receiving time.
  • the immediate response component 803 can also determine the default candidate and a list of candidate statements based on the results returned by the target application, and send the default candidate to the DM component 804, which returns the default action.
  • the immediate response component 803 can return the at least two candidates, the candidate statements, and the default action to the display component of the electronic device for display.
  • the display component displays at least two candidate items and a default action on a first card of the first interface.
  • the first card is used to prompt the user to determine the target candidate item.
  • the first interface can also display the candidate statements and the remaining value of the recording time.
  • the user can determine the target candidate according to the first card. If the user disagrees with the default candidate and the default action, the user can input to determine the target candidate.
  • the immediate response component 803 can send the target candidate to the DM component, and the DM component returns the target action and displays the target action on the interface of the electronic device.
  • the immediate response component can also perform different operations according to the instructions described above.
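The pause detection that triggers the immediate response component can be sketched with a minimal energy-based VAD. The frame length, energy threshold, and the 50-millisecond pause threshold are illustrative assumptions (the threshold value echoes the example given for the preset threshold).

```python
def detect_pause(frame_energies, frame_ms=20, energy_threshold=0.01,
                 pause_threshold_ms=50):
    """Return True if the trailing silence exceeds the pause threshold.

    `frame_energies` is a list of per-frame energy values; a frame whose
    energy is below `energy_threshold` is treated as silence.
    """
    silent_ms = 0
    for energy in reversed(frame_energies):   # walk back from the newest frame
        if energy < energy_threshold:
            silent_ms += frame_ms
        else:
            break                             # speech resumes; stop counting
    return silent_ms > pause_threshold_ms
```

When this returns True, the immediate response component would query the intent and slot parsed so far instead of waiting for the utterance to end.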
  • FIG9 shows an electronic device 900 provided in an embodiment of the present application.
  • the electronic device 900 can execute the voice interaction method of FIG4 to FIG7 .
  • the electronic device 900 includes a processing unit 910 and a display unit 920.
  • the electronic device 900 includes: a processing unit 910, used to receive a first voice input from a user, the first voice input includes a first slot; the processing unit 910, also used to extend the reception countdown duration from a first time to a second time when the first slot includes at least two candidate items, the reception countdown duration is the time during which the electronic device is continuously in a reception state after receiving the first voice input from the user; a display unit 920, used to display a first card on a first interface, the first card is used to prompt the user to determine a target candidate item for the first slot, the first card includes the at least two candidate items; the processing unit 910, also used to determine the target candidate item based on the at least two candidate items or to determine the target candidate item based on the user's second input during the first reception period, the first reception period being the time period during which the electronic device is continuously in a reception state after receiving the user's first voice input.
  • the at least two candidate items include a default candidate item.
  • the processing unit 910 is specifically configured to: determine the target candidate item according to the default candidate item when the electronic device does not receive any user input during the first sound reception period.
  • the second input is used to select the target candidate item from the at least two candidate items; or, the second input is used to input the target candidate item, and the target candidate item does not belong to the at least two candidate items.
  • the processing unit 910 is further used to: after the electronic device detects that the time of the first pause is greater than a preset threshold, determine that the first slot includes the at least two candidate items.
  • the processing unit 910 is also used to receive a fourth input from the user during the first reception period, where the fourth input corresponds to a second instruction used to indicate that the default candidate item is not the target candidate item; based on the fourth input, it is determined that the default candidate item is not the target candidate item.
  • the display unit 920 is also used to display prompt information on the first interface to prompt the user of the default action corresponding to the default candidate item.
  • the processing unit 910 is further configured to determine the default action according to the default candidate item.
  • the display unit 920 is also used to display a control on the first interface, and the control is used to prompt the user of the remaining value of the reception countdown time.
  • the processing unit 910 is also used to receive a third input from the user during the first reception period, where the third input corresponds to a first instruction used to extend the reception countdown duration; according to the third input, the reception countdown duration is extended from the second time to a third time.
  • the processing unit 910 is further used to receive a fifth input from the user during the first reception period, where the fifth input corresponds to a third instruction used to end the execution of the action corresponding to the first voice input.
  • the second input, the third input, the fourth input or the fifth input is any one of the following: voice input, click input and text input.
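The countdown extensions described for the processing unit 910 (first time to second time when the slot is ambiguous, second time to third time on a wait instruction) can be sketched as a small timer model. The class name and the concrete durations are illustrative assumptions.

```python
class ReceptionCountdown:
    """Models the countdown that keeps the electronic device in the reception state."""

    def __init__(self, first_time):
        self.remaining = first_time

    def extend(self, new_time):
        """Extend the countdown, e.g. from the first time to the second time
        when the first slot turns out to include at least two candidate items,
        or from the second time to the third time on a wait instruction."""
        if new_time > self.remaining:
            self.remaining = new_time

    def tick(self, elapsed):
        """Advance time; return True while the device should keep receiving sound."""
        self.remaining = max(0, self.remaining - elapsed)
        return self.remaining > 0
```

The reception state ends either when the countdown reaches zero or when a quick stop is triggered by a confirmed target candidate.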
  • FIG10 shows an electronic device 1000 provided in an embodiment of the present application, which can be used to execute any of the methods in FIG4 to FIG7.
  • the electronic device 1000 includes: a processor 1020.
  • the processor 1020 is used to implement corresponding control and management operations; for example, the processor 1020 is used to support the electronic device 1000 in executing the methods, operations, or functions of the foregoing embodiments.
  • the electronic device 1000 may also include a memory 1010 and a communication interface 1030; the processor 1020, the communication interface 1030, and the memory 1010 may be connected to each other directly or through a bus 1040.
  • the communication interface 1030 is used to support the electronic device 1000 to communicate with other devices, etc.
  • the memory 1010 is used to store program codes and data of the electronic device 1000.
  • the processor 1020 calls the code or data stored in the memory 1010 to implement the corresponding operation.
  • the memory 1010 may be coupled with the processor or not.
  • the coupling in the embodiments of the present application is an indirect coupling or communication connection between electronic devices, units or modules, which can be electrical, mechanical or other forms, and is used for information exchange between electronic devices, units or modules.
  • the processor 1020 can be a central processing unit, a general processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component or any combination thereof. It can implement or execute various exemplary logic blocks, modules and circuits described in conjunction with the disclosure of this application.
  • the processor can also be a combination that implements a computing function, such as a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
  • the communication interface 1030 can be a transceiver, a circuit, a bus, a module or other types of communication interfaces.
  • the bus 1040 can be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in FIG. 10, but it does not mean that there is only one bus or one type of bus.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present application relate to a voice interaction method, a voice interaction apparatus, and an electronic device. The method comprises: receiving a first voice input from a user, the first voice input comprising a first slot; when the first slot comprises at least two candidate items, extending a reception countdown duration from a first time to a second time, the reception countdown duration being the period during which an electronic device is continuously in a reception state after receiving the user's first voice input; displaying a first card on a first interface, the first card being used to prompt the user to determine a target candidate item for the first slot, and the first card comprising the two or more candidate items; and determining the target candidate item according to the two or more candidate items, or determining the target candidate item during a first reception period according to a second input from the user. According to the voice interaction method provided in the embodiments of the present application, the target candidate item corresponding to the user's real intention can be determined quickly, so that the total time consumed by the user in a voice interaction is reduced and the user experience is improved.
PCT/CN2023/123414 2022-10-14 2023-10-08 Procédé d'interaction vocale, appareil d'interaction vocale et dispositif électronique WO2024078419A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211259240.4A CN117894307A (zh) 2022-10-14 2022-10-14 语音交互方法、语音交互装置和电子设备
CN202211259240.4 2022-10-14

Publications (1)

Publication Number Publication Date
WO2024078419A1 true WO2024078419A1 (fr) 2024-04-18

Family

ID=90640013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/123414 WO2024078419A1 (fr) 2022-10-14 2023-10-08 Procédé d'interaction vocale, appareil d'interaction vocale et dispositif électronique

Country Status (2)

Country Link
CN (1) CN117894307A (fr)
WO (1) WO2024078419A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724775A (zh) * 2019-03-22 2020-09-29 华为技术有限公司 一种语音交互方法及电子设备
CN113555018A (zh) * 2021-07-20 2021-10-26 海信视像科技股份有限公司 语音交互方法及装置
US20220157315A1 (en) * 2020-11-13 2022-05-19 Apple Inc. Speculative task flow execution
CN114582333A (zh) * 2022-02-21 2022-06-03 中国第一汽车股份有限公司 语音识别方法、装置、电子设备及存储介质
CN114627864A (zh) * 2020-12-10 2022-06-14 海信视像科技股份有限公司 显示设备与语音交互方法
US20220188361A1 (en) * 2020-12-11 2022-06-16 Meta Platforms, Inc. Voice-based Auto-Completions and Auto-Responses for Assistant Systems


Also Published As

Publication number Publication date
CN117894307A (zh) 2024-04-16

Similar Documents

Publication Publication Date Title
EP4030422B1 (fr) Procédé et dispositif d'interaction vocale
WO2021052263A1 (fr) Procédé et dispositif d'affichage d'assistant vocal
WO2020221072A1 (fr) Procédé d'analyse sémantique et serveur
CN112567457B (zh) 语音检测方法、预测模型的训练方法、装置、设备及介质
WO2020078299A1 (fr) Procédé permettant de traiter un fichier vidéo et dispositif électronique
US20220147207A1 (en) Application Quick Start Method and Related Apparatus
WO2022052776A1 (fr) Procédé d'interaction homme-ordinateur, ainsi que dispositif électronique et système
US20220214894A1 (en) Command execution method, apparatus, and device
CN115240664A (zh) 一种人机交互的方法和电子设备
WO2020019220A1 (fr) Procédé d'affichage d'informations de services dans une interface de prévisualisation et dispositif électronique
CN111970401B (zh) 一种通话内容处理方法、电子设备和存储介质
WO2022057852A1 (fr) Procédé d'interaction entre de multiples applications
CN111881315A (zh) 图像信息输入方法、电子设备及计算机可读存储介质
US12010257B2 (en) Image classification method and electronic device
CN113806473A (zh) 意图识别方法和电子设备
US20210405767A1 (en) Input Method Candidate Content Recommendation Method and Electronic Device
WO2022135157A1 (fr) Procédé et appareil d'affichage de page, ainsi que dispositif électronique et support de stockage lisible
WO2021196980A1 (fr) Procédé d'interaction multi-écran, dispositif électronique, et support de stockage lisible par ordinateur
WO2021238371A1 (fr) Procédé et appareil de génération d'un personnage virtuel
WO2021249281A1 (fr) Procédé d'interaction pour dispositif électronique, et dispositif électronique
WO2023005711A1 (fr) Procédé de recommandation de service et dispositif électronique
WO2024078419A1 (fr) Procédé d'interaction vocale, appareil d'interaction vocale et dispositif électronique
WO2022135273A1 (fr) Procédé permettant d'invoquer des capacités d'autres dispositifs, dispositif électronique et système
CN112786022A (zh) 终端、第一语音服务器、第二语音服务器及语音识别方法
WO2023045702A1 (fr) Procédé de recommandation d'informations et dispositif électronique

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876621

Country of ref document: EP

Kind code of ref document: A1