CN111599358A - Voice interaction method and electronic equipment


Info

Publication number
CN111599358A
CN111599358A
Authority
CN
China
Prior art keywords
voice
application
user interface
instruction
semantic
Prior art date
Legal status
Withdrawn
Application number
CN202010274489.7A
Other languages
Chinese (zh)
Inventor
陈浩
陈晓晓
熊石一
高璋
殷志华
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010274489.7A
Publication of CN111599358A
Priority to PCT/CN2021/085523 (published as WO2021204098A1)
Legal status: Withdrawn (current)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/451: Execution arrangements for user interfaces
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Abstract

The application discloses a voice interaction method for an electronic device, comprising the following steps: receiving voice from a user and parsing the voice into a semantic instruction; determining, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, the instruction execution subject being any one of the operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device; and causing the instruction execution subject corresponding to the semantic instruction to perform an action based on the semantic instruction. In this way, during voice interaction between the user and the electronic device, the user can control by voice the operating system of the electronic device, the applications in the electronic device, and the user interface components associated with those applications. This greatly expands the application range of voice interaction, makes voice interaction deeper and more comprehensive, and effectively improves the user experience. The application also discloses an electronic device.

Description

Voice interaction method and electronic equipment
Technical Field
The present disclosure relates to the field of communications, and in particular, to a voice interaction method and an electronic device.
Background
Electronic devices generally use a graphical user interface (GUI) to output information to the user of the device. With the increasing variety of device forms, voice interaction has become one of the common interaction modes: the user can interact with the electronic device by voice, and the electronic device can present to the user, through the GUI, the operations and interfaces that respond to the user's voice input.
In existing methods for implementing voice interaction on an electronic device, a system-level instruction set is usually preset in the system of the device; an instruction corresponding to the user's input voice is then extracted from this system-level instruction set, and the system of the electronic device executes the preset action and behavior corresponding to that instruction. Such a voice interaction method operates mainly at the system level and usually supports only system-level voice interaction, such as opening an application or setting an alarm clock; once the application has been opened, it cannot provide further voice control over the user interface components of that application.
Disclosure of Invention
The embodiments of the present application provide a voice interaction method and an electronic device, so that during voice interaction between a user and the electronic device the user can control application user interface components by voice, thereby improving the user experience.
To solve the above technical problem, in a first aspect, an embodiment of the present application provides a voice interaction method for an electronic device, including: receiving voice from a user and parsing the voice into a semantic instruction; determining, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, the instruction execution subject being any one of the operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device; and causing the instruction execution subject corresponding to the semantic instruction to perform an action based on the semantic instruction.
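For illustration only, the following Java sketch (with invented names; it is not the claimed implementation) shows one way the first-aspect flow could be organized: the received voice is parsed into a semantic instruction, an instruction execution subject is chosen among the operating system, the foreground application and the application's voice user interface components, and the chosen subject performs the action.

```java
import java.util.List;

// Illustrative sketch only (invented names, not the claimed implementation):
// the received voice is parsed into a semantic instruction, an instruction
// execution subject is chosen among the operating system, the foreground
// application and the application's voice user interface components, and the
// chosen subject performs the action based on the semantic instruction.
public class VoiceInteractionSketch {

    // Common shape assumed here for the three kinds of execution subjects.
    interface InstructionExecutor {
        boolean canHandle(String semanticInstruction);
        void execute(String semanticInstruction);
    }

    private final InstructionExecutor operatingSystem;
    private final InstructionExecutor foregroundApplication;
    private final List<InstructionExecutor> voiceUserInterfaceComponents;

    public VoiceInteractionSketch(InstructionExecutor operatingSystem,
                                  InstructionExecutor foregroundApplication,
                                  List<InstructionExecutor> voiceUserInterfaceComponents) {
        this.operatingSystem = operatingSystem;
        this.foregroundApplication = foregroundApplication;
        this.voiceUserInterfaceComponents = voiceUserInterfaceComponents;
    }

    // Placeholder for the speech-recognition / semantic-parsing step.
    private String parseToSemanticInstruction(byte[] rawVoice) {
        return "confirm"; // e.g. the recognised text of the user's utterance
    }

    public void onVoiceInput(byte[] rawVoice) {
        String instruction = parseToSemanticInstruction(rawVoice);
        // Determine the instruction execution subject for this instruction;
        // the order mirrors the priority discussed later (system, then
        // application, then component level).
        if (operatingSystem.canHandle(instruction)) {
            operatingSystem.execute(instruction);
        } else if (foregroundApplication.canHandle(instruction)) {
            foregroundApplication.execute(instruction);
        } else {
            for (InstructionExecutor component : voiceUserInterfaceComponents) {
                if (component.canHandle(instruction)) {
                    component.execute(instruction);
                    break;
                }
            }
        }
    }
}
```

The order in which candidates are consulted here anticipates the priority rules (system level over application level over component level) discussed later in this disclosure.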
In the embodiment of the present application, the electronic device enables the user to control by voice the operating system of the electronic device, the applications in the electronic device, and the user interface components associated with those applications. This greatly expands the application range of voice interaction, enables comprehensive voice control and end-to-end voice interaction between the user and the electronic device, and improves the user experience.
In a possible implementation of the first aspect, the method further includes: opening an application; the application includes at least one voice user interface component associated with the application; determining a voice user interface component corresponding to the semantic instruction based on the semantic instruction; the voice user interface component corresponding to the semantic instruction performs an action based on the semantic instruction.
In the embodiment of the present application, an application in the electronic device includes at least one voice user interface component associated with the application. A voice user interface component is a user interface component on the interface of the application that has voice interaction capability and can recognize the user's voice; in other words, the application has component-level voice. Illustratively, during voice interaction, the electronic device receives voice from the user, parses the voice into a semantic instruction, matches the semantic instruction against the at least one voice user interface component included in the application so as to determine a voice user interface component that can recognize the semantic instruction, and causes that voice user interface component to perform an action based on the semantic instruction. A voice user interface component is thus a component that can listen to the user's voice and perform corresponding actions according to it. During voice interaction between the user and the electronic device, the user can therefore control by voice the user interface components corresponding to the interface of the application running in the foreground of the electronic device, which greatly expands the application range of the voice control function, makes voice interaction deeper and more comprehensive, and effectively improves the user experience.
In one possible implementation of the first aspect described above, each voice user interface component comprises voice information and action information; wherein the voice information comprises at least one voice text, and each voice text comprises corresponding action information; the action information is used to describe the action performed by the corresponding voice user interface component.
In a possible implementation of the first aspect, matching, based on the semantic instruction, a voice user interface component capable of recognizing the semantic instruction in the at least one voice user interface component, and causing the voice user interface component to perform an action based on the semantic instruction includes: searching the voice text of at least one voice user interface component for the voice text matched with the semantic instruction; determining action information corresponding to the voice text matched with the semantic instruction; and causing the voice user interface component to perform a corresponding action according to the action information.
In the embodiment of the present application, the voice user interface component includes voice information, and the voice information includes voice texts. During semantic instruction matching, the voice texts of the determined at least one voice user interface component are searched for a voice text that matches the semantic instruction, so that the voice user interface component corresponding to the semantic instruction can be determined.
The voice user interface component also comprises action information, and the voice user interface component executes an action based on the semantic instruction, specifically executes an action corresponding to the action information according to the action information.
In the embodiment of the application, the voice user interface component corresponding to the semantic instruction can be determined through the voice information and the action information included in the voice user interface component, and the voice user interface component is made to execute the action corresponding to the semantic instruction, so that the control of the voice user interface component according to the voice of the user is conveniently realized.
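As a non-normative illustration of the voice information and action information just described, the following Java sketch (names are assumptions) models a voice user interface component as a set of voice texts, each mapped to the action information to be performed when that text matches the semantic instruction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Non-normative model (assumed names): a voice user interface component whose
// voice information is a set of voice texts, each mapped to action information
// describing what the component does when that voice text is matched.
class VoiceUserInterfaceComponent {

    // voice text -> action information (a Runnable is used here for brevity)
    private final Map<String, Runnable> voiceTextToAction = new LinkedHashMap<>();

    void addVoiceText(String voiceText, Runnable action) {
        voiceTextToAction.put(voiceText, action);
    }

    // Matching: look for a voice text equal to the semantic instruction and,
    // if one is found, perform the action described by its action information.
    boolean tryExecute(String semanticInstruction) {
        Runnable action = voiceTextToAction.get(semanticInstruction);
        if (action == null) {
            return false; // this component does not recognise the instruction
        }
        action.run();
        return true;
    }
}
```

A "Confirm" button, for instance, could register the voice text "Confirm" with an action that performs its click behaviour.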
In one possible implementation of the first aspect described above, the voice user interface component includes a prompt that presents content associated with the voice information.
Through the prompt information, the electronic device can present a prompt corresponding to the prompt information on its user interface for the user to view. The prompt presents content associated with the voice information; that is, the prompt can be information corresponding to a voice text and hence to a semantic instruction. The user can input voice according to the prompt, so that the input voice corresponds to a semantic instruction. The user can thereby control the voice user interface component accurately by voice, and the electronic device can determine more accurately and quickly, from the user's input voice, the voice user interface component corresponding to that voice.
In a possible implementation of the first aspect, different voice user interface components have different voice information.
In the embodiment of the present application, different voice user interface components have different voice information, so that during voice control the unique voice user interface component corresponding to the user's voice can be determined. This avoids semantic instruction conflicts between voice user interface components and enables accurate voice control of the voice user interface components.
In one possible implementation of the first aspect described above, the at least one voice user interface component associated with the application comprises a voice user interface component that is visibly displayed on an interface of the application and/or a voice user interface component that is invisibly included in the application.
In this embodiment of the application, the interface of the application includes at least one voice user interface component, which may specifically be a voice user interface component presented through a display interface of a screen of the electronic device, or a voice user interface component not presented on the display interface of the screen of the electronic device.
In one possible implementation of the first aspect described above, the voice user interface component comprises feedback information, the voice user interface component performs the action based on the semantic instruction and displays the feedback information associated with the action, wherein the feedback information comprises visual feedback content and/or auditory feedback content.
In the embodiment of the present application, after the voice user interface component performs the action corresponding to the semantic instruction, it can also present feedback information on the application interface to inform the user of how the action was carried out, which can effectively enhance the interaction between the user and the electronic device and improve the user experience.
In one possible implementation of the first aspect, the voice user interface component includes switch information, and the voice user interface component determines whether semantic instruction matching is allowed according to the switch information. Illustratively, when the switch information is on, the voice user interface component allows semantic instruction matching, and when the switch information is off, the voice user interface component does not allow semantic instruction matching.
In a possible implementation of the first aspect, the application has at least one application-level voice text and at least one piece of application-level action information, where each application-level voice text has corresponding application-level action information; and when an application-level voice text corresponding to the semantic instruction is matched, or when both an application-level voice text corresponding to the semantic instruction and a voice text of a voice user interface component are matched, the action described by the application-level action information corresponding to that application-level voice text is performed.
In one possible implementation of the first aspect described above, a search is performed, based on the semantic instruction, in the application-level voice texts of the application and in the voice texts of at least one voice user interface component included in the interface of the application; when both an application-level voice text corresponding to the semantic instruction and a voice text of a voice user interface component are matched, the action described by the application-level action information corresponding to the application-level voice text is executed.
In the embodiment of the application, the application in the electronic device can have the component-level voice and the application-level voice, so that the voice interaction of the electronic device is layered to meet the voice control under different conditions.
In the process of performing voice interaction, the semantic instruction may be matched with an application-level voice text applied in the electronic device to determine an application-level voice text corresponding to the semantic instruction, and the electronic device may execute a corresponding action according to the application-level action information corresponding to the application-level voice text, thereby implementing application-level voice interaction.
Application-level voice has higher priority than component-level voice: when the semantic instruction simultaneously matches an application-level voice text and the voice text of a voice user interface component, the action described by the application-level action information corresponding to the application-level voice text is executed.
In a possible implementation of the first aspect, the operating system of the electronic device has at least one system-level voice text and at least one piece of system-level action information, where each system-level voice text has corresponding system-level action information; and when a system-level voice text corresponding to the semantic instruction is matched, or when a system-level voice text is matched together with an application-level voice text and/or a voice text of a voice user interface component corresponding to the semantic instruction, the action described by the system-level action information corresponding to that system-level voice text is performed.
In one possible implementation of the first aspect described above, a search is performed, based on the semantic instruction, in the system-level voice texts, the application-level voice texts, and the voice texts of at least one voice user interface component included in the interface of the application; if an application-level voice text or a voice text of a voice user interface component corresponding to the semantic instruction is matched at the same time as a system-level voice text, the action described by the system-level action information corresponding to the system-level voice text is performed.
In the embodiment of the present application, in addition to the component-level voice and application-level voice of the applications, the electronic device also has system-level voice, so that the voice interaction of the electronic device is more hierarchical and voice control under different conditions is satisfied.
In the process of performing voice interaction, the semantic instruction may be matched with the system-level voice text in the electronic device to determine the system-level voice text corresponding to the semantic instruction, and the electronic device is enabled to execute a corresponding action according to the system-level action information corresponding to the system-level voice text, thereby implementing system-level voice interaction.
In the embodiment of the present application, the electronic device supports component-level, application-level and system-level voice control, which enables comprehensive voice control and end-to-end voice interaction between the user and the electronic device, and improves the user experience.
System-level voice has higher priority than application-level voice: when the semantic instruction simultaneously matches a system-level voice text and an application-level voice text or the voice text of a voice user interface component, the action described by the system-level action information corresponding to the system-level voice text is executed.
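The priority rule described above could be realized, for example, by the following resolver sketch (Java, invented names): the semantic instruction is looked up first in the system-level voice texts, then in the application-level voice texts, and only then in the voice texts of the voice user interface components, so that a simultaneous match is always resolved in favour of the higher level.

```java
import java.util.Map;
import java.util.Optional;

// Sketch (invented names) of the priority rule: a semantic instruction that
// matches voice texts at several levels at once is resolved in favour of the
// system level, then the application level, then the component level.
class LayeredInstructionResolver {

    private final Map<String, Runnable> systemLevelTexts;      // system-level voice text -> action
    private final Map<String, Runnable> applicationLevelTexts; // application-level voice text -> action
    private final Map<String, Runnable> componentLevelTexts;   // voice texts of the VUI components

    LayeredInstructionResolver(Map<String, Runnable> systemLevelTexts,
                               Map<String, Runnable> applicationLevelTexts,
                               Map<String, Runnable> componentLevelTexts) {
        this.systemLevelTexts = systemLevelTexts;
        this.applicationLevelTexts = applicationLevelTexts;
        this.componentLevelTexts = componentLevelTexts;
    }

    // Returns the action to perform for the instruction, if any level matches.
    Optional<Runnable> resolve(String semanticInstruction) {
        if (systemLevelTexts.containsKey(semanticInstruction)) {
            return Optional.of(systemLevelTexts.get(semanticInstruction));
        }
        if (applicationLevelTexts.containsKey(semanticInstruction)) {
            return Optional.of(applicationLevelTexts.get(semanticInstruction));
        }
        return Optional.ofNullable(componentLevelTexts.get(semanticInstruction));
    }
}
```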
In a second aspect, an embodiment of the present application provides an electronic device for voice interaction, including: a voice receiving module configured to receive voice from a user; a semantic parsing module configured to parse the voice into a semantic instruction; a semantic instruction directional decision engine configured to determine, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, the instruction execution subject being any one of the operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device; and an instruction execution module configured to cause the instruction execution subject corresponding to the semantic instruction to perform an action based on the semantic instruction.
In a possible implementation of the second aspect, the electronic device further includes: the voice user interface component retrieval module is used for determining at least one voice user interface component which is associated with the application and is included by the application.
In one possible implementation of the second aspect above, each voice user interface component comprises voice information and action information; wherein the voice information comprises at least one voice text, and each voice text comprises corresponding action information; the action information is used to describe the action performed by the corresponding voice user interface component.
In a possible implementation of the second aspect, the apparatus further includes an instruction matching module, where the instruction matching module is configured to search a voice text matching the semantic instruction in the voice text of the at least one voice user interface component, and determine action information corresponding to the voice text matching the semantic instruction; the instruction execution module is further used for enabling the voice user interface component to execute corresponding actions according to the action information.
In one possible implementation of the second aspect described above, the voice user interface component includes a prompt that presents content associated with the voice information.
In one possible implementation of the second aspect described above, different voice user interface components have different voice information.
In one possible implementation of the above second aspect, the at least one voice user interface component associated with the application comprises a voice user interface component that is visibly displayed on an interface of the application and/or a voice user interface component that is invisibly included in the application.
In one possible implementation of the second aspect, the voice user interface component includes feedback information, the voice user interface component performs an action based on the semantic instruction, and displays feedback information associated with the action, wherein the feedback information includes visual feedback content and/or auditory feedback content.
In one possible implementation of the second aspect, the voice user interface component includes switch information, and the voice user interface component determines whether semantic instruction matching is allowed according to the switch information.
In a possible implementation of the second aspect, the application has at least one application-level voice text and at least one piece of application-level action information, where each application-level voice text has corresponding application-level action information; and when an application-level voice text corresponding to the semantic instruction is matched, or when both an application-level voice text corresponding to the semantic instruction and a voice text of a voice user interface component are matched, the action described by the application-level action information corresponding to that application-level voice text is performed.
In a possible implementation of the second aspect, the instruction matching module is further configured to search, based on the semantic instruction, the application-level voice texts of the application and the voice texts of at least one voice user interface component included in the interface of the application, and to determine that, when both an application-level voice text corresponding to the semantic instruction and a voice text of a voice user interface component are matched, the action described by the application-level action information corresponding to the application-level voice text is to be performed; and the instruction execution module is further configured to cause the application to execute the action described by the application-level action information corresponding to the application-level voice text.
In a possible implementation of the second aspect, the system further includes an application-defined instruction retrieving module, configured to search in an application-level voice text of the application based on the semantic instruction.
In a possible implementation of the second aspect, the operating system of the electronic device has at least one system-level voice text and at least one piece of system-level action information, where each system-level voice text has corresponding system-level action information; and when a system-level voice text corresponding to the semantic instruction is matched, or when a system-level voice text is matched together with an application-level voice text and/or a voice text of a voice user interface component corresponding to the semantic instruction, the action described by the system-level action information corresponding to that system-level voice text is performed.
In one possible implementation of the second aspect, the instruction matching module is further configured to search, based on the semantic instruction, the system-level voice texts, the application-level voice texts, and the voice texts of at least one voice user interface component included in the interface of the application, and to determine that, if an application-level voice text or a voice text of a voice user interface component corresponding to the semantic instruction is matched at the same time as a system-level voice text, the action described by the system-level action information corresponding to the system-level voice text is to be performed; and the instruction execution module is further configured to cause the system to execute the action described by the system-level action information corresponding to the system-level voice text.
In a possible implementation of the second aspect, the system further includes a system instruction retrieving module, configured to search in the system level speech text based on the semantic instruction.
In a possible implementation of the second aspect, the electronic device further includes an instruction distribution module, where the instruction distribution module is configured to distribute the semantic instruction to the instruction execution subject.
The electronic device provided by the present application includes a module for executing the voice interaction method provided by the first aspect and/or any one of the possible implementation manners of the first aspect, so that the beneficial effects (or advantages) of the voice interaction method provided by the first aspect can also be achieved.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing a computer program, the computer program comprising program instructions; and a processor for executing the program instructions to cause the electronic device to perform the following voice interaction method: receiving voice from a user and parsing the voice into a semantic instruction; determining, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, the instruction execution subject being any one of the operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device; and causing the instruction execution subject corresponding to the semantic instruction to perform an action based on the semantic instruction.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to: receive voice from a user and parse the voice into a semantic instruction; determine, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, the instruction execution subject being any one of the operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device; and cause the instruction execution subject corresponding to the semantic instruction to perform an action based on the semantic instruction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
FIG. 1 is a block diagram illustrating an electronic device, according to some embodiments of the present application;
FIG. 2 is a block diagram illustrating a software architecture of an electronic device, according to some embodiments of the present application;
FIGS. 3A-3C are schematic diagrams illustrating some voice user interface components, according to some embodiments of the present application;
FIG. 3D is an interface schematic diagram illustrating an electronic device, according to some embodiments of the present application;
FIG. 3E is a schematic diagram illustrating some instruction sets and execution sets, according to some embodiments of the present application;
FIGS. 4A-4C are interface diagrams illustrating some electronic devices, according to some embodiments of the present application;
FIG. 4D is a schematic diagram illustrating an application declaring speech capabilities, according to some embodiments of the present application;
FIG. 5A is a method flow diagram illustrating a method of voice interaction, according to some embodiments of the present application;
FIG. 5B is an interface schematic diagram illustrating an electronic device, according to some embodiments of the present application;
FIG. 6A is a method flow diagram illustrating another method of voice interaction, according to some embodiments of the present application;
FIG. 6B is a method flow diagram illustrating another method of voice interaction, according to some embodiments of the present application;
FIGS. 7A-7F are diagrams illustrating some voice interaction scenarios, according to some embodiments of the present application;
FIG. 8 is a diagram illustrating a multimodal interaction UI system, according to some embodiments of the application;
FIG. 9 is a schematic diagram illustrating an electronic device, according to some embodiments of the present application;
FIG. 10 is a schematic diagram illustrating a structure of a system on a chip (SoC), according to some embodiments of the present application.
Detailed Description
The following embodiments of the present application are described by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. Although the description is given in conjunction with the embodiments, this does not mean that the features of the application are limited to those embodiments. On the contrary, describing the application with reference to the embodiments is intended to cover alternatives or modifications that may be derived from the claims of the present disclosure. In the following description, numerous specific details are included to provide a thorough understanding of the present application; the application may nevertheless be practiced without these details. Moreover, some specific details are omitted from the description in order not to obscure the focus of the present application. It should be noted that, in the absence of conflict, the embodiments and the features of the embodiments in the present application may be combined with each other.
It should be noted that in this specification, like reference numerals and letters refer to like items in the following drawings, and thus, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be noted that the electronic device in the embodiment of the present application may specifically be an electronic device such as a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Television (TV), a desktop computer, a wearable electronic device, and a virtual reality device.
As electronic devices differ, the interaction between the user and the device also differs: interaction with a mobile phone is currently mainly by touch, while interaction with a television is mainly by remote control. Touch interaction requires the user to stay close enough to the electronic device to reach it by touch; remote-control interaction is mainly limited by distance and works only within remote-control range, and the remote-control experience depends on a limited set of remote-control keys, so its operating efficiency is not high.
Electronic devices generally use a graphical user interface to output information to the user. With the increasing variety of device forms, voice interaction has become one of the common interaction modes: the user can interact with the electronic device by voice, and the electronic device can present to the user, through the graphical user interface, the operations and interfaces that respond to the user's voice input.
Therefore, regardless of the type of electronic device, voice interaction has become one of the common interaction means. The voice interaction method and device of the present application aim to improve the voice interaction experience and to provide good voice interaction across different kinds of devices.
In order to solve the technical problem, the application provides a voice interaction method and an electronic device.
Fig. 1 shows a schematic structural diagram of an electronic device 100 provided in an embodiment of the present application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor, a gyroscope sensor, a fingerprint sensor, a temperature sensor, a touch sensor, and other sensors.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can generate an operation control signal according to the instruction operation code and the time sequence signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, and the like.
The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a display screen serial interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like.
It should be understood that the connection relationship between the modules according to the embodiment of the present invention is only illustrative, and is not limited to the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. In some embodiments, the electronic device 100 may include 1 or N display screens 194, with N being a positive integer greater than 1.
The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 100 can play music or conduct a hands-free call through the speaker 170A.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic apparatus 100 receives a call or voice information, it can receive voice by placing the receiver 170B close to the ear of the person.
The microphone 170C, also referred to as a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a voice signal into the microphone 170C by speaking close to it. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can reduce noise in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and so on.
The headphone interface 170D is used to connect wired headphones. The headphone interface 170D may be the USB interface 130, or may be a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The keys 190 include a power key, volume keys, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card.
Fig. 2 shows a block diagram of a software structure of the electronic device 100 according to an embodiment of the present application.
Wherein, the layered architecture divides the software into a plurality of layers, and each layer has clear roles and division of labor. The layers communicate with each other through a software interface.
As shown in fig. 2, the software framework includes an application layer having a series of applications, including applications such as application a, application B, application C, etc., which may be gallery, calendar, talk, bluetooth, music, video, etc.
The software framework also includes a system platform, wherein the system platform includes a user interface component. As previously mentioned, the electronic device 100 typically includes a plurality of applications, and the applications may present content to the user through an interface included in the applications, and the interface corresponding to the applications typically includes a plurality of user interface components. The user interface component is specifically an EditView component, a SearchBox component, a ListView component, and the like in the system. In the embodiment of the present application, the user interface component can listen to the user voice and can be defined as a Voice User Interface (VUI) component.
Further, the voice user interface component also has voice atomic capabilities, so that the voice user interface component can understand at least one semantic instruction.
The system platform further includes the modules for implementing the voice interaction method provided in the embodiments of the present application. Specifically, as shown in fig. 2, the system platform includes a voice capability module, where the voice capability module includes a voice receiving module and a semantic parsing module, and may further include a semantic instruction directional decision engine.
The voice receiving module is used to receive the user's voice input; the semantic parsing module is used to parse the voice received by the voice receiving module into a semantic instruction; and the semantic instruction directional decision engine is used to determine the range within which the semantic instruction is matched, either according to the voice received by the voice receiving module or according to the semantic instruction obtained by the semantic parsing module, for example by matching the semantic instruction against the voice user interface components.
The system platform also comprises a voice user interface component execution framework, and the voice user interface component execution framework comprises a voice user interface component retrieval module, an instruction matching module and an instruction execution module.
The voice user interface component retrieval module is used for retrieving the voice user interface component, and is exemplarily used for retrieving at least one voice user interface component included by the determined application.
Further, in embodiments of the present application, the application includes at least one voice user interface component that includes a voice user interface component that is visibly displayed on an interface of the application and/or a voice user interface component that is invisibly included in the application. That is, the interface of the application running in the foreground of the electronic device 100 determined in the foregoing includes at least one voice user interface component, and specifically, may be a voice user interface component presented on a display interface on a screen of the electronic device, or may be a voice user interface component not presented on the display interface on the screen of the electronic device.
Of course, the voice user interface component retrieving module may retrieve only the voice user interface components included in the interface of the application running in the foreground of the electronic device 100, or may retrieve all the voice user interface components in the electronic device 100, which may be specifically set as needed.
Further, the voice user interface component execution framework also includes an instruction distribution module. The voice user interface component retrieval module retrieves and determines the voice user interface components, the instruction matching module matches semantic instructions, the instruction distribution module distributes semantic instructions, for example to a voice user interface component, and the instruction execution module handles instruction execution, for example by causing the voice user interface component to execute the semantic instruction.
Further, the voice user interface component execution framework may also include an application custom instruction retrieval module, which is used to retrieve the custom instructions defined by the application.
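For illustration, the following Java sketch (module and type names are assumptions, not the actual framework) wires together the roles of the retrieval, matching, distribution and execution modules described above for the component-level case.

```java
import java.util.List;

// Illustrative wiring (module and type names are assumptions) of the
// execution-framework roles described above: the retrieval module collects
// the voice user interface components of the foreground application, the
// instruction matching module finds a component whose voice text matches the
// semantic instruction, and the instruction distribution/execution modules
// hand the instruction to that component and have it act.
class VuiExecutionFramework {

    // Minimal stand-in for a voice user interface component.
    interface VuiComponent {
        boolean matches(String semanticInstruction); // instruction matching
        void execute(String semanticInstruction);    // instruction execution
    }

    // Stand-in for the voice user interface component retrieval module.
    interface RetrievalModule {
        List<VuiComponent> retrieveForegroundComponents();
    }

    private final RetrievalModule retrievalModule;

    VuiExecutionFramework(RetrievalModule retrievalModule) {
        this.retrievalModule = retrievalModule;
    }

    // Matching, distribution and execution collapsed into one call for brevity.
    boolean dispatch(String semanticInstruction) {
        for (VuiComponent component : retrievalModule.retrieveForegroundComponents()) {
            if (component.matches(semanticInstruction)) {
                component.execute(semanticInstruction); // distribute and execute
                return true;
            }
        }
        return false; // no foreground component recognised the instruction
    }
}
```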
In addition, the software framework also comprises a front-end framework UIKIT-VUI component for performing related configuration of the voice user interface component.
It should be noted that the software framework also includes modules that are the same as or similar to those of the current Android system, which are not described in detail in the embodiments of the present application.
In the embodiment of the present application, a component in a component module of the electronic device 100 is defined as a voice user interface component, where the voice user interface component includes information, and the information may be specifically set in a text form.
In one implementation of the present application, the voice user interface component includes voice information, action information, prompt information, feedback information, and switch information corresponding to the voice user interface component.
Each voice user interface component includes voice information and action information. The voice information includes at least one voice text, which is used to indicate the semantic instructions that the voice user interface component can understand; that is, the semantic instructions understandable by the component can be set through its voice texts. Each voice text has corresponding action information, which describes the action to be performed by the corresponding voice user interface component, i.e. what the component should do after it understands the semantic instruction.
During matching, the voice texts of the determined at least one voice user interface component are searched for a voice text identical to the semantic instruction, so that the voice user interface component associated with that voice text can be determined. Once the voice user interface component is determined, it performs the corresponding action according to the action information corresponding to that voice text.
The prompt information is used to present content associated with the voice information; by configuring it, the electronic device can present the corresponding prompt on its user interface for the user to view.
Illustratively, the prompt information includes the aforementioned prompt used to guide the user's voice input. The prompt information may be identical to the semantic instruction, i.e. it may be the information corresponding to the semantic instruction, and is used to tell the user, in a friendly way, how to operate. The user can speak according to the prompt, so that the voice input corresponds to a semantic instruction; the electronic device 100 can then determine more accurately and quickly which voice user interface component corresponds to the user's voice, and the user can control the voice user interface component by voice more accurately.
Further, in the embodiment of the present application, after the voice user interface component performs the action based on the semantic instruction, feedback information associated with the action is displayed on the user interface of the application, based on the aforementioned feedback information, to inform the user of the execution of the action; visual feedback about the voice interaction stage in which the voice user interface component is located is also presented on the user interface. The voice interaction stage includes at least one of: before the triggering operation of the voice user interface component is executed, while it is being executed, and after it has been executed. For example, visual feedback on the user interface indicating that the voice user interface component is listening to, or has finished listening to, the user's voice input can effectively enhance the interaction between the user and the electronic device and improve the user experience. The feedback information includes visual feedback content and/or auditory feedback content.
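A minimal sketch of this feedback idea, assuming invented names, might look as follows: the component reports the voice interaction stage it is in, and visual and/or auditory feedback for that stage is presented.

```java
// Small sketch (names are assumptions) of the feedback information idea: the
// voice user interface component reports which voice-interaction stage it is
// in, and visual and/or auditory feedback associated with that stage is shown
// on the application's interface.
class VoiceFeedbackPresenter {

    enum Stage { BEFORE_TRIGGER, DURING_TRIGGER, AFTER_TRIGGER }

    void present(Stage stage) {
        switch (stage) {
            case BEFORE_TRIGGER:
                show("listening...", false);
                break;
            case DURING_TRIGGER:
                show("executing your instruction", false);
                break;
            case AFTER_TRIGGER:
                show("done", true); // e.g. a chime plus a visual confirmation
                break;
        }
    }

    // Placeholder for rendering visual feedback and optionally playing audio.
    private void show(String visualContent, boolean playSound) {
        System.out.println("feedback: " + visualContent
                + (playSound ? " (with auditory feedback)" : ""));
    }
}
```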
In the embodiment of the present application, the switch information is used to define whether the voice function of the voice user interface component is on or off. Specifically, any user interface component may be configured as a voice user interface component, and the voice user interface component controls its voice function according to the switch information: when the switch information is configured as "on", the voice function is in an on state, and the component can recognize the user's voice input and supports the voice interaction function; when the switch information is configured as "off", the voice function is in an off state, and the component cannot recognize the user's voice input and does not support the voice interaction function.
For example, referring to fig. 3A, in an embodiment of the present application, the voice information may be defined as a V instruction, used to define a semantic instruction (a voice interaction instruction) that the voice user interface component can understand; the action information may be defined as V execution, used to define what the voice user interface component does after it understands the semantic instruction; the prompt information may be defined as a V prompt, used to define the prompt displayed to the user on the UI interface; the feedback information may be defined as V feedback, used to define the visual feedback performed by the voice user interface component while it is listening and executing; and the switch information may be defined as a V switch, used to define whether the voice function of the voice user interface component is on or off.
Further, referring to fig. 3B, the V-command, the V-prompt, and the V-execution may be combined into a V-command execution set, and a voice user interface component may have multiple V-command execution sets.
The component properties and methods of the V-command may be as follows: the component has a preset instruction, which may directly use the Text of the component as the instruction. In addition, some of the component instructions may be overwritten by the developer. An example of a method of the V-command may be xx.
V-execution is an internal method that represents what the voice user interface component executes after the V-command is received; for example, it may be a click, a pull-down, and so on.
The component properties and methods of the V-prompt may be as follows: the prompt is preset by the platform, is generally the same as the V-command, and is used to inform the user in a friendly manner how to operate; in addition, it can also be customized by the developer of the voice user interface component, and for example the configuration information corresponding to the V-prompt may be xx.
The V-feedback may be visual feedback defining what the voice user interface component presents while waiting for and processing voice input, and may be implemented with reference to the existing Xiao E assistant.
The component properties and methods of the V-switch may be as follows: the voice user interface component may carry its own enable attribute, whose default value (true or false) is set according to the characteristics of the component. In addition, the voice user interface component may expose set and get interfaces so that the developer can configure the voice switch information.
For example, please refer to fig. 3C, in an embodiment of the present application, information of a voice user interface component specifically includes:
the V-switch information is @Enable=true, i.e. the voice function is turned on;
the V-command information is @VclickUIVcommd=text, i.e. the component's text is its semantic instruction;
the V-prompt information is @VclickUIVTips=@VclickUIVcommd, i.e. the instruction itself is the prompt;
the V-execution information is onClick, i.e. after the semantic instruction is triggered, the Button executes the onClick operation;
the V-feedback is visual feedback presented around the Button when the instruction is received.
Illustratively, as shown in FIG. 3D, the "confirm" text in the solid box is the V-prompt; upon receiving the "confirm" semantic instruction from the user, the confirm Button in the dashed box performs the onClick operation.
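As a purely illustrative sketch of how such a voice-enabled button might be modeled, the following code mirrors the V-switch, V-command, V-prompt, V-execution and V-feedback attributes described above; the class and member names (VButton, vEnable, onSemanticInstruction, and so on) are assumptions for illustration only and are not the actual framework interfaces.

// Hypothetical sketch of a component carrying the V-* attributes described above.
// Names (VButton, vEnable, vCommand, ...) are illustrative assumptions, not the real API.
public class VButton {
    private boolean vEnable = true;      // V-switch: voice function on by default
    private String text;                 // the Button text
    private String vCommand;             // V-command: semantic instruction the component understands
    private String vPrompt;              // V-prompt: hint shown to the user on the UI
    private Runnable vExecution;         // V-execution: action run when the command is matched

    public VButton(String text, Runnable onClick) {
        this.text = text;
        this.vCommand = text;            // @VclickUIVcommd = text
        this.vPrompt = this.vCommand;    // @VclickUIVTips = @VclickUIVcommd
        this.vExecution = onClick;       // onClick is executed when the command is triggered
    }

    // Called by the instruction matching module with the parsed semantic instruction.
    public boolean onSemanticInstruction(String instruction) {
        if (!vEnable || !vCommand.equals(instruction)) {
            return false;                // not handled by this component
        }
        showFeedback();                  // V-feedback: e.g. highlight drawn around the button
        vExecution.run();                // V-execution: perform the click
        return true;
    }

    private void showFeedback() { /* draw visual feedback around the button */ }
}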
Further, in the embodiment of the present application, common V-commands, V-executions and V-prompts may be configured for a voice user interface component, and multiple items may be configured; that is, the voice user interface component has an atomic capability and can essentially provide a complete implementation of common voice interaction instructions. For example, a V-command, a V-execution and a V-prompt may together form a semantic instruction execution set of the voice user interface component, and a voice user interface component may have one or more semantic instruction execution sets. In other words, the voice information of the voice user interface component includes one or more voice texts corresponding to the semantic instructions that trigger the component, so that one voice user interface component can understand multiple semantic instructions and either execute one action and generate one prompt for all of them, or execute a different action and display a different prompt for each of them.
For example, referring to fig. 3E, the common/generic instruction set and common execution set of system functions in the VUI atomic capability include the following example: the V-command of a voice user interface component may include only one voice text, "next episode", or may include multiple voice texts such as "next episode", "play the next episode", "next one", "next page", "please play the next one" and "next". When the received semantic instruction is any one of these voice texts, the voice user interface component is determined as the voice user interface component corresponding to the voice, so that it performs the operation corresponding to the voice.
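A minimal sketch of this one-to-many matching could look like the following; the class name, the voiceTexts list and the matchesInstruction method are illustrative assumptions rather than the framework's actual interfaces.

// Hypothetical sketch: one component understanding several voice texts for the same action.
import java.util.Arrays;
import java.util.List;

public class NextEpisodeControl {
    // V-command: several voice texts all mapped to the same "next episode" action
    private final List<String> voiceTexts = Arrays.asList(
            "next episode", "play the next episode", "next one",
            "next page", "please play the next one", "next");

    public boolean matchesInstruction(String semanticInstruction) {
        return voiceTexts.contains(semanticInstruction);
    }

    public void execute() {
        // V-execution: switch the player to the next episode
    }
}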
Further, for example, the common execution set of system functions includes the following V-execution examples:
VUI-DO-HOME: return to the desktop
VUI-DO-OPENBT: turn on Bluetooth
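These common execution identifiers could, for instance, be dispatched through a small lookup table such as the sketch below; the class name and the placeholder actions are assumptions, only the two identifiers above come from the example.

// Hypothetical dispatch table for common V-execution identifiers.
import java.util.HashMap;
import java.util.Map;

public class CommonExecutionSet {
    private final Map<String, Runnable> executions = new HashMap<>();

    public CommonExecutionSet() {
        executions.put("VUI-DO-HOME", () -> { /* return to the desktop */ });
        executions.put("VUI-DO-OPENBT", () -> { /* turn on Bluetooth */ });
    }

    public void run(String executionId) {
        Runnable action = executions.get(executionId);
        if (action != null) {
            action.run();
        }
    }
}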
It should be noted that, in this embodiment of the application, the voice user interface component specifically refers to a component in the aforementioned component module. The information of the voice user interface component is configured by the developer of the component, and both the electronic device 100 itself and any application installed in the electronic device 100 can use the voice user interface component.
For example, the application can use the voice user interface component by declaration; specifically, when developing the application, the application developer directly declares the capability of using the system's voice user interface components, so that the application can call the system's voice user interface components to realize its voice interaction function.
Further, the voice capability of the system's VUI components can be used by declaring it in one of the application's existing description files or in a newly created description file.
Illustratively, the declaration may specifically refer to a non-intrusive registration of a voice user interface component, and the declaration may be as follows:
ID-VIEW1:
    Venable=true,
    Vcmd:@text,
    Vtip:@Vcmd,
    VDO:Onclick
In addition, when the application is developed, the application developer can separately configure the voice user interface component for the application, so that the application naturally has a voice interaction function.
Further, the application may be a native application installed in the electronic device, or a third-party application developed by a third-party developer; when developing the third-party application, the third-party developer may configure it to use the voice user interface components in the electronic device 100, for example by means of the foregoing declaration.
It should be noted that, in the embodiment of the present application, different voice user interface components include different voice information. Because the voice information of different voice user interface components differs, the unique voice user interface component corresponding to the user's voice can be determined during voice control, and accurate voice control of the voice user interface component is realized.
That is, to avoid semantic instruction conflicts during use, i.e. to avoid one semantic instruction being matched to at least two user interface components at the same time, the voice texts of the voice user interface components supported by each user interface in the electronic device 100 may be preset to differ from each other. In that case, when the received semantic instruction is matched only against the user interface components of the application currently running in the foreground of the electronic device 100, no instruction conflict arises, which effectively improves the user experience.
In addition, the electronic device 100 may preset different voice information for all voice user interface components it supports, so that instruction conflicts are avoided regardless of whether the semantic instruction is matched against the components of the current user interface or against any user interface component supported by the electronic device 100.
In detail, to avoid the foregoing conflicts, the developer of the electronic device 100 and the developer of the application may configure the voice texts of the voice user interface components during development of the electronic device 100 and of the application, so that the voice information of different user interface components does not contain the same semantic instruction.
For example, referring to fig. 4A, an application in the electronic device 100 uses two identical voice user interface components whose voice texts are both "confirm". If the user inputs "confirm" by voice, a semantic instruction conflict would arise. To avoid this conflict and achieve accurate targeting, the voice texts of the voice user interface components may be configured during the development stage of the electronic device 100 and of the application.
Specifically, when the developer of the electronic device 100 configures the voice information of a voice user interface component in the electronic device, the developer may tag the semantic instruction of the component. For example, referring to FIG. 4B, the system tags the two "confirm" components, one with "confirm 1" and the other with "confirm 2".
Note that the tagging here may specifically mean that the "confirm 1" and "confirm 2" voice texts are configured in the voice information of the respective components.
In addition, in the embodiment of the present application, the voice user interface component may be opened to third-party application developers for overwriting. If the voice information of the voice user interface component is allowed to be overwritten, the application developer may overwrite its semantic instruction set, prompt information and other contents. For example, referring to FIG. 4C, the third-party application developer overwrites one of the two "confirm" texts with "clause confirmation".
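As an illustrative sketch only, such overwriting could be exposed through setter interfaces on the component; the class and method names below (VoiceComponent, setVCommand, setVPrompt) are assumptions and not the actual open interfaces of the framework.

// Hypothetical sketch: resolving a conflict by overwriting one component's voice text.
public class ConflictResolutionExample {
    public static void main(String[] args) {
        VoiceComponent orderConfirm = new VoiceComponent("confirm");
        VoiceComponent clauseConfirm = new VoiceComponent("confirm");

        // Both components would match the voice text "confirm", so the developer
        // overwrites one of them to keep the voice information unique.
        clauseConfirm.setVCommand("clause confirmation");
        clauseConfirm.setVPrompt("clause confirmation");
    }
}

class VoiceComponent {
    private String vCommand;
    private String vPrompt;

    VoiceComponent(String text) {
        this.vCommand = text;
        this.vPrompt = text;
    }

    void setVCommand(String command) { this.vCommand = command; }
    void setVPrompt(String prompt) { this.vPrompt = prompt; }
}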
Further, in the embodiment of the application, after the voice input has been parsed into a semantic instruction, if no voice user interface component corresponding to the semantic instruction is matched, the semantic instruction is not executed, and operation prompt information is displayed to inform the user of the matching result or of the next operation to perform.
Further, in the electronic device 100, by setting the aforementioned voice user interface component, a user can implement voice control over the user interface component in the electronic device 100 through voice, that is, the embodiment of the present application can implement component-level voice interaction.
Further, the embodiment of the present application can also implement the whole-process voice control of the system, the application, and the like in the using process of the electronic device 100 by the user.
Further, in another implementation of the present application, the electronic device also supports application-level voice interaction. Specifically, in the embodiment of the present application, the application has at least one application-level voice text and at least one piece of application-level action information as an application-level semantic instruction set, where each voice text includes corresponding action information. The application-level voice text is responded to with a higher priority than the voice text of the voice user interface component.
That is, during voice interaction, the electronic device 100 matches the semantic instruction obtained from the user's voice against the application-level voice texts of the application to determine the application-level voice text corresponding to the semantic instruction; if a corresponding application-level voice text is matched, the application performs the corresponding operation according to the application-level action information associated with that voice text, thereby implementing application-level voice interaction.
Further, in the embodiment of the present application, based on the semantic instruction, a search is performed among the application-level voice texts of the application and the voice texts of the at least one voice user interface component included in the interface of the application. If an application-level voice text corresponding to the semantic instruction is matched, or if both an application-level voice text and a voice text of a voice user interface component are matched at the same time, the action described by the application-level action information corresponding to the application-level voice text is performed; in other words, application-level voice texts have a higher priority than component-level voice texts.
The application has at least one application-level voice text and at least one piece of application-level action information, and these can specifically be set through the application's SDK package.
Further, the application registers its application-level semantic instruction set with the system of the electronic device 100, for example in a non-intrusive description manner using XML (Extensible Markup Language) or DLS (Deep Learning Service).
In addition, after receiving a custom-defined semantic instruction, the system can send a signal to the application or directly execute an execution command set by the application.
In the embodiment of the application, the application has at least one application-level voice text and at least one piece of application-level action information as an application-level semantic instruction set, wherein each piece of voice text includes corresponding action information.
Illustratively, the application declares its voice capability through the application-level semantic instruction set, which may be formatted as instruction | signal | execution.
For example, please refer to fig. 4D:
// The application declares its voice capability, formatted as instruction | signal | execution
VCMD HISTORY|1|0
Science fiction film|2|0
Stop watching|null|HOME
The instruction field supports developer-defined commands, such as "exit", and also supports wildcard descriptions of a command, such as "science fiction*", which represents any instruction containing "science fiction". It also supports referencing system-preset instruction IDs, such as VCMD HISTORY, a scenario statement built into the system. Additionally, in the longer term it may support references to Android resource string IDs, e.g. ID STRING XXXX.
Further, "null" in the signal field means that no signal is sent to the application; it is generally used together with the following "execution" field, i.e. the application does not need to receive a signal and instead expects the system to execute the instruction directly. In addition, the signal can be customized by the developer, in which case the system sends the signal set by the application back to the application.
"0" indicates what the application does not need the system to execute, and is handled by the application in the callback process. It may be a common command for system opening, such as HOME (back to the desktop). In addition, the system can automatically jump to specific Activity.
Further, the system voice interaction instruction set can be stored in the form of a semantic instruction table. The developer of the application may also configure the application with an application voice interaction instruction set, which is similar to the system voice interaction instruction set.
Further, in the embodiment of the present application, the electronic device also supports system-level voice interaction; that is, the electronic device has at least one system-level voice text and at least one piece of system-level action information, where each system-level voice text includes corresponding system-level action information, and the system-level voice text is responded to with a higher priority than the application-level voice text.
In the process of performing voice interaction, the semantic instruction may be matched with a system-level voice text in the electronic device to determine the system-level voice text corresponding to the semantic instruction, and the electronic device is enabled to execute a corresponding action according to the system-level action information corresponding to the system-level voice text, thereby implementing system-level voice interaction.
Further, based on the semantic instruction, a search is conducted among the system-level voice texts, the application-level voice texts, and the voice texts of the at least one voice user interface component included in the interface of the application. If a system-level voice text corresponding to the semantic instruction is matched at the same time as an application-level voice text or a voice text of a voice user interface component, the action described by the system-level action information corresponding to the system-level voice text is performed; in other words, system-level voice texts have a higher priority than application-level voice texts.
Specifically, a system-level semantic instruction module may be preset in the electronic device 100. The system-level semantic instruction module includes a system voice interaction instruction set, which contains the at least one system-level voice text and at least one piece of system-level action information; the system-level voice texts may for example include "return to the desktop", "open a video", "set an alarm clock", and the like. Its functions are the same as those of the existing Xiao E assistant and the like, and a detailed description is omitted in the present application.
In the embodiment of the application, the electronic device 100 supports voice interaction at the component level, the application level and the system level, so that its voice interaction is layered to satisfy voice control under different conditions. Comprehensive voice control can thus be realized, whole-course voice interaction between the user and the electronic device becomes possible, and the user experience is improved.
Further, in the embodiment of the present application, a semantic instruction direction decision engine is provided to determine the range within which the semantic instruction is matched.
Specifically, the instruction direction decision engine may pre-limit the matching range within which the electronic device 100 performs semantic instruction matching after obtaining the semantic instruction. The matching range may be limited to the voice user interface components supported by the user interface of the application currently running in the foreground of the electronic device 100, or it may cover all voice user interface components supported by the electronic device 100, that is, all voice user interface components installed in the electronic device 100. Furthermore, the matching may also include matching the semantic instruction against the system-level semantic instruction set and against the application-level semantic instruction set.
It should be noted that, in the embodiment of the present application, the instruction direction decision engine may make the direction decision entirely by priority, where the priority of system-level semantic instructions is higher than that of application-level semantic instructions, and the priority of application-level semantic instructions is higher than that of component-level semantic instructions. In addition, independent identification words may be set for system-level, application-level and component-level semantic instructions. For example, if the identification word of system-level semantic instructions is "Xiao E" and the user's voice includes "Xiao E", the semantic instruction decision module determines that the instruction is a system-level semantic instruction; if the identification word of application-level semantic instructions is "hello" and the user's voice includes "hello", the instruction is an application-level semantic instruction; and if the user's voice includes neither "Xiao E" nor "hello", the instruction is a component-level semantic instruction. Further, it may also be defined that, within an application, the priority of application-level semantic instructions is higher than that of component-level semantic instructions.
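A minimal sketch of the identification-word routing described above could look like the following; the class name, enum and method are illustrative assumptions that merely mirror the decision rules just stated.

// Hypothetical sketch of the identification-word based direction decision.
public class InstructionDirectionDecisionEngine {
    public enum Level { SYSTEM, APPLICATION, COMPONENT }

    // Route by identification word first, falling back to the component level.
    public Level decide(String userVoice) {
        if (userVoice.contains("Xiao E")) {
            return Level.SYSTEM;        // system-level identification word
        }
        if (userVoice.contains("hello")) {
            return Level.APPLICATION;   // application-level identification word
        }
        return Level.COMPONENT;         // no identification word: component level
    }
}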
Further, it should be noted that, in the embodiments of the present application:
The instruction source for system-level instructions is: unified definition and management by the platform (for example, a board platform can be opened so that each service can define instructions); the effective scene is: all scenes; the instruction execution subject is: the system, or the system calling an application capability; examples: performing a voice search or setting an alarm clock on any interface, or opening a certain application; the applicable scope is: scene-independent, high-frequency interaction commands.
The instruction source for application-level instructions is: custom definition by the application; the instruction execution subject is: the application; examples: a video playing application controlling the playback progress by voice, switching the program being played, and the like; the applicable scope is: precise voice interaction designed by the application for its own service.
The instruction source for component-level instructions is: definition by the system components; the instruction execution subject is: the system, i.e. the VUI component; examples: the current EditBox can directly receive voice input, and a Button can be clicked directly by voice; the applicable scope is: general-purpose interaction.
Specifically, based on the voice user interface component provided in the embodiment of the present application, a platform for developing and using such components may be provided for all parties. For example, the voice user interface component may be offered to developers of third-party applications for direct use when developing an application; if the application uses the voice user interface component directly, then once the application is installed in the electronic device 100 and the V-switch is turned on, the electronic device 100 can perform voice control on the application after receiving a semantic instruction.
Referring to fig. 5A, fig. 5A is a voice interaction method provided by an embodiment of the present application, which is applied to the electronic device 100, and the method includes:
s100: receiving voice from a user and analyzing the voice into semantic instructions;
s200: determining an instruction execution subject corresponding to the semantic instruction based on the semantic instruction; the instruction execution subject is any one of an operating system of the electronic equipment, an application in the electronic equipment and a voice user interface component associated with the application in the electronic equipment;
s300: an instruction execution subject corresponding to the semantic instruction performs an action based on the semantic instruction.
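The three steps above can be pictured with the following minimal sketch, in which an ExecutionSubject interface is implemented by the operating system, an application, or a voice user interface component; every name in it is an illustrative assumption rather than the actual implementation.

// Hypothetical sketch of S100-S300: parse voice, pick an execution subject, execute.
import java.util.List;

interface ExecutionSubject {
    boolean canHandle(String semanticInstruction);   // does this subject own the instruction?
    void execute(String semanticInstruction);        // perform the action described by the instruction
}

class VoiceInteractionMethod {
    private final List<ExecutionSubject> subjects;   // system, applications, voice UI components

    VoiceInteractionMethod(List<ExecutionSubject> subjects) {
        this.subjects = subjects;
    }

    void onVoice(String userVoice) {
        String instruction = parse(userVoice);        // S100: parse voice into a semantic instruction
        for (ExecutionSubject subject : subjects) {   // S200: determine the execution subject
            if (subject.canHandle(instruction)) {
                subject.execute(instruction);         // S300: the subject performs the action
                return;
            }
        }
    }

    private String parse(String userVoice) {
        // semantic parsing / keyword extraction would happen here
        return userVoice;
    }
}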
The following describes the voice interaction method provided in the embodiment of the present application in detail with reference to the foregoing electronic device 100 and the software structure block diagram of the electronic device 100.
Specifically, the embodiment of the present application provides an electronic device 100 capable of implementing voice control of a voice user interface component by a user through voice, and a method for voice interaction between a user and the electronic device 100, so that the user can implement voice control of the voice user interface component executed in the electronic device 100 in a process of using the electronic device 100, so as to improve user experience.
For example, the application may be a video application that plays video, and the user may watch the video presented by a television through the user interface. A scenario in which the user plays video on a television is described below as an example.
Illustratively, as shown in fig. 5B, a user interface for playing video on a television generally includes a video picture display area and prompt information display areas of various other voice user interface components used to control video playback. For example, voice user interface components such as "pause", "screen capture", "record", "fast forward", "settings" and "coin-in" all display prompt information and are visibly displayed on the interface of the application; in addition, the application also includes voice user interface components such as "slide down", "slide up" and "next episode" that are included in the interface in an invisible manner.
During the playing process of the video, the user can input voice data to the television. The television obtains the voice data entered by the user, e.g., the user may speak "next episode" to the television.
After receiving voice data input by a user, the television carries out semantic analysis on the voice data and extracts keywords to generate a semantic instruction. The semantic instruction may specifically be a character string that can be recognized by the television.
Illustratively, if the voice data input by the user is "please play the next episode" or "next episode", the semantic instruction obtained after the semantic parsing module extracts the keywords is the character string "next episode".
The television retrieves voice user interface components included in an interface of an application currently running in the foreground of the television to determine at least one voice user interface component included in the interface of the application currently running in the foreground of the television.
After the semantic instruction and at least one voice user interface component are determined, the television matches the semantic instruction with voice information of the voice user interface component to determine the voice user interface component corresponding to the semantic instruction.
Further, after the voice user interface component is determined, the television distributes the semantic instruction to the voice user interface component, so that the voice user interface component executes the triggering operation of the voice user interface component corresponding to the semantic instruction according to the action information.
For example, if the user wants to pause, after seeing the "pause" prompt text on the current video playing interface, the user knows that the semantic instruction to input is "pause"; the user then simply says "pause" to the electronic device, and the playing video is paused.
In addition, the user may also say "please pause" to the television; based on the settings of the voice information, the television can still accurately determine the voice user interface component corresponding to the semantic instruction.
In order to help the user know which voice interaction stage the user interface is in and to make voice interaction more enjoyable, visual feedback information can be presented on the user interface. Illustratively, if the user says "coin in" to the television, the voice user interface component corresponding to "coin in" is triggered to perform a click operation so as to execute the "coin in" operation; after the "coin in" operation is executed, feedback information is further generated to prompt that the "coin in" operation has been completed.
Specifically, for example, the text "1 coin inserted" may be displayed in the "coin-in" display frame, or an animation of a coin being inserted may be displayed.
The voice interaction method provided by the embodiment of the application can conveniently realize the voice control of the user interface component, enhance the depth of the voice interaction of the user and effectively improve the experience of the user.
In one implementation of the present application, the electronic device 100 supports system level voice interaction, application level voice interaction, and component level voice interaction.
Referring to fig. 6A, an embodiment of the present application provides a voice interaction method, which may be specifically applied to a mobile phone, and the method includes:
s210, the mobile phone receives the voice input of the user.
S220, the mobile phone obtains a semantic instruction according to the voice and determines the type of the semantic instruction.
Specifically, the semantic instructions in the mobile phone form three layers of instruction sets: system-level, application-level and component-level semantic instructions, and a semantic instruction decision module is required to decide the response subject of an instruction. If the user's voice includes "Xiao E", the semantic instruction decision module determines that the semantic instruction is a system-level semantic instruction; if the user's voice includes "hello", it is an application-level semantic instruction; and if the user's voice includes neither "Xiao E" nor "hello", it is a component-level semantic instruction.
And S230, the mobile phone matches the semantic instruction according to the determined instruction type, and sends the semantic instruction to its execution subject for execution according to the matching result.
Specifically, if the semantic instruction is a system-level semantic instruction, the execution subject is the system, and S240 is executed; if the semantic instruction is an application-level semantic instruction, the execution subject is the application, and S250 is executed; and if the semantic instruction is a component-level semantic instruction, the execution subject is a voice user interface component, and S260 is executed.
S240, the mobile phone system executes the semantic instruction. This is equivalent to the existing Xiao E assistant and the like and is not described here.
And S250, executing the semantic instruction by the application in the mobile phone.
And S260, executing a semantic instruction by a voice user interface component in the mobile phone. The semantic instructions executed by the voice user interface component are specifically described above, and are not described herein again.
Further, in the embodiment of the application, after receiving a custom-defined semantic instruction, the system may send a signal to the application, or directly execute an execution command set by the application.
Referring to fig. 6B, dispatching the semantic instruction includes:
s310, determining a semantic instruction according to the voice input of the user.
S320, judging whether the current application registers the instruction or not, and if not, ending the voice interaction; if yes, go to step S330.
S330, determining whether the application registers the system command, if yes, executing S340, and if no, executing step S350.
S340, executing the system level semantic instruction.
S350, determine whether the application registers signaling, if so, execute S360.
S360, a signal is sent to the application, and S370 is further performed.
S370, the application processes the callback.
In processing the callback, the application receives the corresponding message callback according to the registered instruction, and can handle the execution of the instruction in the callback function.
An example of instruction callback processing is specifically shown below:
// The application declares its voice capability, formatted as instruction | signal | execution
SINGLEHISTORY 1
SINGLEMOVIETYPESCIENCE
VCMD HISTORY|1|0
Science fiction film|2|0
Stop watching|null|HOME
Application Callback example
(The application callback example is provided as an embedded code image, Figure BDA0002444285710000191, in the original specification.)
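Since that callback example survives only as an image, the following is a hedged reconstruction of what such a callback might look like, assuming it receives the signal values declared above; the class name, method name and constants are illustrative assumptions and not the actual interface.

// Hypothetical reconstruction of an application-side callback for registered voice instructions.
public class VideoAppVoiceCallback {
    private static final int SIGNAL_HISTORY = 1;        // signal declared for "VCMD HISTORY|1|0"
    private static final int SIGNAL_MOVIE_SCIENCE = 2;  // signal declared for "Science fiction film|2|0"

    // Called by the system with the signal value registered in the declaration.
    public void onVoiceSignal(int signal) {
        switch (signal) {
            case SIGNAL_HISTORY:
                openPlayHistory();        // the application handles the instruction itself ("execution" = 0)
                break;
            case SIGNAL_MOVIE_SCIENCE:
                showScienceFictionList();
                break;
            default:
                break;
        }
    }

    private void openPlayHistory() { /* jump to the play-history page */ }
    private void showScienceFictionList() { /* filter the catalogue by science fiction */ }
}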
For example, please refer to fig. 7A to 7F, which show a specific application scenario of the voice interaction method provided in the embodiment of the present application, taking a voice interaction experience of video playing as an example.
Illustratively, as shown in FIG. 7A, the handset is at the desktop.
When the user wants to watch videos, the user inputs the voice "I want to watch videos" into the mobile phone.
In the present embodiment, it is assumed as an example that only one video application is installed in the mobile phone. When there are multiple video applications in the system, the user can specify the name of the video application, for example "I want to watch video through application A".
After receiving "I want to watch videos", the voice receiving module of the mobile phone sends the voice to the semantic parsing module, which parses the "I want to watch videos" voice to obtain the semantic instruction "open the video application". The semantic parsing module then sends "open the video application" to the instruction matching module.
The instruction matching module matches the semantic instruction; specifically, it matches the semantic instruction one by one against the system-level voice texts, the application-level voice texts, and the voice information of the voice user interface components on the current interface, and finally determines that the system is to execute the semantic instruction. The semantic instruction is then sent to the system, and the system opens the video application by executing the semantic instruction.
For example, after the system executes the semantic instruction, the user interface of the mobile phone is specifically shown in fig. 7B.
It should be noted that the system level semantic instruction may complete the opening/closing of the video application, etc.
Furthermore, when the user says "open the play history" to the mobile phone, the semantic parsing module parses out the semantic instruction "play history"; after matching, the instruction matching module determines that the execution subject of "play history" is the video application, the "play history" semantic instruction is sent to the video application, and the video application executes it. The "play history" semantic instruction is an application-level semantic instruction registered by the application.
For example, after the video application executes the semantic instruction, the user interface of the mobile phone is specifically shown in fig. 7C.
Furthermore, when the user inputs the voice "slide down", the semantic parsing module obtains the semantic instruction "slide down". The instruction matching module matches the semantic instruction and determines that its execution subject is the "list" voice user interface component, because "slide down" is a semantic instruction carried by that component. The semantic instruction is sent to the "list" voice user interface component, the component executes it, and the play history list automatically slides down by a preset number of pixels.
For example, the user interface of the mobile phone after the "list" voice user interface component executes the semantic instruction is specifically shown in fig. 7D.
Further, when the user selects one of the videos to play, for example by inputting the voice "video E", the instruction matching module matches the semantic instruction and determines that its execution subject is the "list" voice user interface component, since the character string of each Item's TextView in the "list" component serves as the click semantic instruction of that Item. The mobile phone then triggers the operation of clicking the "video E" item, as shown in fig. 7E. The user interface of the mobile phone after the "list" voice user interface component executes the semantic instruction is shown in fig. 7F, and the video application starts playing the video.
It is to be understood that in fig. 7A-7F, video frames (images) are illustrated with a number of irregularly arranged horizontal lines; these lines do not limit the particular video frame (image) shown.
Further, when the video application plays a video, various voice user interface components shown in fig. 7F are also present on the interface for playing the video for the user to perform voice interaction, which is similar to fig. 5B and will not be described herein again.
Through the voice interaction method provided by the embodiment of the application, the user can realize voice control on the whole process of the mobile phone, so that the convenience of voice interaction is improved, and the user experience is improved.
In summary, compared with the implementation of voice interaction methods such as Xiao e and the like in the prior art, the voice interaction method provided by the embodiment of the application can not only implement system-level voice interaction, but also implement application-level voice interaction and component-level voice interaction, greatly expand the application range of voice interaction functions, implement more deep voice interaction, and more comprehensive voice interaction, and effectively improve user experience.
Furthermore, the voice interaction method provided by the embodiment of the application avoids the occurrence of semantic instruction conflict in the voice interaction process, can more accurately realize voice control, and improves the user experience.
In addition, an application developed by a third-party developer can realize the voice interaction function based on the electronic device and the voice user interface component provided by the embodiment of the application; that is, the open capability of the electronic device's voice interaction framework is extended, the user's need to install an application at any time and use voice interaction at any time can be satisfied, and the user experience is improved. Moreover, if a third-party developer wants to enable voice interaction for an application, there is no need to build a separate voice function for the application, which reduces the development cost of the application.
Further, it should be noted that the voice interaction method provided in the embodiment of the present application may be applied to a multi-modal interactive UI system, and specifically, please refer to fig. 8, the multi-modal interactive UI system includes a multi-modal interactive UI system development framework and components, and a plurality of UI interaction modes based on the multi-modal interactive UI system development framework and components, where the UI interaction modes include touch interaction, remote control interaction, voice interaction, gesture interaction, gaze interaction, and the like. The voice interaction mode is realized based on the voice interaction method provided by the embodiment of the application. In addition, other similar UI interaction modes can be further expanded.
With the future use of multiple devices across all scenarios, and especially with cameras on large-screen devices, the development of gaze interaction and gesture interaction is facilitated. The multi-modal interactive UI system can greatly improve the user experience.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device 900 provided according to an embodiment of the present application. The electronic device 900 may include one or more processors 901 coupled to a controller hub 904. For at least one embodiment, the controller hub 904 communicates with the processor 901 via a multi-drop bus such as a front-side bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection. The processor 901 executes instructions that control general types of data processing operations. In one embodiment, the controller hub 904 includes, but is not limited to, a Graphics Memory Controller Hub (GMCH) (not shown) and an input/output hub (IOH) (which may be on separate chips) (not shown), where the GMCH includes memory and graphics controllers and is coupled to the IOH.
The electronic device 900 may also include a coprocessor 906 and memory 902 coupled to the controller hub 904. Alternatively, one or both of the memory 902 and GMCH may be integrated within the processor 901 (as described herein), with the memory 902 and coprocessor 906 coupled directly to the processor 901 and to the controller hub 904, with the controller hub 904 and IOH in a single chip.
The memory 902 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two.
In one embodiment, coprocessor 906 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. The optional nature of coprocessor 906 is represented in FIG. 9 by dashed lines.
In one embodiment, electronic device 900 may further include a Network Interface (NIC) 903. The network interface 903 may include a transceiver to provide a radio interface for the electronic device 900 to communicate with any other suitable device (e.g., front end module, antenna, etc.). In various embodiments, the network interface 903 may be integrated with other components of the electronic device 900. The network interface 903 may implement the functions of the communication unit in the above-described embodiments.
The electronic device 900 may further include input/output (I/O) devices 905. The input/output (I/O) devices 905 may include: a user interface designed to enable the user to interact with the electronic device 900; a peripheral component interface designed to enable peripheral components to interact with the electronic device 900; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 900.
It is noted that fig. 9 is merely exemplary. That is, although fig. 9 shows the electronic device 900 as including a plurality of components, such as the processor 901, the controller hub 904 and the memory 902, in a practical application a device using the methods according to the embodiments of the present application may include only a part of the components of the electronic device 900, for example only the processor 901 and the NIC 903. The optional components in fig. 9 are shown in dashed lines.
One or more tangible, non-transitory computer-readable media for storing data and/or instructions may be included in the memory of the electronic device 900. A computer-readable storage medium has stored therein instructions, and in particular, temporary and permanent copies of the instructions.
In this embodiment, the electronic device 900 may specifically be a mobile phone, and the instructions stored in the memory of the electronic device may include: instructions that when executed by at least one unit in a processor cause a handset to implement a voice interaction method as mentioned above.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an SoC (System on Chip) 1000 according to an embodiment of the present disclosure. In fig. 10, like parts have the same reference numerals. In addition, the dashed box is an optional feature of the more advanced SoC 1000. The SoC1000 may be used in any electronic device according to embodiments of the present application. According to different devices and different instructions stored in the devices, corresponding functions can be realized.
In fig. 10, the SoC1000 includes: an interconnect unit 1002 coupled to the processor 1001; a system agent unit 1006; a bus controller unit 1005; an integrated memory controller unit 1003; a set of one or more coprocessors 1007, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; an SRAM (static random access memory) unit 1008; and a DMA (direct memory access) unit 1004. In one embodiment, the coprocessor 1007 comprises a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
Included in SRAM cell 1008 may be one or more computer-readable media for storing data and/or instructions. A computer-readable storage medium may have stored therein instructions, in particular, temporary and permanent copies of the instructions. The instructions may include: instructions that, when executed by at least one unit in a processor, cause an electronic device to implement a voice interaction method as mentioned above.
The embodiments of the mechanisms disclosed in the embodiments of the present application can be implemented in software, hardware, firmware, or a combination of these approaches. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems including at least one processor and a storage system (including volatile and non-volatile memory and/or storage units).
Program code may be applied to the input instructions to perform the functions described in the text and to generate output information. The output information may be applied to one or more output devices in a known manner. It is appreciated that in embodiments of the present application, the processing system may be a microprocessor, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), the like, and/or any combination thereof. According to another aspect, the processor may be a single core processor, a multi-core processor, and/or the like, and/or any combination thereof.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language. In either case, the language may be a compiled or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media bearing or storing instructions that can be read and executed by one or more processors. For example, the instructions may be distributed via a network or via another computer-readable medium. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet in an electrical, optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing is a more detailed description of the present application, and the present application is not intended to be limited to these details. Various changes in form and detail, including simple deductions or substitutions, may be made by those skilled in the art without departing from the spirit and scope of the present application.

Claims (14)

1. A voice interaction method for an electronic device, the method comprising:
receiving voice from a user and resolving the voice into semantic instructions;
determining an instruction execution subject corresponding to the semantic instruction based on the semantic instruction; the instruction execution subject is any one of an operating system of the electronic equipment, an application in the electronic equipment and a voice user interface component associated with the application in the electronic equipment;
the instruction execution subject corresponding to the semantic instruction performs an action based on the semantic instruction.
2. The voice interaction method of claim 1, wherein the method further comprises:
opening an application; the application includes at least one voice user interface component associated with the application;
determining a voice user interface component corresponding to the semantic instruction based on the semantic instruction;
the voice user interface component corresponding to the semantic instruction performs an action based on the semantic instruction.
3. The voice interaction method of claim 2,
each of the voice user interface components includes voice information and action information; wherein the voice information comprises at least one piece of voice text, and each piece of the voice text comprises corresponding action information;
the action information is used for describing the action executed by the corresponding voice user interface component.
4. The voice interaction method of claim 3, wherein based on the semantic directive, determining the voice user interface component corresponding to the semantic directive, the voice user interface component performing an action based on the semantic directive, comprising:
searching the voice text of the at least one voice user interface component associated with the application for voice text matching the semantic instruction;
determining action information corresponding to the voice text matched with the semantic instruction; and
and enabling the voice user interface component to execute corresponding actions according to the action information.
5. The voice interaction method of claim 3, wherein the voice user interface component comprises a prompt message presenting content associated with the voice message.
6. The voice interaction method of claim 3, wherein the voice user interface components each include different voice information therebetween.
7. The voice interaction method of any one of claims 2-6, characterized in that the at least one voice user interface component associated with the application comprises a voice user interface component that is visibly displayed on an interface of the application and/or a voice user interface component that is invisibly included in the application.
8. The voice interaction method of any of claims 2-7, wherein the voice user interface component comprises feedback information, the voice user interface component performs an action based on the semantic instruction and displays feedback information associated with the action, wherein the feedback information comprises visual feedback content and/or auditory feedback content.
9. The voice interaction method of claim 3, wherein the application has at least one application-level voice text and at least one application-level action information, wherein each of the application-level voice texts includes corresponding application-level action information; and in the case of matching to the application-level speech text corresponding to the semantic instruction, or in the case of matching to the application-level speech text corresponding to the semantic instruction and to the speech text of the speech user interface component, performing the action described by the application-level action information corresponding to the application-level speech text.
10. The voice interaction method according to claim 9, wherein, based on the semantic instruction, a search is made in an application-level voice text of the application and a voice text of the at least one voice user interface component included in the interface of the application, and in a case where the application-level voice text and the voice text of the voice user interface component corresponding to the semantic instruction are matched, an action described by the application-level action information corresponding to the application-level voice text is executed.
11. The voice interaction method of claim 9, wherein an operating system of the electronic device has at least one system level voice text and at least one system level action information, wherein each of the system level voice texts comprises corresponding system level action information; and in the event that the system level speech text corresponding to the semantic instruction is matched, or in the event that either the application level speech text corresponding to the semantic instruction or the speech text of the speech user interface component is matched to the system level speech text, performing an action described by the system level action information corresponding to the system level speech text.
12. The voice interaction method of claim 11, searching among the system level voice text, the application level voice text, and the voice text of the at least one voice user interface component included in the interface of the application based on the semantic instruction, and performing an action described by system level action information corresponding to the system level voice text if any of the application level voice text and the voice text of the voice user interface component corresponding to the semantic instruction matches the system level voice text.
13. An electronic device, comprising:
a memory for storing a computer program, the computer program comprising program instructions;
a processor for executing the program instructions to cause the electronic device to perform the following voice interaction method:
receiving voice from a user and parsing the voice into a semantic instruction;
determining, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, wherein the instruction execution subject is any one of an operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device;
performing, by the instruction execution subject corresponding to the semantic instruction, an action based on the semantic instruction.
14. A computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform a voice interaction method comprising:
receiving voice from a user and parsing the voice into a semantic instruction;
determining, based on the semantic instruction, an instruction execution subject corresponding to the semantic instruction, wherein the instruction execution subject is any one of an operating system of the electronic device, an application in the electronic device, and a voice user interface component associated with the application in the electronic device;
performing, by the instruction execution subject corresponding to the semantic instruction, an action based on the semantic instruction.
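Claims 13 and 14 restate the same overall flow for a device and a storage medium: the received voice is parsed into a semantic instruction, an instruction execution subject (the operating system, an application, or a voice user interface component) is determined, and that subject performs the action. The following Kotlin sketch is a minimal illustration of that dispatch, assuming a trivial stand-in for speech recognition and semantic parsing; ExecutionSubject, determineSubject, and the routing rules are hypothetical names and choices, not taken from the disclosure.

// Hypothetical dispatch sketch; speech recognition and semantic parsing are faked.
sealed interface ExecutionSubject {
    fun perform(instruction: String)
}

object OperatingSystem : ExecutionSubject {
    override fun perform(instruction: String) = println("operating system handles: $instruction")
}

class Application(private val name: String) : ExecutionSubject {
    override fun perform(instruction: String) = println("$name handles: $instruction")
}

class VoiceComponent(private val label: String) : ExecutionSubject {
    override fun perform(instruction: String) = println("component '$label' handles: $instruction")
}

// Stand-in for ASR plus semantic analysis: normalize the raw utterance.
fun parseToSemanticInstruction(rawSpeech: String): String = rawSpeech.trim().lowercase()

// Toy routing rules deciding which subject executes the instruction.
fun determineSubject(instruction: String): ExecutionSubject = when {
    instruction.startsWith("open") -> OperatingSystem            // e.g. "open settings"
    instruction.contains("episode") -> Application("VideoApp")   // e.g. "next episode"
    else -> VoiceComponent("confirm button")                     // e.g. "confirm"
}

fun main() {
    val instruction = parseToSemanticInstruction("Next Episode")
    determineSubject(instruction).perform(instruction)  // prints: VideoApp handles: next episode
}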
CN202010274489.7A 2020-04-09 2020-04-09 Voice interaction method and electronic equipment Withdrawn CN111599358A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010274489.7A CN111599358A (en) 2020-04-09 2020-04-09 Voice interaction method and electronic equipment
PCT/CN2021/085523 WO2021204098A1 (en) 2020-04-09 2021-04-06 Voice interaction method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010274489.7A CN111599358A (en) 2020-04-09 2020-04-09 Voice interaction method and electronic equipment

Publications (1)

Publication Number Publication Date
CN111599358A (en) 2020-08-28

Family

ID=72187469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010274489.7A Withdrawn CN111599358A (en) 2020-04-09 2020-04-09 Voice interaction method and electronic equipment

Country Status (2)

Country Link
CN (1) CN111599358A (en)
WO (1) WO2021204098A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108142A (en) * 2017-12-14 2018-06-01 广东欧珀移动通信有限公司 Voice information processing method, device, terminal device and storage medium
KR102096590B1 (en) * 2018-08-14 2020-04-06 주식회사 알티캐스트 Gui voice control apparatus using real time command pattern matching and method thereof
CN110060672A (en) * 2019-03-08 2019-07-26 华为技术有限公司 A kind of sound control method and electronic equipment
CN111599358A (en) * 2020-04-09 2020-08-28 华为技术有限公司 Voice interaction method and electronic equipment

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021204098A1 (en) * 2020-04-09 2021-10-14 华为技术有限公司 Voice interaction method and electronic device
WO2022100283A1 (en) * 2020-11-13 2022-05-19 海信视像科技股份有限公司 Display device, control triggering method and scrolling text detection method
CN114650332A (en) * 2020-12-21 2022-06-21 华为技术有限公司 Information processing method, electronic equipment and system
CN114650332B (en) * 2020-12-21 2023-06-02 华为技术有限公司 Information processing method, system and computer storage medium
CN112612214A (en) * 2020-12-23 2021-04-06 青岛海尔科技有限公司 Method and system for generating functional interface and electronic equipment
CN113571049A (en) * 2021-07-22 2021-10-29 成都航盛智行科技有限公司 VR-based vehicle body control system and method
CN114639384A (en) * 2022-05-16 2022-06-17 腾讯科技(深圳)有限公司 Voice control method, device, equipment and computer storage medium
CN115225477A (en) * 2022-07-19 2022-10-21 北京天融信网络安全技术有限公司 Configuration maintenance method and device for frame type equipment, electronic equipment and storage medium
CN115225477B (en) * 2022-07-19 2023-12-01 北京天融信网络安全技术有限公司 Configuration maintenance method and device for frame type equipment, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021204098A1 (en) 2021-10-14

Similar Documents

Publication Publication Date Title
CN111599358A (en) Voice interaction method and electronic equipment
WO2020192456A1 (en) Voice interaction method and electronic device
US9843667B2 (en) Electronic device and call service providing method thereof
KR101954774B1 (en) Method for providing voice communication using character data and an electronic device thereof
CN112470217A (en) Method for determining electronic device to perform speech recognition and electronic device
CN111147660B (en) Control operation method and electronic equipment
WO2020006711A1 (en) Message playing method and terminal
WO2023184825A1 (en) Video recording control method of electronic device, electronic device, and readable medium
WO2022143258A1 (en) Voice interaction processing method and related apparatus
US10269347B2 (en) Method for detecting voice and electronic device using the same
CN113488042B (en) Voice control method and electronic equipment
WO2019242415A1 (en) Position prompt method, device, storage medium and electronic device
WO2022247466A1 (en) Resource display method, terminal and server
US20220286757A1 (en) Electronic device and method for processing voice input and recording in the same
CN115086888B (en) Message notification method and device and electronic equipment
CN105554246A (en) Call reminding method and device
US10938978B2 (en) Call control method and apparatus
CN110737765A (en) Dialogue data processing method for multi-turn dialogue and related device
CN115942253B (en) Prompting method and related device
KR102075750B1 (en) Method for providing voice communication using character data and an electronic device thereof
WO2023246604A1 (en) Handwriting input method and terminal
WO2023207149A1 (en) Speech recognition method and electronic device
WO2022143048A1 (en) Dialogue task management method and apparatus, and electronic device
CN114449103B (en) Reminding method, graphical user interface and terminal
WO2023197949A1 (en) Chinese translation method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20200828)