CN114694646A - Voice interaction processing method and related device - Google Patents

Voice interaction processing method and related device

Info

Publication number
CN114694646A
Authority
CN
China
Prior art keywords
voice
electronic device
cloud server
intention
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011636583.9A
Other languages
Chinese (zh)
Inventor
黄龙
王翃宇
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011636583.9A
Priority to PCT/CN2021/139631 (published as WO2022143258A1)
Publication of CN114694646A
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)

Abstract

The application discloses a voice interaction processing method and a related device, relating to natural language processing in the field of artificial intelligence and, in particular, to the processing of multi-turn conversations. The method includes the following steps: the electronic device receives a first voice signal; when the electronic device is connected to the cloud server, it uploads the first voice signal to the cloud server; while the cloud server processes the voice data, it sends a corresponding instruction to the electronic device, instructing it to perform a corresponding action, and synchronously sends the context of the voice conversation (the intention and slot information) to the electronic device. In this way, if a network interruption during a multi-turn voice service causes processing to switch from the cloud server to the electronic device, the electronic device can continue the original voice service based on the conversation context and the next received voice signal, so the problem of interrupted multi-turn voice services can be solved.

Description

Voice interaction processing method and related device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a voice interaction processing method and related apparatus.
Background
With the gradual development of voice interaction technology, more and more intelligent devices have a voice interaction function. Voice interaction means that a user obtains a voice/text response by inputting voice/text; for example, the user asks "what is the weather like today" by voice, and the intelligent device replies "sunny, 25 to 29 degrees" by voice.
Current voice interaction systems require network support. Under certain conditions (such as a network interruption), the original voice interaction service cannot continue to be executed, which affects the user experience.
Disclosure of Invention
The embodiment of the application provides a voice interaction processing method and a related device, which are used for solving the problem of voice service interruption of multi-turn conversation and improving the voice service processing capability.
In a first aspect, the present application provides a voice interaction processing method, which is applied to an electronic device and includes: the electronic device receives an input first voice signal; when the electronic device is connected to a cloud server, the electronic device uploads the first voice signal to the cloud server; the electronic device receives first voice reply content, an intention, and one or more slot position information corresponding to the intention, where the intention and the one or more slot position information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intention and the one or more slot position information; after outputting the first voice reply content, the electronic device receives a second voice signal; when the communication quality between the electronic device and the cloud server is poor, the electronic device recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation based on the intention, the one or more slot position information, and the semantic information; the electronic device performs the first operation.
According to the embodiment of the application, in the process of processing the voice service, the voice service can be processed by the cloud server or the electronic device, when the cloud server processes the voice data, the cloud server sends a corresponding instruction to the electronic device to instruct the electronic device to execute a corresponding action, and synchronously sends the context (intention and slot position information) of the voice conversation to the electronic device. Therefore, if network interruption occurs in the voice service of multiple rounds of conversations, the voice service is switched from processing on the cloud server to processing by the electronic equipment, and the electronic equipment can continue to execute the original voice service based on the context of the voice conversation and the received next voice signal, so that the problem of voice service interruption of multiple rounds of conversations can be solved, and the capability of voice service processing is improved.
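By way of illustration only, the device-side flow described above might look roughly as follows. This is a minimal Kotlin sketch under assumed names (CloudClient, LocalNlu, VoiceSession and their signatures are introduced here for illustration and are not part of the claimed method):

```kotlin
// Minimal sketch (assumed names, not the claimed implementation) of the device-side flow:
// use the cloud server while the link is healthy, cache the dialogue context (intention +
// slot position information) it returns, and continue the multi-turn dialogue on-device
// when communication quality degrades.
data class Slot(val name: String, var value: String? = null)
data class DialogContext(val intent: String, val slots: MutableList<Slot>)
data class CloudReply(val voiceReply: String, val context: DialogContext)

interface CloudClient { fun process(signal: ByteArray): CloudReply? }  // null on failure or timeout
interface LocalNlu { fun understand(signal: ByteArray): String }       // on-device ASR + NLU

class VoiceSession(private val cloud: CloudClient, private val localNlu: LocalNlu) {
    private var context: DialogContext? = null

    fun onVoiceSignal(signal: ByteArray): String {
        val reply = cloud.process(signal)
        return if (reply != null) {
            context = reply.context            // context synchronized by the cloud server
            reply.voiceReply                   // first voice reply content
        } else {
            // Poor communication quality: fill the missing slot locally and keep the dialogue going.
            val semantic = localNlu.understand(signal)
            val ctx = context ?: return "Sorry, please repeat that."
            ctx.slots.firstOrNull { it.value == null }?.value = semantic
            "Performing '${ctx.intent}' with " + ctx.slots.joinToString { "${it.name}=${it.value}" }
        }
    }
}
```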
In one possible implementation, the electronic device determines the first operation based on the intention, the one or more slot position information, and the semantic information by: the electronic device identifies that the semantic information matches a missing slot position in the one or more slot position information, and fills in the semantic information as the value of that slot position; the electronic device then determines the first operation based on the intention and the filled-in one or more slot position information. This specifically describes the process by which the electronic device handles the original voice service based on the second voice signal: because the electronic device has obtained the intention and the slot position information corresponding to the first voice signal, it can continue to perform slot filling on the received second voice signal based on that intention and slot position information, and can therefore continue to process the original voice service.
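A possible shape of this slot-matching step, again purely illustrative (SlotSpec and the pattern-based matching rule are assumptions; the application does not prescribe how the semantic information is matched to the missing slot position):

```kotlin
// Hypothetical matching rule for a single missing slot position: the locally recognized
// semantic information is accepted only if it fits the slot's expected pattern.
data class SlotSpec(val name: String, val pattern: Regex, var value: String? = null)

fun tryFill(slot: SlotSpec, semantic: String): Boolean =
    if (slot.value == null && slot.pattern.matches(semantic)) {
        slot.value = semantic
        true
    } else {
        false
    }

fun main() {
    val callee = SlotSpec("callee", Regex("[\\p{L} ]+"))   // who to call
    println(tryFill(callee, "Zhang San"))                   // true: the slot is now filled
    println(callee.value)                                    // Zhang San
}
```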
In one possible implementation, the first operation includes one or more of: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
In one possible implementation, the method further includes: the electronic equipment receives a first instruction sent by a cloud server; and the electronic equipment displays the text content of the first voice reply content based on the first instruction and/or jumps to a corresponding interface.
In one possible implementation manner, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, reply data from the cloud server is not received within a preset time. This describes when poor communication quality may occur: the communication quality between the electronic device and the cloud server may be poor when the electronic device uploads the second voice signal, or the communication quality may be poor when the cloud server issues the voice reply content for the second voice signal.
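These two conditions could be checked roughly as in the following sketch (the 2-second preset time and the function shape are assumptions made here for illustration only):

```kotlin
// Sketch of the two situations the text treats as poor communication quality: the upload
// of the voice signal fails, or no reply data arrives within a preset time after the upload.
import java.util.concurrent.CompletableFuture
import java.util.concurrent.TimeUnit

fun isCommunicationPoor(upload: () -> CompletableFuture<ByteArray>?): Boolean {
    val pending = upload() ?: return true        // case 1: the upload itself failed
    return try {
        pending.get(2, TimeUnit.SECONDS)          // wait for the cloud server's reply data
        false                                     // reply received in time: quality is acceptable
    } catch (e: Exception) {
        true                                      // case 2: timeout (or transport error)
    }
}
```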
In one possible implementation, an electronic device receives a first speech signal, including: the electronic device receives a first voice signal through a voice assistant application.
In a second aspect, the present application provides a voice interaction processing method, which is applied to a cloud server, and includes: the method comprises the steps that a cloud server receives a first voice signal uploaded by electronic equipment; the cloud server identifies the first voice signal to obtain a corresponding intention and one or more slot position information corresponding to the intention, and determines first voice reply content based on the intention and the one or more slot position information; the cloud server sends the first voice reply content, the intent, and the one or more slot location information to the electronic device.
According to the embodiment of the application, the cloud server sends the instruction to the electronic equipment, and when the electronic equipment is instructed to execute the corresponding action, the context (intention and slot position information) of the voice conversation is synchronously sent to the electronic equipment. Therefore, if network interruption occurs in the voice service of multiple rounds of conversations, the voice service is switched from processing on the cloud server to processing by the electronic equipment, and the electronic equipment can continue to execute the original voice service based on the context of the voice conversation and the received next voice signal, so that the problem of voice service interruption of multiple rounds of conversations can be solved, and the capability of voice service processing is improved.
In one possible implementation, the cloud server sends the first voice reply content, the intention, and the one or more slot position information to the electronic device as follows: the cloud server sends the first voice reply content, the intention, and the one or more slot position information to the electronic device when at least one slot position information in the one or more slot position information is missing. This describes the occasion on which the cloud server sends the intention and slot position information to the electronic device: when slot position information is missing, the current voice service is judged to be a multi-turn conversation service, and the cloud server therefore sends the intention and slot position information to the electronic device; if no slot position information is missing, the voice service can be processed and completed in a single round without acquiring a next voice signal. By first judging whether slot position information is missing before deciding whether to send the intention and slot position information, resources can be saved.
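The cloud-side decision could be sketched as follows (CloudResponse and buildResponse are illustrative names introduced here; only the idea of sending the context when a slot position is missing comes from the text above):

```kotlin
// Sketch: the dialogue context (intention and slot position information) is sent down to
// the electronic device only when at least one slot position is still missing, i.e. when
// another dialogue turn will be needed.
data class CloudResponse(
    val voiceReply: String,
    val intent: String? = null,
    val slots: Map<String, String?>? = null
)

fun buildResponse(intent: String, slots: Map<String, String?>, voiceReply: String): CloudResponse =
    if (slots.values.any { it == null })
        CloudResponse(voiceReply, intent, slots)   // multi-turn: synchronize the context to the device
    else
        CloudResponse(voiceReply)                   // single turn: no context needs to be sent
```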
In a third aspect, the present application provides a voice interaction processing system, which includes an electronic device and a cloud server, wherein,
an electronic device for receiving a first voice signal;
the electronic equipment is also used for uploading the first voice signal to the cloud server under the condition that the electronic equipment is connected with the cloud server;
the cloud server is used for identifying the first voice signal, obtaining a corresponding intention and one or more slot position information corresponding to the intention, and determining first voice reply content based on the intention and the one or more slot position information;
the cloud server is further used for sending the first voice reply content, the intention and the one or more slot position information to the electronic equipment;
the electronic equipment is also used for receiving a second voice signal after outputting the first voice reply content;
the electronic equipment is further used for identifying the second voice signal under the condition that the communication quality of the electronic equipment and the cloud server is poor to obtain corresponding semantic information, and determining a first operation based on the intention and one or more slot position information and the semantic information;
the electronic equipment is also used for executing the first operation.
According to the embodiment of the application, in the process of processing the voice service, the voice service can be processed by the cloud server or the electronic device, when the cloud server processes the voice data, the cloud server sends a corresponding instruction to the electronic device to instruct the electronic device to execute a corresponding action, and synchronously sends the context (intention and slot position information) of the voice conversation to the electronic device. Therefore, if network interruption occurs in the voice service of multiple rounds of conversations, the voice service is switched from processing on the cloud server to processing by the electronic equipment, and the electronic equipment can continue to execute the original voice service based on the context of the voice conversation and the received next voice signal, so that the problem of voice service interruption of multiple rounds of conversations can be solved, and the capability of voice service processing is improved.
In one possible implementation manner, the electronic device is further configured to identify that the semantic information matches a missing slot position in the one or more slot position information, and to fill in the semantic information as the value of that slot position; the electronic device is further configured to determine the first operation based on the intention and the filled-in one or more slot position information. This specifically describes the process by which the electronic device handles the original voice service based on the second voice signal: because the electronic device has obtained the intention and the slot position information corresponding to the first voice signal, it can continue to perform slot filling on the received second voice signal based on that intention and slot position information, and can therefore continue to process the original voice service.
In one possible implementation, the first operation includes one or more of: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
In a possible implementation manner, the electronic device is further configured to receive a first instruction sent by the cloud server; and the electronic equipment is also used for displaying the text content of the first voice reply content based on the first instruction and/or jumping to a corresponding interface.
In one possible implementation manner, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, reply data from the cloud server is not received within a preset time. This describes when poor communication quality may occur: the communication quality between the electronic device and the cloud server may be poor when the electronic device uploads the second voice signal, or the communication quality may be poor when the cloud server issues the voice reply content for the second voice signal.
In one possible implementation, the electronic device is further configured to receive a first voice signal via a voice assistant application.
In one possible implementation manner, the cloud server is further configured to send the first voice reply content, the intention, and the one or more slot position information to the electronic device when at least one slot position information in the one or more slot position information is missing. This describes the occasion on which the cloud server sends the intention and slot position information to the electronic device: when slot position information is missing, the current voice service is judged to be a multi-turn conversation service, and the cloud server therefore sends the intention and slot position information to the electronic device; if no slot position information is missing, the voice service can be processed and completed in a single round without acquiring a next voice signal. By first judging whether slot position information is missing before deciding whether to send the intention and slot position information, resources can be saved.
In a fourth aspect, the present application provides an electronic device, comprising: one or more processors, one or more memories; the one or more memories are coupled to the one or more processors; the one or more memories are for storing computer program code comprising computer instructions; when the computer instructions are executed on the processor, the electronic device is caused to execute the voice interaction processing method in any one of the possible implementation manners of the first aspect.
In a fifth aspect, the present application provides a cloud server, including: one or more processors, one or more memories; the one or more memories are coupled to the one or more processors; the one or more memories are for storing computer program code comprising computer instructions; when the computer instructions are executed on the processor, the cloud server is caused to execute the voice interaction processing method in any one of the possible implementation manners of the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer storage medium, which includes computer instructions, and when the computer instructions are executed on an electronic device, the electronic device is caused to execute the voice interaction processing method in any possible implementation manner of any one of the foregoing aspects.
In a seventh aspect, an embodiment of the present application provides a computer program product, which, when running on a computer, causes the computer to execute the voice interaction processing method in any one of the possible implementation manners of the foregoing aspects.
Drawings
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a software structure of an electronic device according to an embodiment of the present application;
Fig. 4 is a schematic diagram illustrating a voice interaction processing method according to an embodiment of the present application;
Figs. 5A-5B are schematic diagrams illustrating a further voice interaction processing method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a call scenario according to an embodiment of the present application;
Figs. 7A-7B are schematic scene diagrams of a voice interaction processing method according to an embodiment of the present application;
Figs. 8A-8D are schematic diagrams of a set of application interfaces according to an embodiment of the present application;
Fig. 9 is a schematic flowchart of a voice interaction processing method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. In the description of the embodiments herein, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
In the following, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature; in the description of the embodiments of the application, unless stated otherwise, "plurality" means two or more. The terms "intermediate", "left", "right", "upper", "lower", and the like indicate orientations or positional relationships based on those shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation; therefore, they should not be construed as limiting the application.
Fig. 1 shows a scene diagram of a voice interaction system 10 according to an embodiment of the present application. As shown in Fig. 1, the system 10 includes an electronic device 100 and a cloud server 200. It should be noted that the system 10 shown in Fig. 1 is only an example; those skilled in the art will understand that, in practical applications, the system 10 generally includes a plurality of electronic devices 100 and cloud servers 200, and the application does not limit the number of electronic devices 100 and cloud servers 200 included in the system 10.
The electronic device 100 is an intelligent device with a voice interaction function; it can receive voice instructions from a user and return voice or non-voice information to the user. In the embodiment of the present application, the electronic device 100 may be a mobile phone, a tablet computer, a notebook computer, an Ultra-mobile Personal Computer (UMPC), a handheld computer, a netbook, a Personal Digital Assistant (PDA, also called a palmtop), a virtual reality device, a portable internet device, a data storage device, a camera, a wearable device (e.g., a wireless headset, a smart watch, a smart bracelet, smart glasses, a head-mounted device (HMD), electronic clothing, an electronic bracelet, an electronic necklace, an electronic accessory, an electronic tattoo, or a smart mirror), or a smart home device (e.g., a smart speaker, a smart refrigerator, a smart desk lamp, an electric lamp, a smart television, a smart microwave oven, a smart fan, an air conditioner, a smart robot, or smart curtains), and so forth. One application scenario involved in the embodiment of the present application is a home scenario, that is, the electronic device 100 is placed in a user's home, and the user can send voice instructions to the electronic device 100 to implement functions such as accessing the internet, ordering songs, shopping, checking the weather forecast, controlling other smart home devices in the home, and the like.
The cloud server 200 communicates with the electronic device 100 through a network and may be, for example, a server physically located at one or more sites. The cloud server 200 provides a recognition service for the voice data received by the electronic device 100 to obtain a text representation of the voice data input by the user; the cloud server 200 also obtains a representation of the user's intention based on the text representation, and generates a response instruction to return to the electronic device 100. The electronic device 100 executes corresponding actions according to the response instruction to provide corresponding services for the user, such as setting an alarm clock, making a call, sending a mail, broadcasting information, playing a song, playing a video, and the like. Of course, the electronic device 100 may also output a corresponding voice response to the user according to the response instruction, or display corresponding text content; this is not limited in this embodiment of the application.
The electronic apparatus 100 according to the embodiment of the present application will be described first.
Referring to fig. 2, fig. 2 shows a schematic structural diagram of an exemplary electronic device 100 provided in an embodiment of the present application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be, among other things, a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose-input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, a bus or Universal Serial Bus (USB) interface, and the like.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K via an I2C interface, such that the processor 110 and the touch sensor 180K communicate via an I2C bus interface to implement the touch functionality of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 110 and the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device. And the earphone can also be used for connecting an earphone and playing audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.
It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 150 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional modules, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including UWB, Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (WiFi) network), bluetooth (bluetooth, BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves.
In some embodiments, antenna 1 of electronic device 100 is coupled to mobile communication module 150 and antenna 2 is coupled to wireless communication module 160 so that electronic device 100 can communicate with networks and other devices through wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a Global Positioning System (GPS), a global navigation satellite system (GLONASS), a beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a Satellite Based Augmentation System (SBAS).
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
In some embodiments of the present application, the interface content currently output by the system is displayed in the display screen 194. For example, the interface content is an interface provided by an instant messaging application.
The electronic device 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, and the application processor, etc.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform fourier transform or the like on the frequency bin energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be implemented by the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The internal memory 121 may include one or more Random Access Memories (RAMs) and one or more non-volatile memories (NVMs).
The random access memory may include static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous dynamic random-access memory (SDRAM), double data rate synchronous dynamic random-access memory (DDR SDRAM), such as fifth generation DDR SDRAM generally referred to as DDR5 SDRAM, and the like;
the nonvolatile memory may include a magnetic disk storage device, a flash memory (flash memory).
By operating principle, the flash memory may include NOR flash, NAND flash, 3D NAND flash, etc.; by the level order of the memory cells, it may include single-level cells (SLC), multi-level cells (MLC), triple-level cells (TLC), quad-level cells (QLC), etc.; and by storage specification, it may include universal flash storage (UFS), embedded multimedia memory cards (eMMC), etc.
The random access memory may be read and written directly by the processor 110, may be used to store executable programs (e.g., machine instructions) of an operating system or other programs in operation, and may also be used to store data of users and applications, etc.
The nonvolatile memory may also store executable programs, data of users and application programs, and the like, and may be loaded into the random access memory in advance for the processor 110 to directly read and write.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into a sound signal. The electronic apparatus 100 can listen to music through the speaker 170A or listen to a handsfree call.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic apparatus 100 receives a call or voice information, it can receive voice by placing the receiver 170B close to the ear of the person.
The microphone 170C, also referred to as a "microphone," is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can input a voice signal to the microphone 170C by speaking the user's mouth near the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and so on.
The headphone interface 170D is used to connect a wired headphone. The headphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 180A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The gyro sensor 180B may be used to determine the motion attitude of the electronic device 100. The air pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a hall sensor. The electronic device 100 may detect the opening and closing of the flip holster using the magnetic sensor 180D. The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The method can also be used for recognizing the posture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications. A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The ambient light sensor 180L is used to sense the ambient light level. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches. The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to unlock the fingerprint, access the application lock, photograph the fingerprint, answer an incoming call with the fingerprint, and so on. The temperature sensor 180J is used to detect temperature.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or thereabout, which is an operation of a user's hand, elbow, stylus, or the like contacting the display screen 194. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided via the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, different from the position of the display screen 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire a vibration signal of the human vocal part vibrating the bone mass. The bone conduction sensor 180M may also contact the human pulse to receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset, integrated into a bone conduction headset. The audio module 170 may analyze a voice signal based on the vibration signal of the bone mass vibrated by the sound part acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor can analyze heart rate information based on the blood pressure beating signal acquired by the bone conduction sensor 180M, so as to realize the heart rate detection function.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The electronic apparatus 100 may receive a key input, and generate a key signal input related to user setting and function control of the electronic apparatus 100.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also respond to different vibration feedback effects for touch operations applied to different areas of the display screen 194. Different application scenes (such as time reminding, receiving information, alarm clock, game and the like) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. The SIM card can be attached to and detached from the electronic device 100 by being inserted into the SIM card interface 195 or being pulled out of the SIM card interface 195.
Fig. 3 shows a block diagram of a software structure of the electronic device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages. The application packages may include, for example, camera, gallery, calendar, phone, map, navigation, WLAN, bluetooth, music, video, game, shopping, travel, instant messaging (e.g., short message) applications, and the like. In addition, the application package may further include: main screen (i.e. desktop), minus one screen, control center, notification center, etc.
As shown in fig. 3, the application layer in the embodiment of the present application includes a voice assistant and a voice processing module.
The voice processing module provides a voice processing capability that any application, such as the voice assistant application, may invoke. The electronic device 100 receives a voice signal through the voice assistant application, and the voice assistant application invokes the voice processing module to process the voice signal. The voice processing module includes the capabilities of speech recognition (ASR), semantic understanding (NLU), dialogue management (DM), natural language generation (NLG), and speech synthesis (TTS).
the voice recognition module is used for recognizing the voice signal to obtain text representation information of the voice signal. Specifically, the speech recognition module may first represent the speech signal as text data, and then perform word segmentation processing on the text data to obtain text representation information of the speech signal, that is, words in the speech signal are converted into readable inputs including binary codes, character sequences, and the like, for the electronic device 100. Typical speech recognition methods may be, for example: the embodiment of the present invention is not limited to a method based on a vocal tract model and speech knowledge, a template matching method (comparing similarity between a feature vector of an input speech signal and each template in a template library in sequence, and outputting the highest similarity as a recognition result), a method using a neural network, and the like.
The semantic understanding module is used for converting the text representation information of the voice signal into semantic information which can be understood by the electronic device 100. Semantic information includes entities, triples, intents, events, and so forth. With this information, the electronic device 100 can understand the user's language and determine what the user wants to do.
The dialog management module is configured to determine, based on the semantic information, an action to be performed by the electronic device 100 next, where the performed action includes one or more of the following: play the voice reply content (e.g., provide results, ask for specific restrictions, clarify or confirm the need, etc.); displaying the text content of the voice reply content; jumping to a corresponding interface; and so on.
Specifically, the dialogue management module determines the intention expressed in the semantic information, and then fills the slots corresponding to the intention according to the semantic information. The intention is what the user wants to do, and the slots corresponding to the intention are the information the user needs to complete the intention. For example, if the intention is "make a call", the slot corresponding to "make a call" is the object of the call; if the intention is "send a short message", there are two slots corresponding to "send a short message", namely the object to receive the short message and the content of the short message.
In essence, dialog management is a decision-making process: during the voice interaction, the dialogue management module continuously determines the next action to be executed according to the current state, so as to assist the user in completing the task of information acquisition or service acquisition. If the action requires voice interaction with the user, the natural language generation module is triggered to generate language text understandable by the user; finally, the generated language text is played by the speech synthesis module for the user to listen to.
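For illustration only, the intention and slot bookkeeping described above can be sketched as follows in Python; the class, intention, and slot names are illustrative assumptions and are not part of this application.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DialogContext:
    # Dialog context: the recognized intention and its slots (slot name -> filled value or None)
    intent: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_slots(self):
        return [name for name, value in self.slots.items() if value is None]

def next_action(ctx: DialogContext):
    # Decision step of dialog management: ask about a missing slot, or execute the intention
    missing = ctx.missing_slots()
    if missing:
        return ("ask", missing[0])          # triggers natural language generation and speech synthesis
    return ("execute", ctx.intent, dict(ctx.slots))

ctx = DialogContext(intent="make_call", slots={"callee": None})
print(next_action(ctx))                     # ('ask', 'callee')
ctx.slots["callee"] = "Xiaoming"
print(next_action(ctx))                     # ('execute', 'make_call', {'callee': 'Xiaoming'})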
The natural language generation module is used for converting the data set in the non-language format into text information in a language format understandable by a user. The natural language generation module determines which information should be included in the language text being constructed, organizes a reasonable text order, and combines multiple information into one sentence. Then some connective words and phrases are selected to form a complete sentence with good structure.
The speech synthesis module is used for converting the text information generated by the natural language generation module into artificial speech through a mechanical and electronic method.
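For illustration only, the speech synthesis step can be exercised with an off-the-shelf engine; the sketch below assumes the open-source pyttsx3 package, which is not part of this application.

import pyttsx3  # third-party offline text-to-speech engine, used here purely for illustration

def speak(text: str) -> None:
    # Convert the generated language text into audible speech
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

speak("Who do you want to call?")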
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 3, the application framework layer may include an input manager, a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a display manager, an activity manager, and the like. For illustrative purposes, in fig. 3 the application framework layer is illustrated as including an input manager, a window manager, a content provider, a view system, and an activity manager. It should be noted that any two of the input manager, the window manager, the content provider, the view system, and the activity manager may invoke each other.
The input manager is used for receiving instructions or requests reported by lower layers such as a kernel layer, a hardware abstraction layer and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like. In this application, the window manager is configured to display a window including one or more shortcut controls when the electronic device 100 meets a preset trigger condition.
The activity manager is used to manage the active services running in the system, including processes, applications, services, task information, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures. In this application, the view system is configured to display a shortcut region on the display screen 103 when the electronic device 100 meets a preset trigger condition, where the shortcut region includes one or more shortcut controls added to the electronic device 100. The position and the layout of the shortcut region, and the icon, the position, the layout and the function of the control in the shortcut region are not limited.
The display manager is used for transmitting display content to the kernel layer.
The phone manager is used to provide communication functions of the electronic device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages that automatically disappear after a short stay and require no user interaction, for example a notification of download completion or a message alert. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as a notification of an application running in the background, or present a notification on the screen in the form of a dialog window, for example prompting text information in the status bar, playing a prompt tone, vibrating the electronic device, or flashing the indicator light.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is the functions that the java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), media libraries (media libraries), three-dimensional graphics processing libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports playback and recording in a variety of commonly used audio and video formats, as well as still image files. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver, a sensor driver, a driver of the touch chip, an input system, and the like. For convenience of illustration, in fig. 3 the kernel layer is illustrated as including an input system, a driver of the touch chip, a display driver, and a storage driver. The display driver and the storage driver may be arranged together in a driver module.
It is to be understood that the illustrated structure of the present application does not constitute a specific limitation to the electronic device 100. In other embodiments, electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The following describes the technical principle of the voice interaction system 10 with reference to the voice interaction system 10 of the embodiment of the present application. FIG. 4 shows a voice interaction process in the voice interaction system 10. The cloud server 200 and the electronic device 100 communicate with each other through a network.
First, the electronic device 100 detects an incoming voice signal, and the electronic device 100 starts the voice interaction function. In some embodiments, the electronic device may receive the first voice signal through the voice assistant application (APP). The electronic device 100 detects whether the received voice signal includes a target object (e.g., a preset wakeup word); if it does, the electronic device 100 enters the interactive state and starts the voice interaction function. The target object may be preset when the electronic device 100 leaves the factory, may be preset in the voice assistant application, or may be set by the user during the use of the electronic device 100.
The electronic device 100 distributes the received voice signal based on a preset rule; the distribution paths include path one and path two. One preset rule is: when the network quality is good, the electronic device 100 uploads the received voice signal to the cloud server 200 for processing (path one), where good network quality means that data transmission (including uplink and downlink data transmission) can be performed between the electronic device 100 and the cloud server 200; when the network quality is poor or the network is disconnected, the electronic device 100 processes the received voice signal locally (path two), where poor network quality or disconnection means that data transmission (including uplink or downlink data transmission) cannot be performed between the electronic device 100 and the cloud server 200, or that the data transmission rate is lower than a threshold. The preset rule may also distribute according to the intention corresponding to the recognized voice signal: in short, when the intention corresponding to the voice signal can be completed locally, such as making a call, sending a short message, or opening the gallery, processing may be performed on the electronic device 100; when the intention corresponding to the voice signal requires the network, such as searching a web page or playing music online, processing may be performed on the cloud server 200.
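For illustration only, the distribution control described above can be sketched as follows, assuming a measured transmission rate and a list of intentions known to be completable locally; the threshold and the intention names are illustrative assumptions.

from typing import Optional

LOCAL_INTENTS = {"make_call", "send_sms", "open_gallery"}   # intentions that can be completed on the device
RATE_THRESHOLD = 50_000                                     # illustrative threshold, bytes per second

def choose_path(data_rate: float, intent: Optional[str] = None) -> str:
    # Return "cloud" (path one) or "local" (path two) for the received voice signal
    if data_rate < RATE_THRESHOLD:
        return "local"      # network disconnected or rate below the threshold: process on the device
    if intent is not None and intent in LOCAL_INTENTS:
        return "local"      # the intention can be completed locally, no network needed
    return "cloud"          # good network quality: upload to the cloud server

print(choose_path(data_rate=1_000_000))                          # cloud
print(choose_path(data_rate=10_000))                             # local
print(choose_path(data_rate=1_000_000, intent="open_gallery"))   # local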
Next, processing of the voice signal on the cloud server 200 and processing of the voice signal on the electronic device 100 are respectively described.
Path one: the voice signal is processed on the cloud server 200.
Step one: the electronic device 100 uploads the voice signal to the cloud server 200. The cloud server 200 receives the voice signal, recognizes it through speech recognition (ASR), and converts it into text representation information; that is, the words in the voice signal are converted into input readable by the cloud server 200, such as binary codes or character sequences.
In some embodiments, the cloud server 200 may compare the feature vector of the input voice signal with each template in the template library in sequence for similarity, output the text data corresponding to the highest similarity as the recognition result, and perform word segmentation on the text data to obtain the text representation information of the voice signal. Optionally, the cloud server 200 may also compute the text representation information corresponding to the voice signal using a trained vocal tract model, a neural network model, or the like.
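For illustration only, the template-matching recognition described above can be sketched as follows, assuming each template is a fixed-length feature vector; a real front end would extract acoustic features (e.g., MFCCs) and use a much larger template library, so all values below are illustrative assumptions.

import numpy as np

# Illustrative template library: recognized text -> feature vector of the template
TEMPLATES = {
    "please help me open the gallery": np.array([0.9, 0.1, 0.3]),
    "i want to make a call":           np.array([0.2, 0.8, 0.5]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(feature_vector: np.ndarray) -> str:
    # Compare the input features with each template in turn and return the best-matching text
    return max(TEMPLATES, key=lambda text: cosine_similarity(feature_vector, TEMPLATES[text]))

print(recognize(np.array([0.25, 0.75, 0.4])))   # "i want to make a call"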
It should be noted that, when the cloud server 200 recognizes the voice signal through the speech recognition technology, some preprocessing operations may also be performed on the voice signal, such as sampling, quantization, removal of speech data that contains no speech content (e.g., silent speech data), framing, and windowing.
Step two: after speech recognition, the cloud server 200 converts the text representation information into machine-understandable semantic information through semantic understanding (NLU).
In some embodiments, the semantic understanding technology can be simply understood as the following steps. First, the cloud server 200 segments the text representation information obtained by speech recognition into a series of units with semantics and grammar; the word "token" is usually used to denote the units obtained by text segmentation. A common text segmentation method is "word segmentation", that is, segmenting the text at the granularity of words. Models used for word segmentation may include first-order Markov models, hidden Markov models, conditional random fields, recurrent neural networks, and so on.
Then, based on the token sequence, a numerical vector or matrix is obtained using text representation models such as a word vector space model or a distributed representation model; this matrix is the numerical representation of the text. Next, based on this numerical representation, "key information" (i.e., semantic information) such as entities, triples, intentions, and events is extracted using classification algorithms, sequence labeling methods, and the like. With this information, the cloud server 200 can understand the user's language and determine what the user wants to do.
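For illustration only, the segmentation and key-information extraction described above can be sketched as follows; a simple keyword rule stands in for the trained classifier, and the intention names are illustrative assumptions.

from typing import Dict, Optional

# Illustrative keyword rules standing in for a trained intention classifier
INTENT_KEYWORDS = {
    "make_call": ["call", "phone"],
    "send_sms":  ["message", "sms"],
    "open_app":  ["open"],
}

def understand(text: str) -> Dict[str, Optional[str]]:
    # Crude word segmentation followed by intention classification over the token sequence
    tokens = text.lower().split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word in tokens for word in keywords):
            return {"intent": intent}
    return {"intent": None}

print(understand("I want to make a call"))    # {'intent': 'make_call'}
print(understand("please open the gallery"))  # {'intent': 'open_app'}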
Step three: the cloud server 200 performs dialog management based on the semantic information. Dialog management refers to the process in which the cloud server 200 determines the next action to be performed based on the semantic information. The actions performed include one or more of the following: playing voice reply content (e.g., providing results, asking for specific restrictions, clarifying or confirming the requirement); displaying the text content of the voice reply content; jumping to a corresponding interface; and so on.
In some embodiments, the cloud server 200 determines the intention expressed in the semantic information, and then fills the slots corresponding to the intention according to the semantic information. The intention is what the user wants to do, the slots corresponding to the intention are the information the user needs to complete the intention, and an intention may correspond to one or more slots. The cloud server 200 fills the slots of the intention based on the semantic information; if the semantic information is insufficient and the information in one or more slots is missing, the next action is determined to be a further query for the missing slots; if no slot information is missing, the user intention is converted into an explicit user instruction, and the electronic device 100 is instructed to execute the corresponding action.
For example, the cloud server 200 acquires semantic information of a voice signal "please help me open a gallery", determines that an intention of a user is to open (open) an object according to the semantic information of the voice signal, and then a slot corresponding to the intention is an open object, and the cloud server 200 performs slot filling according to the semantic information of the voice signal to determine that the open object is the gallery. The dialog management determines an explicit instruction, i.e. an instruction to open the gallery, based on the semantic information.
The above is an example in which the cloud server 200 determines an operation instruction based on the user's voice signal when no slot information is missing. When the semantic information is insufficient and one or more slots are missing, the cloud server 200 needs to store the current intention and slot information and further query for the missing slots. This situation is generally called a multi-turn dialog.
For example, the cloud server 200 acquires semantic information of a voice signal "i want to make a call", determines that an intention of a user is to make a call (call) according to the semantic information of the voice signal, and if a slot corresponding to the intention is an object of making a call, and if the slot is missing due to insufficient semantic information, the session management determines an explicit instruction based on the semantic information, that is, plays a voice reply content, and further queries the missing slot.
Alternatively, the dialog management may determine a further explicit instruction based on the semantic information of the speech signal, i.e. to display the text content of the speech reply.
When the cloud server 200 receives the next voice signal, it repeats step one and step two, and then fills the slots with the semantic information corresponding to that voice signal based on the stored intention and slot information. If the slots are completely filled, that is, no slot information is missing, the user intention is converted into an explicit user instruction, and the electronic device 100 is instructed to execute the corresponding action. For the above example, the intention stored by the cloud server 200 is to make a call and the missing slot is the object of the call; when the cloud server 200 next receives the voice signal "Xiaoming", it fills the slot "object of the call" based on the semantic information corresponding to "Xiaoming", and thereby determines the instruction, that is, making a call to Xiaoming.
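For illustration only, the multi-turn continuation described above (storing the intention and the missing slot, then filling it with the next voice signal) can be sketched as follows; all names are illustrative assumptions.

from typing import Dict, Optional

def fill_next_slot(intent: str, slots: Dict[str, Optional[str]], utterance_value: str) -> dict:
    # Fill the first missing slot of the stored intention with the value from the next utterance
    for name, value in slots.items():
        if value is None:
            slots[name] = utterance_value
            break
    missing = [name for name, value in slots.items() if value is None]
    if missing:
        return {"action": "ask", "missing": missing}
    return {"action": "execute", "intent": intent, "slots": dict(slots)}

# Turn 1: "I want to make a call" -> the callee slot is missing, so the server asks "Who do you want to call?"
stored_slots = {"callee": None}
# Turn 2: the next voice signal is "Xiaoming"
print(fill_next_slot("make_call", stored_slots, "Xiaoming"))
# {'action': 'execute', 'intent': 'make_call', 'slots': {'callee': 'Xiaoming'}}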
Step four: after the cloud server 200 determines the action to be executed next, if the action requires voice interaction with the user, for example outputting voice reply content, the cloud server 200 may generate language text understandable by the user based on the natural language generation technology, and then perform speech synthesis on the generated language text to generate voice data.
In some embodiments, the cloud server 200 determines which information should be included in the language text being constructed, organizes a reasonable text order, and combines multiple pieces of information into one sentence. Some connective words and phrases are then selected to form a complete, well-structured sentence.
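For illustration only, the natural language generation step can be sketched with simple templates; a real system may organize and combine the information with a trained generation model, and the templates below are illustrative assumptions.

# Illustrative templates mapping an intention and its missing slot to a clarification question
QUESTION_TEMPLATES = {
    ("make_call", "callee"):   "Who do you want to call?",
    ("send_sms",  "receiver"): "Who do you want to send the short message to?",
    ("send_sms",  "content"):  "What do you want to send?",
}

def generate_reply(intent: str, missing_slot: str) -> str:
    # Assemble the language text to be handed to speech synthesis
    return QUESTION_TEMPLATES.get((intent, missing_slot), "Could you tell me more?")

print(generate_reply("make_call", "callee"))   # Who do you want to call?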
Step five: the cloud server 200 sends an instruction to the electronic device 100, instructing the electronic device 100 to execute the action.
In some embodiments, based on the above steps four and five, the cloud server 200 sends an instruction carrying voice data (the voice reply content) to the electronic device 100, instructing the electronic device 100 to output the voice reply content.
Alternatively, the cloud server 200 transmits an instruction with text data (text content of the voice reply content) to the electronic device 100, and instructs the electronic device 100 to display the text data.
In some embodiments, step four is optional: if the action to be performed next, as determined by the cloud server 200, does not need to output voice reply content, step four does not need to be performed. Based on step five, the cloud server 200 sends an instruction to the electronic device 100 to instruct the electronic device 100 to perform an interface jump.
And a second route: the speech signal is processed at the electronic device 100.
The electronic device 100 receives the voice signal, recognizes the voice signal through a voice recognition technology, and converts the voice signal into text representation information. The textual representation information is then converted to machine understandable semantic information by semantic understanding techniques. The electronic device 100 then determines the action to perform next based on the semantic information. If the action requires voice interaction with the user, such as outputting a voice reply, the electronic device 100 may generate a language text understandable by the user based on a natural language generation technology, and then perform voice synthesis on the generated language text to generate voice data. The electronic apparatus 100 outputs the voice data. If the action to be executed next determined by the electronic device 100 does not need to output the voice reply content, the electronic device 100 executes the jump of the interface without using the voice synthesis technology.
It should be noted that, based on the same inventive concept, the principle of speech recognition, semantic understanding, dialog management, and speech synthesis in path two of the embodiment shown in fig. 4 is similar to that in path one. Therefore, for the implementation of speech recognition, semantic understanding, dialog management, and speech synthesis by the electronic device 100 in path two, reference may be made to the corresponding description of the cloud server 200 in steps one to five of path one, which is not repeated here.
In summary, the embodiment shown in fig. 4 describes the implementation principle of the voice interaction system in detail. In some cases, the electronic device 100 distributes the voice service based on the network condition: if the network condition is good, the voice signal is uploaded to the cloud server 200 for processing, that is, path one described above; if the network is disconnected or the network quality is poor, processing is performed on the electronic device 100, that is, path two described above. In this case, if the processing is switched during a voice service because of the network, the original voice service cannot continue to be executed, which affects the user experience.
For example, the cloud server 200 receives a first voice signal; if the semantic information of the voice signal is insufficient and one or more slots are missing, the cloud server 200 needs to store the current intention and slot information and further query for the missing slots. If a network interruption occurs before the cloud server 200 receives the next voice signal, the cloud server 200 cannot receive that voice signal; the electronic device 100 then distributes the next voice signal to itself for processing, but the electronic device 100 cannot continue the original voice service based only on the semantic information of the next voice signal, so the original voice service is interrupted and the user experience is affected.
In combination with the voice interaction system 10 of the embodiment of the present application, the embodiment of the present application further provides a voice interaction processing method. The cloud server 200 sends an instruction to the electronic device 100 to instruct the electronic device 100 to execute a corresponding action; when a network interruption during a multi-turn voice service causes end-cloud switching (switching between the electronic device 100 and the cloud server 200, that is, switching between path one and path two), the electronic device 100 can still continue to execute the original voice service based on the context of the voice dialog and the next received voice signal, thereby solving the problem of voice service interruption in multi-turn dialogs.
Referring to fig. 5A, fig. 5A shows a voice interaction process in the voice interaction system 10.
At time T1, the network quality is good, and when the electronic device 100 receives the voice 1, the electronic device 100 starts the voice interaction function. The electronic device 100 performs distribution control on the received voice 1 based on a preset rule, for example, if the network quality is good at this time, the electronic device 100 uploads the received voice 1 to the cloud server 200 for processing, where the processing action includes processes such as voice recognition, semantic understanding, dialog management, and voice synthesis. In the embodiment of the present application, the speech 1 may also be referred to as a first speech signal.
Based on the same inventive concept, the principle of speech recognition, semantic understanding, dialog management, and speech synthesis at time T1 in the embodiment shown in fig. 5A is similar to path one in the embodiment shown in fig. 4. Therefore, for the implementation of speech recognition, semantic understanding, dialog management, and speech synthesis by the cloud server 200 at time T1, reference may be made to the corresponding description of steps one to five of path one in fig. 4, which is not repeated here.
The cloud server 200 determines the action to be executed next and sends an instruction to the electronic device 100 to instruct the electronic device 100 to execute action 1; the cloud server 200 also synchronously sends the voice dialog context to the electronic device 100, where the dialog context refers to the intention and slot information obtained by the cloud server 200 by recognizing and understanding voice 1. Action 1 includes one or more of the following: playing the voice reply content for voice 1 (e.g., providing results, asking for specific restrictions, clarifying or confirming the requirement); displaying the text content of the voice reply content; jumping to a corresponding interface; and so on.
After receiving the instruction and the dialog context, the electronic device 100 forwards them through the dialog information forwarding module; the electronic device 100 executes action 1 based on the instruction and saves the dialog context. The dialog information forwarding module may be regarded as a node that receives the data sent by the cloud server 200 and is used to receive and forward that data.
At time T2, the network quality is poor. After the electronic device 100 outputs the voice reply content for voice 1, it receives voice 2. Because the network quality is poor at this time, data transmission cannot be performed between the electronic device 100 and the cloud server 200, and the electronic device 100 cannot upload voice 2 to the cloud server 200. The electronic device 100 therefore invokes its own voice processing capability to process voice 2; the processing includes speech recognition, semantic understanding, dialog management, speech synthesis, and so on. For the implementation of speech recognition, semantic understanding, dialog management, and speech synthesis by the electronic device 100 at time T2, reference may be made to the corresponding description of the electronic device 100 in path two in fig. 4, which is not repeated here. In the embodiment of the present application, voice 2 may also be referred to as a second voice signal.
It should be noted that, unlike path two, in the dialog management part of the embodiment of the present application the electronic device 100 determines the next action based on voice 2 and the dialog context saved at time T1. The electronic device 100 fills the missing slot based on the semantic information corresponding to voice 2 together with the saved intention and slot information. If the semantic information corresponding to voice 2 is insufficient, the slot information is not completely filled and one or more slots are still missing, the electronic device 100 determines that the next action is a further query for the missing slots; if no slot information is missing, the user intention is converted into an explicit user instruction, and the electronic device 100 is instructed to execute the corresponding action.
Here, time T2 falls within the time period after the electronic device 100 receives the instruction and the dialog context sent by the cloud server 200 for voice 1 and before the electronic device 100 uploads voice 2 to the cloud server 200; for example, it may be before the electronic device 100 receives voice 2, or before the electronic device uploads voice 2 to the cloud server 200. That is, voice 2 cannot be uploaded to the cloud server 200 because the network quality is poor at time T2.
In some embodiments, the time T2 may be after the electronic device 100 uploads the voice 2 to the cloud server 200, and before the cloud server 200 issues the instruction to the electronic device 100. That is, the cloud server 200 cannot issue the instruction generated for the voice 2 to the electronic device 100 due to poor network quality. As shown in fig. 5B, the electronic device 100 uploads the voice 2 to the cloud server 200, and the cloud server 200 processes the voice 2, where the processing action includes processes of voice recognition, semantic understanding, dialog management, voice synthesis, and the like. At this time, a network quality problem occurs, data transmission cannot be realized between the electronic device 100 and the cloud server 200, and the cloud server 200 cannot issue an instruction generated for the voice 2 to the electronic device 100.
Optionally, if the electronic device 100 does not receive, within a preset time after uploading voice 2 to the cloud server 200, the instruction issued by the cloud server 200 for voice 2, the electronic device 100 invokes its own voice processing capability to process voice 2 (for example, a backed-up copy of voice 2).
Optionally, if, after the electronic device 100 uploads voice 2 to the cloud server 200 and before it receives the instruction issued by the cloud server 200 for voice 2, the electronic device 100 detects that the network connection with the cloud server 200 is currently disconnected, the electronic device 100 invokes its own voice processing capability to process voice 2 (for example, a backed-up copy of voice 2).
For the processing procedure, reference may be made to the corresponding description of the speech 2 by the electronic device 100 at the time T2 in fig. 5A, which is not described herein again.
In this way, during the voice interaction, each time the cloud server 200 issues an instruction it synchronously sends the dialog context to the electronic device 100, and the electronic device 100 receives and stores the dialog context. When a network interruption occurs, the electronic device 100 can continue to process, based on the stored dialog context, the voice service that was originally processed on the cloud server 200, so that the voice service is not interrupted, the processing efficiency of the voice service is improved, and the user experience is improved.
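For illustration only, the end-cloud switching described above can be sketched end to end as follows; the cloud server and the on-device capability are replaced by stubs, and all names, fields, and the failure signal are illustrative assumptions.

from typing import Optional

class CloudUnavailable(Exception):
    # Raised when the upload fails or no reply arrives within the preset time
    pass

def cloud_process(voice: str, network_ok: bool) -> dict:
    # Stub for the cloud server: returns the instruction together with the dialog context to be saved
    if not network_ok:
        raise CloudUnavailable()
    return {"instruction": "play: Who do you want to call?",
            "context": {"intent": "make_call", "slots": {"callee": None}}}

def local_process(voice: str, saved_context: Optional[dict]) -> str:
    # Stub for the on-device capability: fills the missing slot from the saved dialog context
    if saved_context and saved_context["slots"].get("callee") is None:
        saved_context["slots"]["callee"] = voice
        return "play: OK, calling " + voice + "; execute: call " + voice
    return "play: Sorry, please say that again."

saved_context: Optional[dict] = None

# Turn 1, good network quality: the instruction and the dialog context arrive together and the context is saved
reply = cloud_process("I want to make a call", network_ok=True)
saved_context = reply["context"]
print(reply["instruction"])

# Turn 2, network interrupted: the device continues the original voice service with the saved context
try:
    cloud_process("Xiaoming", network_ok=False)
except CloudUnavailable:
    print(local_process("Xiaoming", saved_context))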
In some embodiments, each time the cloud server 200 issues the instruction, if at least one slot information of the one or more slot information is missing, the session context, i.e., the intention and the slot information, is synchronously sent to the electronic device 100.
Specifically, referring to fig. 5A, at time T1, in the process of processing voice 1 the cloud server 200 determines, based on the semantic information corresponding to voice 1, the intention expressed by that semantic information and the slot information corresponding to the intention, where one intention may correspond to one or more slots. The slots of the intention are filled based on the semantic information; if all slots are completely filled, that is, no slot information is missing, the user intention is converted into an explicit user instruction, and the cloud server 200 sends the instruction to the electronic device 100 to instruct the electronic device 100 to execute the corresponding action. For example, the cloud server 200 acquires the voice signal "please help me open the gallery", determines according to its semantic information that the intention of the user is to open an object, and the slot corresponding to this intention is the object to be opened; the cloud server 200 performs slot filling according to the semantic information and determines that the object to be opened is the gallery. Dialog management determines an explicit instruction based on the semantic information, that is, an instruction to open the gallery. It can be seen that, since the intention of the user is completed at this time and the cloud server 200 determines that the intention has ended, the cloud server 200 does not need to send the dialog context (intention and slot information) to the electronic device 100, which saves resources.
When the semantic information is insufficient and one or more slots are missing, the cloud server 200 needs to store the current intention and slot information, further query for the missing slots, and wait until the next voice signal is received; slot filling is then performed with the next voice signal in combination with the stored intention and slot information to determine the next action. In the embodiment of the present application, the cloud server 200 generates the voice reply content for voice 1 based on the speech synthesis technology, sends an instruction to the electronic device 100 instructing it to output the voice reply content, and synchronously sends the dialog context (the intention and slot information corresponding to voice 1) to the electronic device 100; the electronic device 100 receives and saves the dialog context. Thus, even if a network interruption occurs when the electronic device 100 receives the next voice signal, the electronic device 100 can process that voice signal by combining its own voice interaction capability with the saved dialog context, which improves the processing efficiency of the voice service and the user experience.
In some embodiments, when the missing slot is two or more, the cloud server 200 sends the electronic device 100 the intention and the slot information synchronously, and the cloud server 200 may mark the slot to indicate the order of slot filling of the electronic device 100. Therefore, when the electronic equipment processes the next voice signal, one of the slots can be accurately filled.
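For illustration only, the slot marking described above can be sketched as follows; the field names are illustrative assumptions.

context = {
    "intent": "send_sms",
    "slots": {"receiver": None, "content": None},
    "next_slot": "receiver",   # mark issued by the cloud server: fill this slot with the next voice signal
}

def fill_marked_slot(ctx: dict, utterance_value: str) -> dict:
    # Fill exactly the slot marked by the cloud server, then advance the mark to the next vacant slot
    ctx["slots"][ctx["next_slot"]] = utterance_value
    remaining = [name for name, value in ctx["slots"].items() if value is None]
    ctx["next_slot"] = remaining[0] if remaining else None
    return ctx

print(fill_marked_slot(context, "Xiaoming"))
# {'intent': 'send_sms', 'slots': {'receiver': 'Xiaoming', 'content': None}, 'next_slot': 'content'}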
Next, taking an application scenario of making a call as an example, a voice interaction processing method implemented in this scenario of making a call in the embodiment of the present application is described in detail.
As shown in fig. 6, at time T1, when the user wants to make a call by voice, a voice assistant Application (APP) may be started to input a voice signal "i want to make a call". The electronic device 100 receives a voice signal "i want to make a call" input by a user through the voice assistant application, and performs distribution control on the received voice signal based on a preset rule, for example, when the network quality is good, the electronic device 100 uploads the received "i want to make a call" to the cloud server 200 for processing.
After receiving the voice signal of "i want to make a call", the cloud server 200 converts the voice signal into text information according to a voice recognition technology (ASR), obtains semantic information according to a semantic understanding technology (NUL), and recognizes that the user intends to make a call. Next, the cloud server 200 determines that the slot information corresponding to the intention of making a call includes an object of making a call, the cloud server 200 performs slot filling according to the semantic information, and the cloud server 200 recognizes that the semantic information of "i want to make a call" does not include the object of making a call, that is, the cloud server 200 determines that the slot (object of making a call) corresponding to the intention (make a call) is empty of information.
The cloud server 200 determines that the action to be executed next is to query for the vacant slot information. The cloud server 200 generates the voice reply content "Who do you want to call?" according to the speech synthesis technology (TTS) and sends an instruction carrying the voice reply content to the electronic device 100, instructing the electronic device 100 to play it. In addition, the cloud server 200 synchronously sends the dialog context to the electronic device 100; the dialog context includes the intention "make a call" and the slot information "object of the call (vacant)". The electronic device 100 receives the instruction and the dialog context sent by the cloud server, plays the voice reply content "Who do you want to call?" based on the instruction, and saves the dialog context.
Optionally, the cloud server 200 may also send an instruction carrying the text data of the voice reply content to the electronic device 100, instructing the electronic device 100 to display the text data ("Who do you want to call?").
After the electronic device 100 plays the voice reply content "Who do you want to call?", the user inputs the voice signal "Call Xiaoming". The network quality is poor at time T2, so the electronic device 100 invokes its own voice processing capability to process the voice signal "Call Xiaoming". The electronic device 100 converts the voice signal into text information according to the speech recognition technology (ASR) and obtains semantic information according to the semantic understanding technology (NLU). Next, the electronic device 100 performs slot filling based on the saved intention "make a call" and slot information "object of the call (vacant)", together with the semantic information corresponding to "Call Xiaoming". The electronic device 100 recognizes that "Xiaoming" in the semantic information of "Call Xiaoming" is the object of the call, that is, the electronic device 100 determines that the slot (object of the call) corresponding to the intention (make a call) is Xiaoming.
The electronic device 100 determines that the next action is to call Xiaoming and to output the voice reply content "Calling Xiaoming". The electronic device 100 generates the voice reply content "Calling Xiaoming" according to the speech synthesis technology (TTS) and plays it. Moreover, the electronic device 100 queries the address book for the contact Xiaoming and invokes the call capability to call Xiaoming. Optionally, the electronic device 100 may also display the text data of the voice reply content ("Calling Xiaoming").
The above describes a voice interaction processing method in a call-making scenario. Taking a smartphone as an example of the electronic device 100, some voice interaction processes are shown below with reference to a specific scenario. During the voice interaction, if the network quality of the electronic device 100 changes from good to poor, the processing of the voice signal switches from the cloud server 200 to the electronic device 100; because the cloud server 200 issues the voice dialog context and it is stored in the electronic device 100, the electronic device 100 can keep the voice service uninterrupted even if a network interruption occurs during the multi-turn dialog. As shown in fig. 7A and 7B, the wakeup word is set to "Xiaoyi".
User: Xiaoyi Xiaoyi, I want to make a call.
Smartphone (electronic device 100): Who do you want to call?
User: Xiaoming.
Smartphone (electronic device 100): OK, calling Xiaoming for you.
With reference to fig. 8A to 8D, the above-mentioned voice dialog is taken as an example, and an implementation form of the voice interaction processing method provided by the embodiment of the present application on the display interface of the smart phone is described below.
As shown in FIG. 8A, FIG. 8A illustrates a voice interaction interface 801, which may be, for example, an interface of a voice assistant application. The voice interaction interface 801 includes a status bar 8011 and a function bar 8012.
The status bar 8011 may include one or more of: a signal strength indicator 8013 of the wireless network signal, a battery status indicator 8014, and a time indicator 8015. The signal strength indicator 8013 indicates the current network quality (and may also indicate the data transmission rate between the electronic device 100 and the cloud server 200); in fig. 8A the signal strength indicator 8013 is full (four bars), indicating that the current network quality is good.
The function bar 8012 may include one or more function controls, such as a voice input control 8016. When the electronic device 100 detects a user operation on the voice input control 8016, the electronic device 100 receives a voice signal. As shown in fig. 8A, the electronic device 100 receives the voice signal "Xiaoyi Xiaoyi, I want to make a call" and displays it on the voice interaction interface 801.
As shown in fig. 8B, the electronic device 100 receives the voice signal "Xiaoyi Xiaoyi, I want to make a call", may upload the voice signal to the cloud server 200 for processing, plays the voice reply content "Who do you want to call?" based on the instruction returned by the cloud server 200, and displays it on the voice interaction interface 802. The voice input control 8016 changes to a voice output control 8026, indicating that the electronic device 100 is currently outputting voice. In the embodiment of the present application, while the cloud server 200 returns the instruction, it synchronously returns the voice dialog context to the electronic device 100, and the electronic device 100 receives and stores the dialog context.
As shown in fig. 8C, the current network quality of the electronic device 100 is poor and only two bars of the signal strength indicator 8033 remain, so the electronic device 100 and the cloud server 200 cannot perform data transmission, or the data transmission rate is too low. When the electronic device 100 receives the voice signal "Xiaoming", the electronic device 100 cannot upload it to the cloud server 200 for processing, or the cloud server 200 cannot issue an instruction to the electronic device 100. At this time, the electronic device 100 may continue to process the voice signal "Xiaoming" based on the saved dialog context, play the voice reply content "OK, calling Xiaoming for you", and display it on the voice interaction interface 803. The electronic device 100 also performs the action of making the call and jumps to a call interface; as shown in fig. 8D, the call interface 804 indicates that the electronic device 100 is currently calling Xiaoming.
The above is an application scenario in which the voice service is a multi-turn dialog (specifically, a two-turn dialog). During the dialog, the network quality of the electronic device 100 changes from good to poor, and the processing of the voice signal switches from the cloud server 200 to the electronic device 100. Because the cloud server 200 issues the dialog context of the voice and it is stored in the electronic device 100, the electronic device 100 can keep the voice service uninterrupted even if a network interruption occurs during the multi-turn dialog, and the processing efficiency of the voice service is improved.
Next, the embodiment of the present application provides an application scenario of a three-turn dialog. Taking the scenario of sending a short message as an example, the voice interaction processing method implemented in this scenario is briefly described.
When the network quality is good, the electronic device 100 receives a voice signal "i want to send a short message" input by a user, and performs distribution control on the received voice signal based on a preset rule, for example, when the network quality is good at this time, the electronic device 100 uploads the received "i want to send a short message" to the cloud server 200 for processing.
The cloud server 200 recognizes that the user's intention is to send a short message. Next, the cloud server 200 determines that the slot information corresponding to this intention includes the object to receive the short message and the content of the short message. The cloud server 200 performs slot filling according to the semantic information and recognizes that the semantic information of "I want to send a short message" includes neither the object to receive the short message nor the content of the short message, that is, the cloud server 200 determines that the slots (object of the short message and content of the short message) corresponding to the intention (send a short message) are vacant.
The cloud server 200 determines that the next action to be executed is to query for the vacant slot information. Because two pieces of slot information are vacant, the cloud server 200 may query for one of them according to priority, for example first querying for the object to receive the short message. The cloud server 200 generates the voice reply content "Who do you want to send the short message to?" according to the speech synthesis technology (TTS) and sends an instruction carrying the voice reply content to the electronic device 100, instructing the electronic device 100 to play it. In addition, the cloud server 200 synchronously sends the dialog context to the electronic device 100; the dialog context includes the intention "send a short message" and the slot information "object of the short message (vacant), content of the short message (vacant)". The electronic device 100 receives the instruction and the dialog context sent by the cloud server, plays the voice reply content "Who do you want to send the short message to?" based on the instruction, and saves the dialog context.
Next, after the electronic device 100 plays the voice reply content "Who do you want to send the short message to?", the user inputs the voice signal "To Xiaoming". If the network quality is good at this time, the electronic device 100 uploads the received "To Xiaoming" to the cloud server 200 for processing. The cloud server 200 performs slot filling based on the stored intention "send a short message" and slot information "object of the short message (vacant), content of the short message (vacant)", together with the semantic information corresponding to "To Xiaoming". The cloud server 200 recognizes that "Xiaoming" in the semantic information of "To Xiaoming" is the object to receive the short message, that is, the cloud server 200 determines that the slot (object of the short message) corresponding to the intention (send a short message) is "Xiaoming".
At this time, the slot information "content of the short message" is still vacant, so the cloud server 200 stores the current intention and slot information and determines that the next action is to query again for the vacant slot information (the content of the short message). The cloud server 200 generates the voice reply content "What do you want to send?" according to the speech synthesis technology (TTS) and sends an instruction carrying the voice reply content to the electronic device 100, instructing the electronic device 100 to play it. The cloud server 200 synchronously sends the dialog context to the electronic device 100; at this time, the dialog context includes the intention (send a short message) and the slot information "object of the short message (Xiaoming), content of the short message (vacant)". The electronic device 100 receives the instruction and the dialog context sent by the cloud server, plays the voice reply content "What do you want to send?" based on the instruction, and saves the dialog context.
In some embodiments, after the electronic device 100 plays the voice reply content "Who do you want to send the short message to?", the user inputs the voice signal "To Xiaoming". If the network quality is poor at this time, the electronic device 100 invokes its own voice processing capability to process the voice signal "To Xiaoming". The electronic device 100 converts the voice signal into text information according to the speech recognition technology (ASR) and obtains semantic information according to the semantic understanding technology (NLU). Next, the electronic device 100 performs slot filling based on the saved intention "send a short message" and slot information "object of the short message (vacant), content of the short message (vacant)", together with the semantic information corresponding to "To Xiaoming". The electronic device 100 recognizes that "Xiaoming" in the semantic information of "To Xiaoming" is the object to receive the short message, that is, the electronic device 100 determines that the slot (object of the short message) corresponding to the intention (send a short message) is "Xiaoming".
In some embodiments, when the missing slot is two or more, the cloud server 200 sends the electronic device 100 the intention and the slot information synchronously, and the cloud server 200 may mark the slot to indicate the order of slot filling of the electronic device 100. Therefore, when the electronic equipment processes the next voice signal, one of the slots can be accurately filled.
That is, for the above example, when the cloud server 200 synchronously sends the intention and slot information to the electronic device 100, since two slots are vacant, the cloud server 200 may mark the slot to determine which slot is to be filled next. Then, when the electronic device 100 fills the slot, the filling may be performed directly without determining which slot the semantic information corresponds to. That is, the electronic device 100 can directly determine that the slot (object of the short message) corresponding to the intention (send a short message) is "Xiaoming".
At this time, the slot information "content of the short message" is still vacant, so the electronic device 100 stores the current intention and slot information and determines that the next action is to query again for the vacant slot information (the content of the short message). The electronic device 100 generates the voice reply content "What do you want to send?" according to the speech synthesis technology (TTS) and plays it.
The electronic device 100 processes the next received voice signal in the same way and performs slot filling until the slot information is completely filled and an instruction for executing the intention is generated, whereupon the electronic device 100 determines that the intention has been executed.
The present application provides a voice interaction processing method, as shown in fig. 9, the method includes:
the electronic device 100 establishes a connection with the cloud server 200. Step S101: the electronic device 100 receives a first voice signal.
The first voice signal may be, for example, the voice 1 in fig. 5A or fig. 5B, or the voice "i want to make a call" in fig. 6.
Step S102: the electronic device 100 uploads the first voice signal to the cloud server 200.
Step S103: the cloud server 200 recognizes the first voice signal, obtains a corresponding intention and one or more slot position information corresponding to the intention, and determines the first voice reply content based on the intention and the one or more slot position information.
Step S104: the cloud server 200 transmits the first voice reply content, the intent, and the one or more slot information to the electronic device 100.
Step S105: the electronic device 100 outputs the first voice reply content and saves the intent and the one or more slot position information.
The first voice reply content may be, for example, the voice reply content included in action 1 in fig. 5A, or the voice "Who do you want to call?" in fig. 6.
The electronic device 100 and the cloud server 200 have poor communication quality.
Step S106: the electronic device 100 receives the second voice signal.
The second voice signal may be, for example, voice 2 in fig. 5A or fig. 5B, or the voice "Call Xiaoming" in fig. 6.
Step S107: the electronic device 100 recognizes the second voice signal to obtain corresponding semantic information, and determines the first operation based on the intent and the one or more slot position information and the semantic information.
Step S108: a first operation is performed.
The first operation may be, for example, action 2 in fig. 5A or fig. 5B, or one or more of the following three actions: playing the voice content "OK, calling Xiaoming", displaying the corresponding text content, and performing the action of calling Xiaoming.
In some embodiments, poor communication quality between the electronic device 100 and the cloud server 200 may occur at any time period between step S106 and step S107.
In one possible implementation, the electronic device 100 recognizes the second voice signal to obtain corresponding semantic information and determines the first operation based on the intention, the one or more pieces of slot information, and the semantic information, including: the electronic device 100 recognizes that the semantic information matches one missing slot among the one or more pieces of slot information and fills the value of that slot with the semantic information; the electronic device then determines the first operation based on the intention and the one or more filled pieces of slot information. This describes the process in which the electronic device continues the original voice service based on the second voice signal: because the electronic device has obtained the intention and slot information corresponding to the first voice signal, it can continue to perform slot filling on the received second voice signal based on that intention and slot information, and thereby continue to process the original voice service.
In one possible implementation, the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface. The second voice reply content may be, for example, the voice reply content included in action 2 in fig. 5A or fig. 5B, or the voice "Calling Xiaoming" in fig. 6.
In one possible implementation, the method further includes: the method comprises the steps that electronic equipment receives a first instruction sent by a cloud server; and the electronic equipment displays the text content of the first voice reply content based on the first instruction and/or jumps to a corresponding interface.
In one possible implementation, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the second voice signal to the cloud server, no reply data from the cloud server is received within a preset time. This describes the timing at which poor communication quality may occur: the communication quality between the electronic device and the cloud server may be poor when the electronic device uploads the second voice signal, or when the cloud server issues the voice reply content for the second voice signal.
In one possible implementation, the electronic device receiving a first voice signal includes: the electronic device receives the first voice signal through a voice assistant application.
An embodiment of this application further provides a computer-readable storage medium. The methods described in the foregoing method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on a computer-readable medium, or transmitted over a computer-readable medium, as one or more instructions or code. Computer-readable media may include computer storage media and communication media, and include any medium that can transfer a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
An embodiment of this application further provides a computer program product. The methods described in the foregoing method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the methods may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the foregoing method embodiments are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network appliance, a user device, or another programmable apparatus.
In the foregoing embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may be in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive).
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and when the program is executed, the processes of the foregoing method embodiments may be included. The foregoing storage medium includes any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Claims (17)

1. A method for processing voice interaction, the method comprising:
an electronic device receives a first voice signal;
the electronic device uploads the first voice signal to a cloud server under the condition that the electronic device has established a connection with the cloud server;
the electronic device receives first voice reply content, an intention, and one or more slot position information corresponding to the intention, wherein the intention and the one or more slot position information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intention and the one or more slot position information;
after the electronic device outputs the first voice reply content, the electronic device receives a second voice signal;
under the condition that the communication quality between the electronic device and the cloud server is poor, the electronic device recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation based on the intention, the one or more slot position information, and the semantic information;
the electronic device performs the first operation.
2. The method of claim 1, wherein the electronic device determining a first operation based on the intention, the one or more slot position information, and the semantic information comprises:
the electronic device identifies that the semantic information matches a missing slot position in the one or more slot position information, and fills the semantic information into a value of the slot position; and
the electronic device determines the first operation based on the intention and the filled one or more slot position information.
3. The method of claim 1 or 2, wherein the first operation comprises one or more of:
playing the second voice reply content;
displaying the text content of the second voice reply content;
jumping to the corresponding interface.
4. The method according to any one of claims 1-3, further comprising:
the electronic device receives a first instruction sent by the cloud server; and
the electronic device displays the text content of the first voice reply content and/or jumps to a corresponding interface based on the first instruction.
5. The method according to any one of claims 1-4, wherein the poor communication quality between the electronic device and the cloud server comprises:
the electronic device fails to upload the second voice signal to the cloud server; or
after the electronic device uploads the first voice signal to the cloud server, reply data of the cloud server is not received within a preset time.
6. The method of any of claims 1-5, wherein the electronic device receiving a first voice signal comprises:
the electronic device receives the first voice signal through a voice assistant application.
7. A method for processing voice interaction, the method comprising:
a cloud server receives a first voice signal uploaded by an electronic device;
the cloud server recognizes the first voice signal to obtain a corresponding intention and one or more slot position information corresponding to the intention, and determines first voice reply content based on the intention and the one or more slot position information; and
the cloud server sends the first voice reply content, the intention, and the one or more slot position information to the electronic device.
8. The method of claim 7, wherein the cloud server sending the first voice reply content, the intention, and the one or more slot position information to the electronic device comprises:
the cloud server sends the first voice reply content, the intention, and the one or more slot position information to the electronic device under the condition that at least one of the one or more slot position information is missing.
9. An electronic device, comprising one or more processors and one or more memories, wherein the one or more memories are respectively coupled to the one or more processors, and the one or more memories are configured to store computer program code comprising computer instructions; and when the computer instructions are executed on the one or more processors, the electronic device is caused to perform:
receiving a first voice signal;
uploading the first voice signal to a cloud server under the condition that connection with the cloud server is established;
receiving first voice reply content, an intention, and one or more slot position information corresponding to the intention that are sent by the cloud server, wherein the intention and the one or more slot position information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intention and the one or more slot position information;
after the first voice reply content is output, receiving a second voice signal;
under the condition that the communication quality with the cloud server is poor, recognizing the second voice signal to obtain corresponding semantic information, and determining a first operation based on the intention, the one or more slot position information, and the semantic information;
the first operation is performed.
10. The electronic device of claim 9, wherein the determining a first operation based on the intention, the one or more slot position information, and the semantic information comprises:
identifying that the semantic information matches a missing slot position in the one or more slot position information, and filling the semantic information into a value of the slot position; and
determining the first operation based on the intention and the filled one or more slot position information.
11. The electronic device of claim 9 or 10, wherein the first operation comprises one or more of:
playing the second voice reply content;
displaying the text content of the second voice reply content;
jumping to the corresponding interface.
12. The electronic device of any of claims 9-11, wherein the electronic device further performs:
receiving a first instruction sent by the cloud server;
displaying the text content of the first voice reply content and/or jumping to a corresponding interface based on the first instruction.
13. The electronic device according to any one of claims 9-12, wherein the poor communication quality with the cloud server comprises:
the uploading of the second voice signal to the cloud server fails; or
after the first voice signal is uploaded to the cloud server, reply data of the cloud server is not received within a preset time.
14. The electronic device of any of claims 9-13, wherein the receiving a first voice signal comprises:
receiving the first voice signal through a voice assistant application.
15. A cloud server, comprising one or more processors and one or more memories, wherein the one or more memories are respectively coupled to the one or more processors, and the one or more memories are configured to store computer program code comprising computer instructions; and when the computer instructions are run on the one or more processors, the cloud server is caused to perform:
receiving a first voice signal uploaded by electronic equipment;
recognizing the first voice signal to obtain a corresponding intention and one or more slot position information corresponding to the intention, and determining first voice reply content based on the intention and the one or more slot position information; and
sending the first voice reply content, the intention, and the one or more slot position information to the electronic device.
16. The cloud server of claim 15, wherein the sending the first voice reply content, the intention, and the one or more slot position information to the electronic device comprises:
sending the first voice reply content, the intention, and the one or more slot position information to the electronic device under the condition that at least one of the one or more slot position information is missing.
17. A computer-readable medium storing one or more programs, wherein the one or more programs are configured to be executed by one or more processors, and the one or more programs comprise instructions for performing the method of any one of claims 1-8.
CN202011636583.9A 2020-12-31 2020-12-31 Voice interaction processing method and related device Pending CN114694646A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011636583.9A CN114694646A (en) 2020-12-31 2020-12-31 Voice interaction processing method and related device
PCT/CN2021/139631 WO2022143258A1 (en) 2020-12-31 2021-12-20 Voice interaction processing method and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011636583.9A CN114694646A (en) 2020-12-31 2020-12-31 Voice interaction processing method and related device

Publications (1)

Publication Number Publication Date
CN114694646A true CN114694646A (en) 2022-07-01

Family

ID=82134513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011636583.9A Pending CN114694646A (en) 2020-12-31 2020-12-31 Voice interaction processing method and related device

Country Status (2)

Country Link
CN (1) CN114694646A (en)
WO (1) WO2022143258A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662555A (en) * 2023-07-28 2023-08-29 成都赛力斯科技有限公司 Request text processing method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115910035B (en) * 2023-03-01 2023-06-30 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180058476A (en) * 2016-11-24 2018-06-01 삼성전자주식회사 A method for processing various input, an electronic device and a server thereof
CN107886948A (en) * 2017-11-16 2018-04-06 百度在线网络技术(北京)有限公司 Voice interactive method and device, terminal, server and readable storage medium storing program for executing
JP2019179116A (en) * 2018-03-30 2019-10-17 富士通株式会社 Speech understanding program, speech understanding device and speech understanding method
CN110444206A (en) * 2019-07-31 2019-11-12 北京百度网讯科技有限公司 Voice interactive method and device, computer equipment and readable medium
CN111104495B (en) * 2019-11-19 2023-07-28 深圳追一科技有限公司 Information interaction method, device, equipment and storage medium based on intention recognition
CN111144128B (en) * 2019-12-26 2023-07-25 北京百度网讯科技有限公司 Semantic analysis method and device
CN111341311A (en) * 2020-02-21 2020-06-26 深圳前海微众银行股份有限公司 Voice conversation method and device
CN111477225B (en) * 2020-03-26 2021-04-30 北京声智科技有限公司 Voice control method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662555A (en) * 2023-07-28 2023-08-29 成都赛力斯科技有限公司 Request text processing method and device, electronic equipment and storage medium
CN116662555B (en) * 2023-07-28 2023-10-20 成都赛力斯科技有限公司 Request text processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022143258A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
RU2766255C1 (en) Voice control method and electronic device
CN110910872B (en) Voice interaction method and device
CN113704014B (en) Log acquisition system, method, electronic device and storage medium
CN111046680B (en) Translation method and electronic equipment
CN110138959B (en) Method for displaying prompt of human-computer interaction instruction and electronic equipment
CN110825469A (en) Voice assistant display method and device
CN111628916B (en) Method for cooperation of intelligent sound box and electronic equipment
CN114173000B (en) Method, electronic equipment and system for replying message and storage medium
WO2022143258A1 (en) Voice interaction processing method and related apparatus
CN114995715B (en) Control method of floating ball and related device
CN115333941A (en) Method for acquiring application running condition and related equipment
CN116208704A (en) Sound processing method and device
CN114756785A (en) Page display method and device, electronic equipment and readable storage medium
CN113380240B (en) Voice interaction method and electronic equipment
CN114489471B (en) Input and output processing method and electronic equipment
WO2022033355A1 (en) Mail processing method and electronic device
CN114664306A (en) Method, electronic equipment and system for editing text
CN113470638B (en) Method for slot filling, chip, electronic device and readable storage medium
CN116828102B (en) Recording method, recording device and storage medium
CN115359156B (en) Audio playing method, device, equipment and storage medium
CN117972134A (en) Tone color recommendation method, electronic device, and computer storage medium
CN117689776A (en) Audio playing method, electronic equipment and storage medium
CN114520887A (en) Video call background switching method and first terminal device
CN118072723A (en) Collaborative wake-up method and device and electronic equipment
CN112416984A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination