WO2022143258A1

WO2022143258A1 - Voice interaction processing method and related apparatus

Info

Publication number: WO2022143258A1
Application number: PCT/CN2021/139631
Authority: WO
Inventors: 黄龙; 王翃宇; 李勇
Original assignee: 华为技术有限公司
Priority date: 2020-12-31
Filing date: 2021-12-20
Publication date: 2022-07-07
Also published as: CN114694646A

Abstract

A voice interaction processing method, an electronic device, a cloud server, and a computer-readable medium, relating to a natural language processing technology in the field of artificial intelligence, especially a multi-round dialogue processing technology. The method comprises: an electronic device (100) receives a first voice signal (S101); when the electronic device (100) establishes a connection with a cloud server (200), the electronic device (100) uploads the first voice signal to the cloud server (200) (S102); the cloud server (200) recognizes the first voice signal to obtain a corresponding intent and one or more slot information corresponding to the intent, and determines first voice response content on the basis of the intent and one or more slot information (S103); the cloud server (200) sends the first voice response content, intent, and one or more slot information to the electronic device (100) (S104); the electronic device (100) outputs the first voice response content (S105), and then receives a second voice signal (S106); in the case of poor quality of communication between the electronic device (100) and the cloud server (200), the electronic device (100) recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation on the basis of the intent, one or more slot information, and semantic information (S107); and the electronic device (100) performs the first operation (S108), such that when processing of a voice service is switched from the cloud server to the electronic device due to a network connection failure during the voice service of a multi-round dialogue, the electronic device can also continue to execute the original voice service on the basis of the context of the voice dialogue and a next voice signal received, thereby solving the problem of interruption of voice services of multi-round dialogues.

Description

Voice interaction processing method and related apparatus

This application claims priority to Chinese patent application No. 202011636583.9, entitled "A Voice Interaction Processing Method and Related Apparatus" and filed with the China Patent Office on December 31, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a voice interaction processing method and related devices.

Background

With the gradual development of voice interaction technology, more and more smart devices have a voice interaction function. Voice interaction means that the user obtains a voice or text response by inputting voice or text. For example, the user says "How is the weather today", and the smart device replies by voice "The weather is fine, 25 degrees to 29 degrees".

Current voice interaction systems require network support. In some cases (such as a network interruption), the voice interaction service in progress cannot continue to be executed, which degrades the user experience.

Summary

Embodiments of the present application provide a voice interaction processing method and a related apparatus, so as to solve the problem of voice service interruption in multi-round dialogues and improve voice service processing capability.

In a first aspect, the present application provides a voice interaction processing method applied to an electronic device, including: the electronic device receives an input first voice signal; when the electronic device has established a connection with a cloud server, the electronic device uploads the first voice signal to the cloud server; the electronic device receives first voice reply content, an intent, and one or more pieces of slot information corresponding to the intent sent by the cloud server, where the intent and the one or more pieces of slot information are obtained by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more pieces of slot information; after outputting the first voice reply content, the electronic device receives a second voice signal; when the communication quality between the electronic device and the cloud server is poor, the electronic device recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation based on the intent, the one or more pieces of slot information, and the semantic information; and the electronic device performs the first operation.

In this embodiment of the present application, a voice service can be processed either by the cloud server or by the electronic device. When the cloud server processes the voice data, it sends a corresponding instruction to the electronic device, instructing the electronic device to perform a corresponding action, and at the same time sends the context of the voice dialogue (the intent and the slot information) to the electronic device. In this way, if the network is interrupted during a multi-round dialogue and processing of the voice service switches from the cloud server to the electronic device, the electronic device can continue to execute the original voice service based on the context of the voice dialogue and the next voice signal it receives. This solves the problem of voice service interruption in multi-round dialogues and improves voice service processing capability.

In a possible implementation manner, the electronic device determining the first operation based on the intent, the one or more pieces of slot information, and the semantic information includes: the electronic device identifies that the semantic information matches a missing slot among the one or more pieces of slot information, and fills in the semantic information as the value of that slot; the electronic device then determines the first operation based on the intent and the filled one or more pieces of slot information. This describes in detail how the electronic device continues the original voice service based on the second voice signal. Because the electronic device has already obtained the intent and slot information corresponding to the first voice signal, it can perform slot filling on the subsequently received second voice signal based on that intent and slot information, and thus continue processing the original voice service.
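The slot-filling step described above can be illustrated with a minimal sketch. It is not taken from the application: the `DialogueContext` type, the function names, the string form of the returned operation, and the convention that a missing slot is represented by `None` are all assumptions made for illustration. The semantic information recognized from the second voice signal is modeled as a (type, value) pair.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueContext:
    """Context delivered by the cloud server after the first round (hypothetical shape)."""
    intent: str
    # Slot name -> value; None marks a slot that is still missing (assumed convention).
    slots: Dict[str, Optional[str]] = field(default_factory=dict)


def fill_slot(ctx: DialogueContext, semantic_type: str, semantic_value: str) -> bool:
    """Fill the missing slot that matches the semantic information, if there is one."""
    if semantic_type in ctx.slots and ctx.slots[semantic_type] is None:
        ctx.slots[semantic_type] = semantic_value
        return True
    return False


def determine_operation(ctx: DialogueContext) -> str:
    """Decide the next operation from the intent and the (possibly filled) slots."""
    missing = [name for name, value in ctx.slots.items() if value is None]
    if missing:
        # A slot is still missing, so another dialogue round is needed.
        return f"ask:{missing[0]}"
    args = ",".join(f"{k}={v}" for k, v in sorted(ctx.slots.items()))
    return f"execute:{ctx.intent}({args})"
```

For example, with a hypothetical `make_call` intent whose `contact` slot is missing, filling `contact` from the second utterance changes the result of `determine_operation` from asking for the contact to executing the call.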

In a possible implementation manner, the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.

In a possible implementation manner, the method further includes: the electronic device receives a first instruction sent by the cloud server; based on the first instruction, the electronic device displays the text content of the first voice reply content and/or jumps to a corresponding interface.

In a possible implementation manner, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the second voice signal to the cloud server, no reply data is received from the cloud server within a set time. This describes when the poor communication quality may occur: either when the electronic device uploads the second voice signal, or when the cloud server delivers the voice reply content for the second voice signal.
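The two conditions above (upload failure, or no reply within a set time) can be sketched as a fallback wrapper. This is an illustrative sketch only: the function name `cloud_or_local`, the two callables, the default timeout, and the use of `ConnectionError` to signal a failed upload are assumptions, not details from the application.

```python
import concurrent.futures
from typing import Callable


def cloud_or_local(signal: bytes,
                   upload: Callable[[bytes], str],
                   recognize_locally: Callable[[bytes], str],
                   timeout_s: float = 2.0) -> str:
    """Try the cloud first; fall back to on-device processing on poor communication.

    `upload` is assumed to send the voice signal and return the cloud reply,
    raising ConnectionError if the upload fails.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(upload, signal)
    try:
        # Case 1: a reply arrives in time and is used as-is.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Case 2: uploaded, but no reply data within the set time.
        return recognize_locally(signal)
    except ConnectionError:
        # Case 3: the upload itself failed.
        return recognize_locally(signal)
    finally:
        # Do not block on a request that may still be pending.
        executor.shutdown(wait=False)
```

In a real device the local branch would use the intent and slot information received in the previous round; here it is reduced to a single callable to keep the sketch short.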

In a possible implementation manner, the electronic device receiving the first voice signal includes: the electronic device receives the first voice signal through a voice assistant application.

In a second aspect, the present application provides a voice interaction processing method applied to a cloud server, including: the cloud server receives a first voice signal uploaded by an electronic device; the cloud server recognizes the first voice signal to obtain a corresponding intent and one or more pieces of slot information corresponding to the intent, and determines first voice reply content based on the intent and the one or more pieces of slot information; and the cloud server sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.

In this embodiment of the present application, when the cloud server sends an instruction to the electronic device instructing it to perform a corresponding action, the cloud server also sends the context of the voice dialogue (the intent and the slot information) to the electronic device. In this way, if the network is interrupted during a multi-round dialogue and processing of the voice service switches from the cloud server to the electronic device, the electronic device can continue to execute the original voice service based on the context of the voice dialogue and the next voice signal it receives. This solves the problem of voice service interruption in multi-round dialogues and improves voice service processing capability.

In a possible implementation manner, the cloud server sending the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device includes: when at least one of the one or more pieces of slot information is missing, the cloud server sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device. This describes a condition under which the cloud server sends the intent and slot information: when slot information is missing, the current voice service is determined to be a multi-round dialogue service, and the cloud server sends the intent and slot information to the electronic device; when no slot information is missing, the voice service can be completed in a single round and no further voice signal is needed. Deciding whether to deliver the intent and slot information based on whether slot information is missing saves resources.
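The server-side decision described above can be sketched as follows. The payload layout, the key names `reply` and `context`, and the use of `None` for a missing slot are hypothetical choices for illustration; the application does not specify a message format.

```python
from typing import Any, Dict, Optional


def build_reply(reply_text: str,
                intent: str,
                slots: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Build the message the cloud server sends back to the electronic device.

    The dialogue context (intent and slot information) is attached only when at
    least one slot is missing, i.e. when another dialogue round is expected.
    """
    payload: Dict[str, Any] = {"reply": reply_text}
    if any(value is None for value in slots.values()):
        # Multi-round dialogue: deliver the context so the device can continue alone.
        payload["context"] = {"intent": intent, "slots": dict(slots)}
    return payload
```

A single-round request such as a weather query with all slots filled therefore produces a payload without the `context` key, which is where the resource saving mentioned above would come from.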

In a third aspect, the present application provides a voice interaction processing system, which includes an electronic device and a cloud server, wherein:

The electronic device is configured to receive a first voice signal;

The electronic device is further configured to upload the first voice signal to the cloud server when the electronic device establishes a connection with the cloud server;

The cloud server is configured to recognize the first voice signal to obtain a corresponding intent and one or more pieces of slot information corresponding to the intent, and to determine first voice reply content based on the intent and the one or more pieces of slot information;

The cloud server is further configured to send the first voice reply content, intent and one or more slot information to the electronic device;

The electronic device is further configured to receive the second voice signal after outputting the content of the first voice reply;

The electronic device is further configured to, when the communication quality between the electronic device and the cloud server is poor, recognize the second voice signal to obtain corresponding semantic information, and determine a first operation based on the intent, the one or more pieces of slot information, and the semantic information;

The electronic device is further configured to perform the first operation.

In a possible implementation manner, the electronic device is further configured to identify that the semantic information matches a missing slot among the one or more pieces of slot information, and to fill in the semantic information as the value of that slot; the electronic device is further configured to determine the first operation based on the intent and the filled one or more pieces of slot information. This describes in detail how the electronic device continues the original voice service based on the second voice signal. Because the electronic device has already obtained the intent and slot information corresponding to the first voice signal, it can perform slot filling on the subsequently received second voice signal based on that intent and slot information, and thus continue processing the original voice service.

In a possible implementation manner, the electronic device is further configured to receive a first instruction sent by the cloud server; the electronic device is further configured to, based on the first instruction, display the text content of the first voice reply content and/or jump to a corresponding interface.

In a possible implementation manner, the electronic device is further configured to receive the first voice signal through a voice assistant application.

In a possible implementation manner, the cloud server is further configured to send the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device when at least one of the one or more pieces of slot information is missing. This describes a condition under which the cloud server sends the intent and slot information: when slot information is missing, the current voice service is determined to be a multi-round dialogue service, and the cloud server sends the intent and slot information to the electronic device; when no slot information is missing, the voice service can be completed in a single round and no further voice signal is needed. Deciding whether to deliver the intent and slot information based on whether slot information is missing saves resources.

In a fourth aspect, the present application provides an electronic device, comprising: one or more processors and one or more memories; the one or more memories are coupled with the one or more processors; the one or more memories are used for storing computer program code, the computer program code including computer instructions; when the computer instructions are executed on the processor, the electronic device is caused to execute the voice interaction processing method in any possible implementation manner of the first aspect.

In a fifth aspect, the present application provides a cloud server, comprising: one or more processors and one or more memories; the one or more memories are coupled with the one or more processors; the one or more memories are used for storing computer program code, the computer program code including computer instructions; when the computer instructions are executed on the processor, the cloud server is caused to execute the voice interaction processing method in any possible implementation manner of the second aspect.

In a sixth aspect, an embodiment of the present application provides a computer storage medium including computer instructions which, when run on an electronic device, cause the electronic device to perform the voice interaction processing method in any possible implementation manner of any of the above aspects.

In a seventh aspect, an embodiment of the present application provides a computer program product that, when the computer program product runs on a computer, enables the computer to execute the voice interaction processing method in any possible implementation manner of any one of the foregoing aspects.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a software structure of an electronic device provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of the principle of a voice interaction processing method provided by an embodiment of the present application;

FIGS. 5A-5B are schematic diagrams of still another voice interaction processing method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of the principle of a phone call scenario provided by an embodiment of the present application;

FIGS. 7A-7B are schematic diagrams of scenarios of a voice interaction processing method provided by an embodiment;

FIGS. 8A to 8D are schematic diagrams of a group of application interfaces provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of a voice interaction processing method provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B can mean A or B. "And/or" in the text merely describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B can indicate that A exists alone, that A and B exist at the same time, or that B exists alone. In addition, in the description of the embodiments of this application, "plurality" means two or more.

Hereinafter, the terms "first" and "second" are used only for descriptive purposes, and should not be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more. The orientation or positional relationship indicated by the terms "middle", "left", "right", "upper", "lower", etc. is based on the orientation or positional relationship shown in the accompanying drawings, and is used only for the convenience of describing the present application and simplifying the description. It does not indicate or imply that the referred device or element must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be construed as a limitation on the application.

In this embodiment of the present application, FIG. 1 shows a schematic diagram of a scenario of a voice interaction system 10 according to an embodiment of the present invention. As shown in FIG. 1 , the system 10 includes an electronic device 100 and a cloud server 200 . It should be noted that the system 10 shown in FIG. 1 is only an example, and those skilled in the art can understand that in practical applications, the system 10 usually includes a plurality of electronic devices 100 and a cloud server 200 . The numbers of electronic devices 100 and cloud servers 200 are not limited.

The electronic device 100 is a smart device with a voice interaction function. The electronic device 100 can receive a voice instruction issued by a user and return voice or non-voice information to the user. In the embodiments of the present application, the electronic device 100 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), a virtual reality device, a portable Internet device, a data storage device, a camera, a wearable device (e.g., a wireless headset, smart watch, smart bracelet, smart glasses, head-mounted display (HMD), electronic clothing, electronic bracelet, electronic necklace, electronic accessory, electronic tattoo, or smart mirror), a smart home device (e.g., a smart speaker, smart refrigerator, smart desk lamp, electric light, smart TV, smart microwave oven, smart fan, air conditioner, smart robot, or smart curtain), and so on. One application scenario involved in the embodiments of this application is a home scenario: the electronic device 100 is placed in the user's home, and the user can send voice instructions to the electronic device 100 to implement certain functions, such as surfing the Internet, playing songs on demand, shopping, checking the weather forecast, and controlling other smart home devices in the home.

The cloud server 200, which may be, for example, a cloud server physically located at one or more locations, communicates with the electronic device 100 through a network. The cloud server 200 provides a recognition service for the voice data received on the electronic device 100, so as to obtain a text representation of the voice data input by the user; the cloud server 200 also obtains a representation of the user's intent based on the text representation, and generates a response instruction, which is returned to the electronic device 100. The electronic device 100 performs corresponding actions according to the response instruction to provide the user with corresponding services, such as setting an alarm clock, making a phone call, sending an email, broadcasting information, or playing a song or video. The electronic device 100 may also output a corresponding voice response to the user according to the response instruction, or display corresponding text content, which is not limited in this embodiment of the present application.

The following first introduces the electronic device 100 involved in the embodiments of the present application.

Referring to FIG. 2, FIG. 2 shows a schematic structural diagram of an exemplary electronic device 100 provided by an embodiment of the present application.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.

It can be understood that the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.

The controller may be the nerve center and command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and a timing signal, and thereby control instruction fetching and execution.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.

In some embodiments, the processor 110 may include one or more interfaces. The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.

The I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 can be respectively coupled to the touch sensor 180K, the charger, the flash, the camera 193 and the like through different I2C bus interfaces. For example, the processor 110 may couple the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate with each other through the I2C bus interface, so as to realize the touch function of the electronic device 100 .

The I2S interface can be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 . In some embodiments, the audio module 170 can transmit audio signals to the wireless communication module 160 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.

The PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 can also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.

The UART interface is a universal serial data bus used for asynchronous communication. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160 . For example, the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function. In some embodiments, the audio module 170 can transmit audio signals to the wireless communication module 160 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.

The MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 . MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc. In some embodiments, the processor 110 communicates with the camera 193 through a CSI interface, so as to realize the photographing function of the electronic device 100 . The processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the electronic device 100 .

The GPIO interface can be configured by software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.

The USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like. The USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones. The interface can also be used to connect other electronic devices, such as AR devices.

It can be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.

The charging management module 140 is used to receive charging input from the charger.

The power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 . The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110 , the internal memory 121 , the external memory, the display screen 194 , the camera 193 , and the wireless communication module 160 . The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110 . In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.

The wireless communication function of the electronic device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.

Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like. The mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor, and then convert it into an electromagnetic wave for radiation through the antenna 1. In some embodiments, at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110.

The modem processor may include a modulator and a demodulator. Wherein, the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and passed to the application processor. The application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 . In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.

The wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including UWB, wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared technology (IR). The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2.

In some embodiments, the antenna 1 of the electronic device 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the Beidou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS) and/or satellite based augmentation systems (SBAS).

The electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), and so on. In some embodiments, the electronic device 100 may include one or N display screens 194, where N is a positive integer greater than one.

In some embodiments of the present application, the display screen 194 displays the interface content currently output by the system. For example, the interface content is an interface provided by an instant messaging application.

The electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.

The ISP is used to process the data fed back by the camera 193 . For example, when taking a photo, the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and skin tone. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193 .

Camera 193 is used to capture still images or video.

A digital signal processor is used to process digital signals. In addition to digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the energy of the frequency point, and so on.
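As a rough illustration of the Fourier-transform step mentioned above, the following minimal sketch computes the energy of each frequency bin of a short sampled signal with a direct discrete Fourier transform. The function names `dft_energy` and `strongest_bin` and the test tone are hypothetical and are not taken from the present application; an actual digital signal processor would typically use a fast Fourier transform in hardware or firmware.

```python
import cmath
import math


def dft_energy(samples):
    """Return the energy |X[k]|^2 of each DFT bin for a real-valued signal."""
    n = len(samples)
    energies = []
    for k in range(n):
        # Direct O(n^2) evaluation of the discrete Fourier transform.
        x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        energies.append(abs(x_k) ** 2)
    return energies


def strongest_bin(samples):
    """Index of the bin with the most energy in the lower half of the spectrum."""
    energies = dft_energy(samples)
    half = energies[: len(samples) // 2 + 1]
    return max(range(len(half)), key=half.__getitem__)


# Hypothetical test tone: exactly 3 cycles across 32 samples.
tone = [math.sin(2 * math.pi * 3 * i / 32) for i in range(32)]
```

Calling `strongest_bin(tone)` selects the bin whose index equals the number of cycles in the tone, which is the sense in which the transform exposes the energy at a given frequency point.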

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in various encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, such as the transfer mode between neurons in the human brain, it can quickly process the input information, and can continuously learn by itself. Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.

The internal memory 121 may include one or more random access memories (RAM) and one or more non-volatile memories (NVM).

Random access memory can include static random-access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM; for example, the fifth generation DDR SDRAM is generally called DDR5 SDRAM), etc.

Non-volatile memory may include magnetic disk storage devices and flash memory.

Flash memory can be divided into NOR FLASH, NAND FLASH, 3D NAND FLASH, etc. according to the operating principle; into single-level cell (SLC), multi-level cell (MLC), triple-level cell (TLC), quad-level cell (QLC), etc. according to the potential levels of the storage cell; and into universal flash storage (UFS), embedded multimedia card (eMMC), etc. according to the storage specification.

The random access memory can be directly read and written by the processor 110, and can be used to store executable programs (eg, machine instructions) of an operating system or other running programs, and can also be used to store data of users and application programs.

The non-volatile memory can also store executable programs and store data of user and application programs, etc., and can be loaded into the random access memory in advance for the processor 110 to directly read and write.

The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, files such as music and videos are saved on the external memory card.

The electronic device 100 may implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.

The audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .

The speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 100 can play music or a hands-free call through the speaker 170A.

The receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the human ear.

The microphone 170C, also called a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.

The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.

The pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. The gyro sensor 180B may be used to determine the motion attitude of the electronic device 100. The air pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a Hall sensor. The electronic device 100 can detect the opening and closing of a flip holster using the magnetic sensor 180D. The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes), and can detect the magnitude and direction of gravity when the electronic device 100 is stationary. The acceleration sensor 180E can also be used to identify the posture of the electronic device, and can be used in applications such as switching between landscape and portrait modes and pedometers. The distance sensor 180F is used to measure distance. The electronic device 100 can measure distance through infrared or laser. The proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes. The ambient light sensor 180L is used to sense ambient light brightness. The electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures. The ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket, so as to prevent accidental touch. The fingerprint sensor 180H is used to collect fingerprints. The electronic device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking pictures with fingerprints, answering incoming calls with fingerprints, and the like.
The temperature sensor 180J is used to detect the temperature.

The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touchscreen, also called a "touch-controlled screen". The touch sensor 180K is used to detect a touch operation acting on or near it, where the touch operation refers to an operation of a user's hand, elbow, stylus, etc. touching the display screen 194. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to touch operations may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display screen 194.

The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the bone that vibrates when a person speaks. The bone conduction sensor 180M can also contact the pulse of the human body and receive the blood pressure pulse signal. In some embodiments, the bone conduction sensor 180M can also be disposed in an earphone to form a bone conduction earphone. The audio module 170 can parse out the voice signal based on the vibration signal of the vocal vibrating bone obtained by the bone conduction sensor 180M, so as to realize the voice function. The application processor can parse heart rate information based on the blood pressure pulse signal obtained by the bone conduction sensor 180M, and realize the function of heart rate detection.

The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100.

Motor 191 can generate vibrating cues. The motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, playing audio, etc.) can correspond to different vibration feedback effects. The motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 . Different application scenarios (for example: time reminder, receiving information, alarm clock, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also support customization.

The indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.

The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into contact with or separated from the electronic device 100 by being inserted into or pulled out of the SIM card interface 195.

FIG. 3 shows a block diagram of the software structure of the electronic device 100 according to the embodiment of the present application.

The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.

The application layer can include a series of application packages. The application package may include, for example, applications such as camera, gallery, calendar, calling, map, navigation, WLAN, Bluetooth, music, video, games, shopping, travel, instant messaging (such as short messages). In addition, the application package may also include: the main screen (ie the desktop), the negative screen, the control center, the notification center and other system applications.

As shown in FIG. 3 , the application layer in the embodiment of the present application includes a voice assistant and a voice processing module.

The voice processing module provides a voice processing capability, and any application program, such as the voice assistant application, can invoke this capability. For example, the electronic device 100 receives a voice signal through the voice assistant application, and the voice assistant application invokes the voice processing module to process the voice signal. The voice processing module includes the capabilities of speech recognition (automatic speech recognition, ASR), semantic understanding (natural language understanding, NLU), dialogue management (dialog management, DM), natural language generation (NLG), and speech synthesis (text to speech, TTS), where:

The speech recognition module is used for recognizing the speech signal to obtain the textual representation information of the speech signal. Specifically, the speech recognition module can first represent the speech signal as text data, and then perform word segmentation processing on the text data to obtain text representation information of the speech signal, that is, convert the words in the speech signal into input readable by the electronic device 100, including, for example, binary encodings, character sequences, etc. A typical speech recognition method can be, for example, a method based on a vocal tract model and speech knowledge, or a template matching method (the similarity between the feature vector of the input speech signal and each template in the template library is compared in turn, and the template with the highest similarity is taken as the recognition result). The embodiment of the present application does not limit which speech recognition method is used to perform speech recognition processing.
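The template matching method mentioned above can be sketched as follows. This is only an illustrative sketch: cosine similarity is an assumed similarity measure (the present application does not specify one), and the three-dimensional feature vectors, the `TEMPLATES` library and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_template(feature, templates):
    """Compare the feature vector with each template in turn; return the best (label, score)."""
    best_label, best_score = None, -1.0
    for label, template in templates.items():
        score = cosine_similarity(feature, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical template library: one feature vector per recognizable word.
TEMPLATES = {
    "call": [0.9, 0.1, 0.0],
    "text": [0.1, 0.8, 0.2],
    "play": [0.0, 0.2, 0.9],
}
```

For example, `match_template([0.8, 0.2, 0.1], TEMPLATES)` selects the "call" template, because that template points in nearly the same direction as the input feature vector.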

The semantic understanding module is used to convert the textual representation information of the speech signal into semantic information that the electronic device 100 can understand. Semantic information includes entities, triples, intents, events, and so on. With this information, the electronic device 100 can understand the user's language and determine what the user wants to do.

The dialog management module is used to determine the next action to be performed by the electronic device 100 based on the semantic information. The actions to be performed include one or more of the following: playing the voice reply content (for example, providing a result, asking about a specific restriction, or clarifying or confirming a requirement); displaying the text content of the voice reply content; jumping to the corresponding interface; and so on.

Specifically, the dialogue management module determines the intent expressed in the semantic information, and then fills the slot corresponding to the intent according to the semantic information. The intent is what the user wants to do, and the slot corresponding to the intent is the information needed to complete the intent. For example, if the intent is "call", the slot corresponding to "call" is who to call, that is, the object of the call. As another example, if the intent is "send text message", there are two slots corresponding to "send text message", which are the recipient of the text message and the content of the text message.
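The relationship between an intent and its slots can be sketched with a small data structure. The sketch below is an assumption for illustration only: the intent names, slot names and the functions `fill_slots` and `missing_slots` are hypothetical and do not come from the present application. It shows how slots filled in one round of dialogue can be carried over and completed in the next round.

```python
# Hypothetical mapping from each intent to the slots it requires.
INTENT_SLOTS = {
    "call": ["contact"],
    "send_text_message": ["recipient", "content"],
}


def fill_slots(intent, semantic_info, filled=None):
    """Merge values found in semantic_info into the slots required by the intent.

    `filled` carries slots collected in earlier rounds of the dialogue.
    """
    slots = dict(filled or {})
    for name in INTENT_SLOTS[intent]:
        value = semantic_info.get(name)
        if value is not None:
            slots[name] = value
    return slots


def missing_slots(intent, slots):
    """Return the required slots that still have no value."""
    return [name for name in INTENT_SLOTS[intent] if name not in slots]


# Round 1: the user only names the recipient.
round1 = fill_slots("send_text_message", {"recipient": "Alice"})
# Round 2: the user supplies the content; earlier slots are kept.
round2 = fill_slots("send_text_message", {"content": "On my way"}, round1)
```

After the first round, `missing_slots("send_text_message", round1)` reports that the content is still needed; after the second round no slot is missing and the intent can be carried out.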

In essence, dialogue management is a decision-making process. The dialogue management module continuously determines the next action to be performed according to the current state during the voice interaction process, thereby assisting the user to complete the task of information acquisition or service acquisition. If this action requires voice interaction with the user, the natural language generation module will be triggered to generate language text that the user can understand; finally, the generated language text will be played by the speech synthesis module to the user.

The natural language generation module is used to convert data sets in non-linguistic formats into textual information in language formats that users can understand. The natural language generation module determines what information should be included in the language text being constructed, and organizes the text in a reasonable order, combining multiple pieces of information into a single sentence. Then choose some connecting words and phrases to form a well-structured complete sentence.
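A very simple way to picture this conversion is a fixed-template generator, sketched below. Template-based generation is an assumed, simplified technique chosen for illustration; the present application does not state how the natural language generation module is implemented, and the function `describe_message` and the slot names are hypothetical.

```python
def describe_message(slots):
    """Turn a non-linguistic slot dictionary into one readable sentence."""
    facts = []
    # Content selection and ordering: recipient first, then message content.
    if slots.get("recipient"):
        facts.append(f"the recipient is {slots['recipient']}")
    if slots.get("content"):
        facts.append(f'the content is "{slots["content"]}"')
    if not facts:
        return "I have no details for this message yet."
    # Aggregation: join the selected facts with a connecting word.
    return "For this message, " + " and ".join(facts) + "."
```

Given both slots, the function combines the two pieces of information into a single sentence joined by "and"; given an empty dictionary, it falls back to a fixed sentence.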

The speech synthesis module is used to convert the textual information produced by the natural language generation module into artificial speech by mechanical and electronic means.

The application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions.

As shown in FIG. 3, the application framework layer may include an input manager, a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, a display manager, an activity manager, etc. For ease of description, in FIG. 3, the application framework layer is illustrated by taking an example including an input manager, a window manager, a content provider, a view system, and an activity manager. It should be noted that any two modules among the input manager, window manager, content provider, view system, and activity manager can call each other.

The input manager is used to receive instructions or requests reported by lower layers such as the kernel layer and the hardware abstraction layer.

A window manager is used to manage window programs. The window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc. In this application, the window manager is used to display a window including one or more shortcut controls when the electronic device 100 meets a preset trigger condition.

The activity manager is used to manage the activities running in the system, including process, application, service, task information and so on.

Content providers are used to store and retrieve data and make these data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.

The view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications. A display interface can consist of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures. In this application, the view system is used to display a shortcut area on the display screen 103 when the electronic device 100 meets the preset trigger condition, and the shortcut area includes one or more shortcut controls added by the electronic device 100 . Wherein, the present application does not limit the position and layout of the shortcut area, as well as the icons, positions, layout and functions of the controls in the shortcut area.

The display manager is used to transfer display content to the kernel layer.

The telephony manager is used to provide the communication function of the electronic device 100, for example, the management of call status (including connecting, hanging up, etc.).

The resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.

The notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc. The notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.

Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.

The core library consists of two parts: one is the functions that the Java language needs to call, and the other is the core library of Android.

The application layer and the application framework layer run in virtual machines. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.

A system library can include multiple functional modules. For example: surface manager (surface manager), media library (media library), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.

The Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.

The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.

The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.

2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software. The kernel layer at least includes a display driver, a camera driver, an audio driver, a sensor driver, a touch chip driver, an input system, and the like. For the convenience of description, in FIG. 3, the kernel layer is illustrated by taking the input system, the touch chip driver, the display driver and the storage driver as an example. The display driver and the storage driver may be jointly arranged in the driver module.

It can be understood that the structures illustrated in this application do not constitute a specific limitation on the electronic device 100 . In other embodiments, the electronic device 100 may include more or fewer components than shown, or some components may be combined, or some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The following describes the technical principles of the embodiments of the present application in combination with the voice interaction system 10. FIG. 4 shows a voice interaction process in the voice interaction system 10. The cloud server 200 communicates with the electronic device 100 through a network.

First, the electronic device 100 detects a voice signal input and activates the voice interaction function. In some embodiments, the electronic device may receive the first voice signal through a voice assistant application (APP). The electronic device 100 detects whether the received voice signal contains a target object (for example, the target object is a preset wake-up word), and if it contains the target object, it enters the interactive state and activates the voice interaction function. The target object can be preset when the electronic device 100 leaves the factory, can be preset in the voice assistant application, or can be set by the user in the process of using the electronic device 100. This application does not limit the length and content of the target object.

The electronic device 100 performs distribution control on the received voice signal based on a preset rule, and the distribution path includes path 1 and path 2. The preset rule includes the following: when the network quality is good, the electronic device 100 uploads the received voice signal to the cloud server 200 for processing (path 1), where good network quality means that the electronic device 100 and the cloud server 200 can perform data transmission (including uplink and downlink data transmission); when the network quality is poor or the network is disconnected, the electronic device 100 processes the received voice signal locally (path 2), where poor network quality or disconnection means that the electronic device 100 and the cloud server 200 cannot perform data transmission (including uplink or downlink data transmission), or that the data transmission rate is lower than a threshold. The preset rule can also distribute the voice signal according to the intent corresponding to the recognized voice signal. In short, when the intent corresponding to the voice signal can be completed locally, such as making a call, sending a text message, or opening the gallery, the processing can be performed on the electronic device 100; when the intent corresponding to the voice signal requires the network, such as searching web pages or playing music online, the processing can be performed on the cloud server 200.
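The distribution rule described above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the threshold value `MIN_RATE_KBPS`, the set `LOCAL_INTENTS`, the intent names and the function `choose_path` are all hypothetical, and the way the two rules are combined (network check first, then intent check) is one possible reading rather than the method of the present application.

```python
from enum import Enum


class Path(Enum):
    CLOUD = 1   # path 1: processed on the cloud server 200
    DEVICE = 2  # path 2: processed on the electronic device 100


# Hypothetical values; the source only says "lower than the threshold".
MIN_RATE_KBPS = 64.0
LOCAL_INTENTS = {"call", "send_text_message", "open_gallery"}


def choose_path(connected, rate_kbps, intent=None):
    """Pick a processing path from network state and, if known, the intent."""
    # Poor or lost connectivity: the device must handle the signal itself.
    if not connected or rate_kbps < MIN_RATE_KBPS:
        return Path.DEVICE
    # Intents that can be completed locally may stay on the device.
    if intent in LOCAL_INTENTS:
        return Path.DEVICE
    # Otherwise (network-dependent or unknown intent), use the cloud server.
    return Path.CLOUD
```

With a good connection, `choose_path(True, 500.0, "search_web")` selects path 1, while a dropped connection or a rate below the assumed threshold selects path 2 regardless of the intent.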

Next, the processing of the voice signal on the cloud server 200 and the processing of the voice signal on the electronic device 100 will be described respectively.

Path 1: The voice signal is processed on the cloud server 200 .

Step ①: the electronic device 100 uploads the voice signal to the cloud server 200. The cloud server 200 receives the voice signal, recognizes it through the speech recognition technology (ASR), and converts it into text representation information, that is, converts the vocabulary in the voice signal into input readable by the cloud server 200, including, for example, binary codes, character sequences, and the like.

In some embodiments, the cloud server 200 may compare the feature vector of the input speech signal with each template in the template library in turn, take the template with the highest similarity as the recognition result, output the text data, and then perform word segmentation on the text data to obtain the text representation information of the speech signal. Optionally, the cloud server 200 may also use a trained vocal tract model, neural network model, or the like to calculate the text representation information corresponding to the speech signal.
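The template comparison described above can be illustrated with a minimal sketch. Cosine similarity is assumed here as the similarity measure, and the function names and the dictionary-based template library are hypothetical; the application does not name a specific measure or data structure.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(feature: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the text of the template most similar to the input feature vector."""
    return max(templates, key=lambda text: cosine_similarity(feature, templates[text]))
```

With a hypothetical library `{"call": [1.0, 0.0], "gallery": [0.0, 1.0]}`, the feature vector `[0.9, 0.1]` is matched to the template for "call".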

It should be noted that when the cloud server 200 recognizes the voice signal through the voice recognition technology, the recognition may also include some preprocessing operations on the voice signal, such as sampling, quantization, removing voice data that does not contain voice content (e.g., silent voice data), framing and windowing the voice data, and so on.
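A minimal sketch of three of the preprocessing operations mentioned above (framing, silence removal, and windowing) is given below. The Hamming window, the energy-based silence test, and all parameter values are assumptions chosen for illustration; the application does not prescribe them.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split the sampled signal into overlapping frames of frame_len samples."""
    return [list(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, hop)]


def hamming(n: int) -> List[float]:
    """Hamming window coefficients (an assumed, commonly used window)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def preprocess(samples: Sequence[float], frame_len: int = 4, hop: int = 2,
               energy_floor: float = 1e-4) -> List[List[float]]:
    """Frame the signal, drop frames whose mean energy is below the floor, window the rest."""
    window = hamming(frame_len)
    kept = []
    for frame in frame_signal(samples, frame_len, hop):
        energy = sum(s * s for s in frame) / frame_len
        if energy < energy_floor:
            continue  # treated as silence: carries no voice content
        kept.append([s * w for s, w in zip(frame, window)])
    return kept
```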

Step ②, after speech recognition, the cloud server 200 converts the text representation information into machine-understandable semantic information through the semantic understanding technology (NLU).

In some embodiments, the execution of the semantic understanding technology can be simply understood as the following steps. First, the cloud server 200 divides the text representation information obtained by speech recognition into a series of units with semantics and grammar; the term "token" is usually used to refer to a unit obtained by text segmentation. A common text segmentation method is "word segmentation", that is, the text is segmented at the granularity of "words". Models used for word segmentation may include first-order Markov models, hidden Markov models, conditional random fields, recurrent neural networks, and the like.

Then, based on the token sequence, a text representation model such as a word vector space model or a distributed representation model is used to obtain a numerical vector or matrix, which is the numerical representation of the text. Next, based on the numerical representation of the text, classification algorithms, sequence labeling methods, and the like are used to calculate the "key information" (that is, the semantic information), such as entities, triples, intents, and events. With this information, the cloud server 200 can understand the user's language and determine what the user wants to do.

Step ③, the cloud server 200 performs dialog management based on the semantic information. Dialog management refers to a process in which the cloud server 200 determines the action to be executed next based on the semantic information. The action includes one or more of the following: playing the voice reply content (such as providing results, asking about specific restrictions, or clarifying or confirming requirements); displaying the text content of the voice reply content; jumping to the corresponding interface; and so on.

In some embodiments, the cloud server 200 determines the intent expressed in the semantic information, and then fills the slots corresponding to the intent according to the semantic information. The intent is what the user wants to do, a slot corresponding to the intent is a piece of information needed to complete the intent, and an intent can correspond to one or more slots. The cloud server 200 fills the slots of the intent based on the semantic information. If the information in one or more of the slots is missing due to insufficient semantic information, the cloud server 200 determines that the next action to be performed is to further inquire about the missing slots; if no slot information is missing, the user's intent is converted into an explicit instruction, and the electronic device 100 is instructed to perform the corresponding action.

For example, the cloud server 200 acquires the semantic information of the voice signal "Please help me open the gallery", and determines according to the semantic information that the user's intent is to open an object, and that the slot corresponding to the intent is the object to be opened. The cloud server 200 fills the slot according to the semantic information of the voice signal, and determines that the object to be opened is the gallery. The dialog management then determines an explicit instruction based on the semantic information, that is, an instruction to open the gallery.

The above content provides an example in which the cloud server 200 determines the operation instruction based on the user's voice signal under the condition that the slot information is not missing. In the case where one or more slots are missing due to insufficient semantic information, the cloud server 200 needs to save the current intent and slot information, and further inquire about the missing slots. Typically, this situation is called a multi-turn conversation.

For example, the cloud server 200 obtains the semantic information of the voice signal "I want to make a call", and determines according to the semantic information that the user's intent is to make a call, and that the slot corresponding to the intent is the object of the call. Because this slot is missing due to insufficient semantic information, the dialog management determines an explicit instruction based on the semantic information, that is, to play voice reply content that further inquires about the missing slot.

Optionally, the dialogue management may determine another explicit instruction based on the semantic information of the voice signal, that is, to display the text content of the voice reply content.

When the cloud server 200 receives a voice signal next time, the cloud server 200 repeats the above steps ① and ②, and then, based on the saved intent and slot information, fills the missing slot with the semantic information corresponding to that voice signal. If all the slots are completely filled, that is, no slot information is missing, the user's intent is converted into an explicit instruction, and the electronic device 100 is instructed to perform the corresponding action. In the above example, the intent saved by the cloud server 200 is to make a call, and the missing slot is the object of the call. When the cloud server 200 next receives the voice signal "Xiao Ming", it fills the slot "object of the call" with the semantic information corresponding to the voice signal "Xiao Ming", and thereby determines the instruction, that is, to call Xiao Ming.
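The slot-filling dialog management described above, including the multi-turn case, may be sketched as follows. The slot schema `REQUIRED_SLOTS`, the intent and slot names, and the shape of the `semantic` dictionary are hypothetical and serve only to illustrate the flow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: which slots each intent needs before it can be executed.
REQUIRED_SLOTS = {"make_call": ["callee"], "open": ["target"]}


@dataclass
class DialogState:
    intent: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Names of the required slots that are still unfilled."""
        return [n for n in REQUIRED_SLOTS[self.intent] if self.slots.get(n) is None]


def dialog_step(state: Optional[DialogState], semantic: dict) -> Tuple[DialogState, dict]:
    """One dialog-management turn: fill slots, then either ask or execute."""
    if state is None:
        # First turn: the intent comes from the semantic information.
        state = DialogState(semantic["intent"])
    # Fill slots from this turn's semantic information, using the saved intent.
    for name, value in semantic.get("slots", {}).items():
        if name in REQUIRED_SLOTS[state.intent]:
            state.slots[name] = value
    missing = state.missing()
    if missing:
        return state, {"action": "ask", "slot": missing[0]}
    return state, {"action": "execute", "intent": state.intent, "slots": dict(state.slots)}
```

In this sketch, a first turn with `{"intent": "make_call"}` yields an "ask" action for the hypothetical `callee` slot, and a second turn carrying `{"slots": {"callee": "Xiao Ming"}}` yields the "execute" action.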

Step ④, after the cloud server 200 determines the action to be performed next, if the action requires voice interaction with the user, such as outputting voice reply content, the cloud server 200 can generate language text that the user can understand based on the natural language generation technology, and then perform speech synthesis on the generated language text to generate voice data.

In some embodiments, the cloud server 200 determines which information should be included in the language text being constructed, and arranges the pieces of information in a reasonable order. It then selects conjunctions and phrases to combine this information into a well-structured, complete sentence.

Step ⑤, the cloud server 200 sends an instruction to the electronic device 100 to instruct the electronic device 100 to perform an action.

In some embodiments, based on the above steps ①②③④, the cloud server 200 sends an instruction with voice data (voice reply content) to the electronic device 100, instructing the electronic device 100 to output the voice reply content.

Optionally, the cloud server 200 sends an instruction with text data (text content of the voice reply content) to the electronic device 100 to instruct the electronic device 100 to display the text data.

In some embodiments, step ④ is optional, and if the next action determined by the cloud server 200 does not need to output the content of the voice reply, step ④ does not need to be performed. Based on the above steps ①②③, the cloud server 200 sends an instruction to the electronic device 100 to instruct the electronic device 100 to perform interface jumping.

Path 2: The voice signal is processed on the electronic device 100.

The electronic device 100 receives the voice signal, recognizes the voice signal through a voice recognition technology, and converts the voice signal into text representation information. The textual representation information is then transformed into machine-understandable semantic information through semantic understanding techniques. Then, the electronic device 100 determines the next action to be performed based on the semantic information. If the action requires voice interaction with the user, such as outputting voice reply content, the electronic device 100 can generate language text that the user can understand based on the natural language generation technology, and then perform speech synthesis on the generated language text to generate voice data. The electronic device 100 outputs the voice data. If the next action determined by the electronic device 100 does not need to output the content of the voice response, the electronic device 100 does not need to use the speech synthesis technology, and the electronic device 100 performs interface jumping.

It should be noted that, based on the same inventive concept, the principles of speech recognition, semantic understanding, dialogue management, and speech synthesis in path 2 of the embodiment shown in FIG. 4 are similar to those in path 1. Therefore, for the implementation process of speech recognition, semantic understanding, dialogue management, and speech synthesis in path 2, reference may be made to the corresponding descriptions of the cloud server 200 in steps ①②③④⑤ of path 1 above, which will not be repeated here.

To sum up, the embodiment shown in FIG. 4 describes the implementation principle of the voice interaction system in detail. In some cases, the electronic device 100 distributes the voice service based on the network condition. If the network condition is good, the voice signal is uploaded to the cloud server 200 for processing, that is, path 1 above; if the network is disconnected or the network quality is poor, the voice signal is processed on the electronic device 100, that is, path 2 above. In this case, if the path is switched during processing of the voice service due to network reasons, the original voice service cannot continue to be executed, which affects the user experience.

For example, when the cloud server 200 receives a voice signal for the first time, if the semantic information of the voice signal is insufficient and one or more of the slots are missing, the cloud server 200 needs to save the current intent and slot information and further inquire about the missing slots. For a voice service that requires multiple rounds of dialogue, if the network is interrupted before the cloud server 200 receives the next voice signal, the cloud server 200 cannot receive that voice signal, and the electronic device 100 distributes it to the electronic device 100 for processing. Because the electronic device 100 does not have the saved intent and slot information, it cannot continue the original voice service based on the semantic information of the next voice signal alone; the original voice service is interrupted, which affects the user experience.

In combination with the voice interaction system 10 of the embodiment of the present application, the embodiment of the present application further provides a voice interaction processing method. When the cloud server 200 sends an instruction to the electronic device 100 to instruct the electronic device 100 to perform a corresponding action, it synchronously sends the context of the voice dialogue (intent and slot information) to the electronic device 100. If a network interruption during a voice service with multiple rounds of dialogue causes an end-cloud switch (a switch between the electronic device 100 and the cloud server 200, that is, a switch between path 1 and path 2), the electronic device 100 can still continue to execute the original voice service based on the context of the voice dialogue and the next voice signal received, thereby solving the problem of interruption of voice services with multiple rounds of dialogue.

The following specifically introduces the step flow of a voice interaction processing method provided by the present application. As shown in FIG. 5A , FIG. 5A shows a voice interaction process in the voice interaction system 10 .

At time T1, the network quality is good, the electronic device 100 starts the voice interaction function, and the electronic device 100 receives voice 1. The electronic device 100 performs distribution control on the received voice 1 based on preset rules. For example, because the network quality is good at this time, the electronic device 100 uploads the received voice 1 to the cloud server 200 for processing. The processing actions include speech recognition, semantic understanding, dialogue management, speech synthesis, and other processes. In this embodiment of the present application, voice 1 may also be referred to as the first voice signal.

Based on the same inventive concept, the principles of speech recognition, semantic understanding, dialogue management, and speech synthesis at time T1 in the embodiment shown in FIG. 5A are similar to those of path 1 in the embodiment shown in FIG. 4. Therefore, for the implementation process of speech recognition, semantic understanding, dialogue management, and speech synthesis performed by the cloud server 200 at time T1, reference may be made to the corresponding descriptions of steps ①②③④ of the cloud server 200 in path 1 of FIG. 4, and details are not repeated here.

The cloud server 200 determines the next action to be performed, and sends an instruction to the electronic device 100 to instruct the electronic device 100 to perform action 1; the cloud server 200 also synchronously sends the voice dialogue context to the electronic device 100, where the dialogue context refers to the intent and slot information obtained by the cloud server 200 by recognizing and understanding voice 1. Action 1 includes one or more of the following: playing the voice reply content for voice 1 (such as providing results, asking about specific constraints, or clarifying or confirming needs); displaying the text content of the voice reply content; jumping to the corresponding interface; and so on.

After receiving the instruction and the dialog context, the electronic device 100 forwards the instruction and the dialog context through the dialog information forwarding module, and the electronic device 100 executes action 1 based on the instruction and saves the dialog context. The dialogue information forwarding module may be regarded as a node that receives data sent by the cloud server 200, and is used for receiving and forwarding the data.
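One possible shape of this exchange is sketched below: the cloud server bundles the instruction and the dialogue context into one message, and the device executes the instruction and saves the context. JSON is assumed as the wire format, and the field names, the `Device` class, and `build_cloud_response` are hypothetical; the application does not define a message format.

```python
import json
from typing import Dict, Optional


def build_cloud_response(instruction: dict, intent: str,
                         slots: Dict[str, Optional[str]]) -> str:
    """Cloud side: send the instruction together with the dialogue context."""
    return json.dumps({
        "instruction": instruction,
        "context": {"intent": intent, "slots": slots},  # unfilled slots carry None
    })


class Device:
    """Device side: a stand-in for the dialogue information forwarding module."""

    def __init__(self) -> None:
        self.saved_context: Optional[dict] = None
        self.executed: list = []

    def on_cloud_response(self, payload: str) -> None:
        message = json.loads(payload)
        # Forward the instruction for execution (recorded here instead of executed).
        self.executed.append(message["instruction"])
        # Save the dialogue context for a possible later switch to on-device processing.
        self.saved_context = message["context"]
```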

At time T2, the network quality is poor. After the electronic device 100 outputs the voice reply content for voice 1, it receives voice 2. Due to the poor network quality at this time, data transmission cannot be achieved between the electronic device 100 and the cloud server 200, so the electronic device 100 fails to upload voice 2 to the cloud server 200. The electronic device 100 invokes its own voice processing capability to process voice 2, and the processing actions include speech recognition, semantic understanding, dialogue management, speech synthesis, and other processes. For the implementation process of speech recognition, semantic understanding, dialogue management, and speech synthesis of the electronic device 100 at time T2, reference may be made to the corresponding description of the electronic device 100 in path 2 of FIG. 4, which will not be repeated here. In this embodiment of the present application, voice 2 may also be referred to as the second voice signal.

It should be noted that, different from path 2 above, in the dialog management part of the embodiment of the present application, the electronic device 100 determines the next action to be performed based on voice 2 and the dialog context saved at time T1. The electronic device 100 fills the missing slots based on the semantic information corresponding to voice 2 together with the saved intent and slot information. If the semantic information corresponding to voice 2 is insufficient, so that the slot information is not completely filled and one or more slots are still missing, the electronic device 100 determines that the next action to be performed is to further inquire about the missing slots; if no slot information is missing, the user's intent is converted into an explicit instruction, and the electronic device 100 performs the corresponding action.
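The on-device continuation described above can be sketched as follows, assuming the saved dialogue context is a dictionary holding the intent and a slot map in which unfilled slots are `None`. The function name `resume_locally` and the dictionary layout are assumptions of this sketch.

```python
from typing import Dict, Optional


def resume_locally(saved_context: dict, local_slots: Dict[str, Optional[str]]) -> dict:
    """Continue a dialogue on the device using the context saved at time T1.

    saved_context: {"intent": ..., "slots": {...}} as received from the cloud server.
    local_slots:   slot values the device extracted from voice 2.
    """
    slots = dict(saved_context["slots"])  # copy, so the saved context stays untouched
    for name, value in local_slots.items():
        # Only fill slots that belong to the saved intent and are still missing.
        if name in slots and slots[name] is None:
            slots[name] = value
    missing = [name for name, value in slots.items() if value is None]
    if missing:
        # Still incomplete: inquire further and keep the updated context.
        return {"action": "ask", "slot": missing[0],
                "context": {"intent": saved_context["intent"], "slots": slots}}
    return {"action": "execute", "intent": saved_context["intent"], "slots": slots}
```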

Time T2 falls within the period after the electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200 for voice 1 and before the electronic device 100 uploads voice 2 to the cloud server 200. For example, time T2 may be while the electronic device 100 is receiving voice 2, or after voice 2 is received and before it is uploaded to the cloud server 200. That is, due to poor network quality at time T2, voice 2 cannot be uploaded to the cloud server 200.

In some embodiments, the time T2 may also be after the electronic device 100 uploads the voice 2 to the cloud server 200 and before the cloud server 200 sends the instruction to the electronic device 100 . That is, due to poor network quality, the cloud server 200 cannot deliver the instruction generated for the voice 2 to the electronic device 100 . As shown in FIG. 5B , the electronic device 100 uploads the voice 2 to the cloud server 200, and the cloud server 200 processes the voice 2, and the processing actions include voice recognition, semantic understanding, dialogue management, and speech synthesis. At this time, a network quality problem occurs, data transmission cannot be realized between the electronic device 100 and the cloud server 200 , and the cloud server 200 cannot deliver the command generated for the voice 2 to the electronic device 100 .

Optionally, if the electronic device 100 does not receive the instruction issued by the cloud server 200 for voice 2 within a preset time after uploading voice 2 to the cloud server 200, the electronic device 100 invokes its own voice processing capability to process voice 2 (for example, a backup copy of voice 2).

Optionally, if, after uploading voice 2 to the cloud server 200 and before receiving the instruction issued by the cloud server 200 for voice 2, the electronic device 100 detects that the network connection with the cloud server 200 is currently disconnected, the electronic device 100 invokes its own voice processing capability to process voice 2 (for example, a backup copy of voice 2).

For the processing process, reference may be made to the corresponding description of the voice 2 by the electronic device 100 at time T2 in FIG. 5A , which will not be repeated here.
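The two optional fallbacks above (no instruction within a preset time, or a detected disconnection) might be sketched as follows. The callables `upload` and `process_local` are hypothetical stand-ins for the cloud round trip and the on-device pipeline, and signalling the two failure cases with Python's built-in `TimeoutError` and `ConnectionError` is an assumption of this sketch.

```python
from typing import Callable, Tuple


def process_voice(voice: bytes,
                  upload: Callable[[bytes], str],
                  process_local: Callable[[bytes], str]) -> Tuple[str, str]:
    """Try the cloud first; fall back to on-device processing of a backup copy.

    upload is assumed to raise TimeoutError when no instruction arrives within the
    preset time, and ConnectionError when the connection is found to be disconnected.
    """
    backup = bytes(voice)  # backup copy of voice 2, kept on the device
    try:
        return "cloud", upload(voice)
    except (TimeoutError, ConnectionError):
        # The cloud instruction cannot be delivered: process the backup locally.
        return "device", process_local(backup)
```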

In this way, in the process of voice interaction, every time the cloud server 200 issues an instruction, it synchronously sends the dialogue context to the electronic device 100, and the electronic device 100 receives and saves the dialogue context. The electronic device 100 can also continue to process the voice service processed on the cloud server 200 based on the saved dialogue context, so that the voice service is not interrupted, the processing efficiency of the voice service is improved, and the user experience is improved.

In some embodiments, each time the cloud server 200 issues an instruction, it sends the dialogue context, that is, the intent and the slot information, to the electronic device 100 only if at least one piece of the one or more pieces of slot information is missing.

Specifically, referring to FIG. 5A, at time T1, in the process of processing voice 1, the cloud server 200 determines, based on the semantic information corresponding to voice 1, the intent expressed by the semantic information and the slot information corresponding to the intent, where an intent can correspond to one or more slots. The cloud server 200 fills the slot information of the intent based on the semantic information. If all the slots are completely filled, that is, no slot information is missing, the user's intent is converted into an explicit instruction, and the cloud server 200 sends the instruction to the electronic device 100 to instruct the electronic device 100 to perform the corresponding action. For example, the cloud server 200 obtains the voice signal "Please help me open the gallery", and determines according to the semantic information of the voice signal that the user's intent is to open an object, and that the slot corresponding to the intent is the object to be opened. The cloud server 200 fills the slot according to the semantic information of the voice signal, and determines that the object to be opened is the gallery. The dialog management then determines an explicit instruction based on the semantic information, that is, an instruction to open the gallery. Because the user's intent is completed at this point and the cloud server 200 determines that the intent has ended, the cloud server 200 does not need to send the dialog context (intent and slot information) to the electronic device 100, which saves resources.

In the case where one or more slots are missing due to insufficient semantic information, the cloud server 200 needs to save the current intent and slot information, further inquire about the missing slots, and, when the next voice signal is received, fill the slots with the next voice signal in combination with the saved intent and slot information to determine the next action to be performed. In the embodiment of the present application, the cloud server 200 generates the voice reply content for voice 1 based on the speech synthesis technology, sends an instruction to the electronic device 100 to instruct the electronic device 100 to output the voice reply content, and synchronously sends the dialogue context (the intent and slot information corresponding to voice 1); the electronic device 100 receives and saves the dialogue context. In this way, even if the network is interrupted when the electronic device 100 receives the next voice signal, the electronic device 100 can use its own voice interaction capability, in combination with the saved dialogue context, to process the next voice signal received, which improves the processing efficiency of voice services and the user experience.
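The conditional delivery of the dialogue context described in this embodiment (context sent only while at least one slot is missing) can be illustrated with a short sketch. The function `build_reply` and the reply layout are hypothetical.

```python
from typing import Dict, Optional


def build_reply(intent: str, slots: Dict[str, Optional[str]]) -> dict:
    """Cloud side: attach the dialogue context only when a slot is still missing."""
    missing = [name for name, value in slots.items() if value is None]
    if missing:
        # Multi-turn case: inquire about a missing slot and send the context along,
        # so the device can continue the dialogue if the network later fails.
        return {
            "instruction": {"type": "ask", "slot": missing[0]},
            "context": {"intent": intent, "slots": dict(slots)},
        }
    # Intent complete: no context is sent, which saves resources.
    return {"instruction": {"type": "execute", "intent": intent, "slots": dict(slots)}}
```

Under these assumptions, a reply for "open the gallery" carries no context, whereas a reply for "I want to make a call" carries the intent and the still-empty slot.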

In some embodiments, when two or more slots are missing, the cloud server 200 not only synchronously sends the intent and slot information to the electronic device 100, but can also mark the slots to indicate to the electronic device 100 the order in which the slots are to be filled. In this way, the electronic device 100 can accurately fill the correct slot when processing the next voice signal.
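A minimal sketch of such slot marking follows, assuming each slot in the dialogue context carries a hypothetical integer `order` field and that the device fills the pending slot with the lowest order first. The field names and helper functions are assumptions introduced for illustration.

```python
from typing import Optional


def next_slot(context: dict) -> Optional[dict]:
    """Return the unfilled slot with the lowest order mark, or None if all are filled."""
    pending = [slot for slot in context["slots"] if slot["value"] is None]
    if not pending:
        return None
    return min(pending, key=lambda slot: slot["order"])


def fill_next(context: dict, value: str) -> Optional[str]:
    """Fill the next pending slot with the value from the latest voice signal.

    Returns the name of the slot that was filled, or None if nothing was pending.
    """
    slot = next_slot(context)
    if slot is None:
        return None
    slot["value"] = value
    return slot["name"]
```

For a context with two empty slots marked 1 (recipient) and 2 (content), the first reply fills the recipient and the second fills the content, regardless of the order in which the slots are listed.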

Next, taking an application scenario of making a phone call as an example, the voice interaction processing method implemented in the scenario of making a phone call in the embodiment of the present application will be described in detail.

As shown in FIG. 6, at time T1 the network quality is good. When the user wants to make a phone call by voice, the user can start the voice assistant application (APP) and input the voice signal "I want to make a call". The electronic device 100 receives the voice signal "I want to make a call" input by the user through the voice assistant application, and performs distribution control on the received voice signal based on preset rules. For example, because the network quality is good, the electronic device 100 uploads the received voice signal "I want to make a call" to the cloud server 200 for processing.

After receiving the voice signal "I want to make a call", the cloud server 200 converts the voice signal into text information according to the speech recognition technology (ASR), obtains the semantic information according to the semantic understanding technology (NLU), and recognizes that the user's intent is to make a phone call. Next, the cloud server 200 determines that the slot information corresponding to the intent to make a call includes the object of the call, and fills the slot according to the semantic information. The cloud server 200 recognizes that the semantic information of "I want to make a call" does not include the object of the call, that is, the cloud server 200 determines that the information of the slot (object of the call) corresponding to the intent (make a call) is vacant.

The cloud server 200 determines that the next action to be performed is to inquire about the vacant slot information. The cloud server 200 generates the voice reply content "Who do you want to call" according to the speech synthesis technology (TTS), and sends an instruction carrying the voice reply content to the electronic device 100, instructing the electronic device 100 to play the voice reply content. In addition, the cloud server 200 synchronously sends the dialog context to the electronic device 100, where the dialog context includes the intent "make a call" and the slot information "object of the call (vacant)". The electronic device 100 receives the instruction and the dialog context sent by the cloud server 200, plays the voice reply content "Who do you want to call" based on the instruction, and saves the dialog context.

Optionally, the cloud server 200 may also send an instruction of text data with the content of the voice reply to the electronic device 100, instructing the electronic device 100 to display the text data (the text content of "who do you want to call").

After the electronic device 100 plays the voice reply content "Who do you want to call", the user inputs the voice signal "Call Xiaoming". At time T2, the network quality is poor, and the electronic device 100 invokes its own voice processing capability to process the voice signal "Call Xiaoming". The electronic device 100 converts the voice signal into text information according to the speech recognition technology (ASR), and obtains the semantic information according to the semantic understanding technology (NLU). Next, the electronic device 100 fills the slot based on the saved intent "make a call" and slot information "object of the call (vacant)", according to the semantic information corresponding to "Call Xiaoming". The electronic device 100 recognizes that "Xiaoming" in the semantic information of "Call Xiaoming" is the object of the call, that is, the electronic device 100 determines that the information of the slot (object of the call) corresponding to the intent (make a call) is "Xiaoming".

The electronic device 100 determines that the next action to be performed is to make a call to Xiaoming, and outputs a voice reply content "calling Xiaoming". The electronic device 100 generates a voice reply content "calling Xiaoming" according to the speech synthesis technology (TTS), and the electronic device 100 plays the voice reply content. In addition, the electronic device 100 queries the contact Xiaoming in the address book, and invokes the call capability to call Xiaoming. Optionally, the electronic device 100 may also display the text data of the voice reply content (the text content of "calling Xiaoming").

The above describes the voice interaction processing method in a phone call scenario. The following takes a smartphone as an example of the electronic device 100, and exemplarily shows some voice interaction processes in combination with specific scenarios. In the process of voice interaction, if the network quality of the electronic device 100 changes from good to poor, the processing of the voice signal is switched from the cloud server 200 to the electronic device 100. Because the cloud server 200 has sent the dialogue context of the voice dialogue and the electronic device 100 has saved it, the electronic device 100 can keep the voice service uninterrupted even if the network is interrupted during multiple rounds of dialogue. As shown in FIG. 7A and FIG. 7B, the wake-up word is set to "Xiaoyi Xiaoyi".

User: Xiaoyi Xiaoyi, I want to call.

Smartphone (electronic device 100): Who do you want to call.

User: Xiaoming.

Smartphone (Electronic Device 100): Okay, I'm calling Xiao Ming for you.

The following describes an implementation form of the voice interaction processing method provided by the embodiment of the present application on a display interface of a smart phone with reference to FIGS. 8A to 8D , taking the above-mentioned voice dialogue as an example.

As shown in FIG. 8A, FIG. 8A shows a voice interaction interface 801, which may be, for example, an interface of a voice assistant application. The voice interaction interface 801 includes a status bar 8011 and a function bar 8012 .

The status bar 8011 may include: one or more signal strength indicators 8013 of wireless network signals, a battery status indicator 8014, and a time indicator 8015. The signal strength indicator 8013 indicates the current network quality (it may also indicate the data transmission rate between the electronic device 100 and the cloud server 200). In FIG. 8A, the signal strength indicator 8013 is full (4 bars), indicating that the current network quality is good.

The function bar 8012 may include one or more function controls, such as a voice input control 8016. When the electronic device 100 detects a user operation on the voice input control 8016, the electronic device 100 receives a voice signal. As shown in FIG. 8A, the electronic device 100 receives the voice signal "Xiaoyi Xiaoyi, I want to make a call", and displays it on the voice interaction interface 801.

As shown in FIG. 8B, after receiving the voice signal "Xiaoyi Xiaoyi, I want to make a call", the electronic device 100 can upload the voice signal to the cloud server 200 for processing, play the voice reply content "Who do you want to call", and display it on the voice interaction interface 802. The voice input control 8016 is transformed into a voice output control 8026, indicating that the electronic device 100 is currently outputting voice. In the embodiment of the present application, when the cloud server 200 returns the instruction, it synchronously returns the voice dialogue context to the electronic device 100, and the electronic device 100 receives and saves the dialogue context.

The user continues to input voice. As shown in FIG. 8C, the current network quality of the electronic device 100 is poor, and the signal strength indicator 8033 shows only two bars; the electronic device 100 and the cloud server 200 cannot perform data transmission, or the data transmission rate is too low. When the electronic device 100 receives the voice signal "Xiao Ming", the electronic device 100 cannot upload the voice signal to the cloud server 200 for processing, or the cloud server 200 cannot issue an instruction to the electronic device 100. At this time, the electronic device 100 can continue to process the voice signal "Xiao Ming" based on the saved dialogue context, play the voice reply content "Okay, I'm calling Xiao Ming for you", and display it on the voice interaction interface 803. When the action of making a call is performed, the electronic device 100 jumps to the call interface. FIG. 8D shows a call interface 804 indicating that the electronic device 100 is currently calling Xiao Ming.

The above is an application scenario in which the voice service is a multi-round dialogue (specifically, two rounds of dialogue). During the dialogue, the network quality of the electronic device 100 changes from good to poor, and the processing of the voice signal is switched from the cloud server 200 to the electronic device 100. Because the cloud server 200 has delivered the voice dialogue context and the electronic device 100 has saved it, the electronic device 100 can still carry on the voice service without interruption even if the network is interrupted during multiple rounds of dialogue, which improves the processing efficiency of voice services.

Next, the embodiment of the present application further provides an application scenario of three-round dialogue. Taking the application scenario of sending short messages as an example, the voice interaction processing method implemented in the scenario of sending short messages in the embodiments of the present application is briefly described.

When the network quality is good, the electronic device 100 receives the voice signal "I want to send a text message" input by the user and performs distribution control on the received voice signal based on preset rules; the voice signal "I want to send a text message" is uploaded to the cloud server 200 for processing.

The cloud server 200 recognizes that the user's intent is to send a text message. Next, the cloud server 200 determines that the slot information corresponding to the intent to send a text message includes the recipient of the text message and the content of the text message, and fills the slots according to the semantic information. The cloud server 200 recognizes that the semantic information of "I want to send a text message" includes neither the recipient nor the content of the text message; that is, the cloud server 200 determines that the slots (recipient of the text message, content of the text message) corresponding to the intent (send text message) are vacant.

The cloud server 200 determines that the next action to be performed is to inquire about the vacant slot information. Since two slots are vacant, the cloud server 200 can inquire about one of them according to priority, for example, first asking for the recipient of the text message. The cloud server 200 generates the voice reply content "Who do you want to text" using speech synthesis technology (TTS), sends an instruction carrying the voice reply content to the electronic device 100, and instructs the electronic device 100 to play the voice reply content. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100, where the dialogue context includes the intent "send text message" and the slot information "recipient of the text message (vacant), content of the text message (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "Who do you want to text" based on the instruction, and saves the dialogue context.
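A minimal Python sketch of this server-side step follows, assuming a hypothetical priority table and prompt table; the names `SLOT_PRIORITY`, `PROMPTS`, `next_vacant_slot`, and `build_cloud_response`, as well as the dictionary layout of the response, are illustrative and not taken from the embodiment. Speech synthesis is omitted: the reply is kept as text.

```python
from typing import Dict, List, Optional

# Hypothetical schema: for each intent, slot names in inquiry-priority order.
SLOT_PRIORITY: Dict[str, List[str]] = {
    "send_text_message": ["recipient", "content"],
}

# Hypothetical prompt text used to inquire about each vacant slot.
PROMPTS: Dict[str, str] = {
    "recipient": "Who do you want to text",
    "content": "What do you want to send",
}


def next_vacant_slot(intent: str, slots: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the highest-priority slot that is still vacant, or None."""
    for name in SLOT_PRIORITY[intent]:
        if slots.get(name) is None:
            return name
    return None


def build_cloud_response(intent: str, slots: Dict[str, Optional[str]]) -> dict:
    """Bundle the reply text with the dialogue context sent to the device."""
    pending = next_vacant_slot(intent, slots)
    return {
        "reply": PROMPTS[pending] if pending is not None else None,
        "context": {"intent": intent, "slots": dict(slots)},
    }
```

Sending the context in the same response as the reply means the device always holds the state that matches the question it has just played.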

Next, after the electronic device 100 plays the voice reply content "Who do you want to text", the user inputs the voice signal "to Xiao Ming". If the network quality is good at this time, the electronic device 100 uploads the received voice signal "to Xiao Ming" to the cloud server 200 for processing. The cloud server 200 fills the slots based on the stored intent "send text message", the slot information "recipient of the text message (vacant), content of the text message (vacant)", and the semantic information corresponding to "to Xiao Ming". The cloud server 200 recognizes that "Xiao Ming" in the semantic information of "to Xiao Ming" is the recipient of the text message; that is, the cloud server 200 determines that the value of the slot (recipient of the text message) corresponding to the intent (send text message) is "Xiao Ming".

Since the slot information "content of the text message" is still vacant at this time, the cloud server 200 saves the current intent and slot information and determines that the next action to be performed is to inquire about the vacant slot information (content of the text message). The cloud server 200 generates the voice reply content "What do you want to send" using speech synthesis technology (TTS), sends an instruction carrying the voice reply content to the electronic device 100, and instructs the electronic device 100 to play the voice reply content. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100. At this time, the dialogue context includes the intent (send text message) and the slot information "recipient of the text message (Xiao Ming), content of the text message (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "What do you want to send" based on the instruction, and saves the dialogue context.

In some embodiments, after the electronic device 100 plays the voice reply content "Who do you want to text", the user inputs the voice signal "to Xiao Ming". If the network quality is poor at this time, the electronic device 100 invokes its own voice processing capability to process the voice signal "to Xiao Ming". The electronic device 100 converts the voice signal into text information using speech recognition technology (ASR) and obtains the semantic information using semantic understanding technology (NLU). Next, the electronic device 100 fills the slots based on the stored intent "send text message", the slot information "recipient of the text message (vacant), content of the text message (vacant)", and the semantic information corresponding to "to Xiao Ming". The electronic device 100 recognizes that "Xiao Ming" in the semantic information of "to Xiao Ming" is the recipient of the text message; that is, the electronic device 100 determines that the value of the slot (recipient of the text message) corresponding to the intent (send text message) is "Xiao Ming".

In the above example, when the cloud server 200 synchronously sends the intent and slot information to the electronic device 100, since two slots are vacant, the cloud server 200 can mark the slot that is to be filled next. Then, when the electronic device 100 fills the slot, it can fill the marked slot directly, without judging which slot the semantic information corresponds to. That is, the electronic device 100 can directly determine that the value of the slot (recipient of the text message) corresponding to the intent (send text message) is "Xiao Ming".
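The slot-marking idea can be sketched in Python as follows. The key `next_slot` and the function `fill_marked_slot` are hypothetical names chosen for this illustration; the source does not specify how the mark is encoded. After filling, the sketch moves the mark to the next vacant slot in dictionary order, which is also an assumption.

```python
from typing import Dict, Optional


def fill_marked_slot(context: dict, value: str) -> dict:
    """Fill the slot the server marked as next, without classifying the value.

    `context` is assumed to look like:
    {"intent": str, "slots": {name: value or None}, "next_slot": name or None}
    A new context is returned; the input is left unchanged.
    """
    marked: Optional[str] = context.get("next_slot")
    slots: Dict[str, Optional[str]] = dict(context["slots"])
    if marked is not None and slots.get(marked) is None:
        slots[marked] = value
    # Move the mark to the next vacant slot, if any remains.
    remaining = [name for name, v in slots.items() if v is None]
    return {
        "intent": context["intent"],
        "slots": slots,
        "next_slot": remaining[0] if remaining else None,
    }
```

With the mark in place, the device-side model only has to extract a value from the utterance; it does not have to decide whether "Xiao Ming" is a recipient or message content.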

Since the slot information "content of the text message" is still vacant at this time, the electronic device 100 saves the current intent and slot information and determines that the next action to be performed is to inquire about the vacant slot information (content of the text message). The electronic device 100 generates the voice reply content "What do you want to send" using speech synthesis technology (TTS) and plays the voice reply content.

The voice signal received next by the electronic device 100 is processed and slot filling is performed again, until the slot information is completely filled and an instruction to execute the intent is generated, and the electronic device 100 determines that the intent has been executed.
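The repeated fill-and-ask cycle can be sketched as a simple loop. This is an illustrative simplification: the function name `run_until_filled`, the prompt wording, and the rule that each utterance fills the first vacant slot are assumptions, and recognition is reduced to using the utterance text as the slot value.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def run_until_filled(
    intent: str,
    slots: Dict[str, Optional[str]],
    utterances: Iterable[str],
) -> Tuple[Optional[dict], List[str]]:
    """Fill vacant slots turn by turn; return (instruction or None, prompts asked)."""
    filled = dict(slots)
    prompts: List[str] = []
    for utterance in utterances:
        vacant = [name for name, value in filled.items() if value is None]
        if not vacant:
            break
        filled[vacant[0]] = utterance
        still_vacant = [name for name, value in filled.items() if value is None]
        if still_vacant:
            # Another round is needed: ask for the next vacant slot.
            prompts.append(f"Please provide: {still_vacant[0]}")
    if any(value is None for value in filled.values()):
        return None, prompts  # dialogue is not finished yet
    # All slots are filled: generate the instruction that executes the intent.
    return {"intent": intent, "slots": filled}, prompts
```

The instruction is produced only when no slot is vacant, which mirrors the condition under which the dialogue ends.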

The present application provides a voice interaction processing method, as shown in FIG. 9 , the method includes:

The electronic device 100 establishes a connection with the cloud server 200.

Step S101: The electronic device 100 receives the first voice signal.

The first voice signal may be, for example, the voice 1 in the above-mentioned FIG. 5A or FIG. 5B , or the voice “I want to make a call” in FIG. 6 .

Step S102 : the electronic device 100 uploads the first voice signal to the cloud server 200 .

Step S103: The cloud server 200 identifies the first voice signal, obtains the corresponding intent and one or more slot information corresponding to the intent, and determines the content of the first voice reply based on the intent and the one or more slot information.

Step S104 : the cloud server 200 sends the first voice reply content, intent and one or more slot information to the electronic device 100 .

Step S105: The electronic device 100 outputs the first voice reply content, and saves the intent and one or more slot information.

The first voice reply content can be, for example, the voice reply content included in Action 1 in FIG. 5A , or the voice “who do you want to call” in FIG. 6 .

The communication quality between the electronic device 100 and the cloud server 200 is poor.

Step S106: The electronic device 100 receives the second voice signal.

The second voice signal may be, for example, the voice 2 in the above-mentioned FIG. 5A or FIG. 5B, or may be the voice "call Xiaoming" in FIG. 6.

Step S107: The electronic device 100 recognizes the second voice signal, obtains corresponding semantic information, and determines the first operation based on the intent and one or more slot information and semantic information.

Step S108: The electronic device 100 performs the first operation.

The first operation may be, for example, Action 2 in FIG. 5A or FIG. 5B, or may be one or more of the following three actions in FIG. 6: playing the voice content "Calling Xiaoming", displaying the text content "Calling Xiaoming", and executing the call to Xiaoming.

In some embodiments, poor communication quality between the electronic device 100 and the cloud server 200 may occur at any time period between steps S106 and S107.
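The claims of this application characterize poor communication quality as either a failed upload or the absence of reply data within a preset time. A minimal Python sketch of such a check follows; the function name `upload_or_none`, the value of `PRESET_TIME_S`, and the use of `OSError` subclasses to signal a failed upload are assumptions made for illustration only. `slow_upload` is a stand-in for a congested link.

```python
import concurrent.futures
import time
from typing import Callable, Optional

PRESET_TIME_S = 0.2  # hypothetical "preset time"; the source gives no value


def upload_or_none(
    upload: Callable[[bytes], dict],
    signal: bytes,
    preset_time_s: float = PRESET_TIME_S,
) -> Optional[dict]:
    """Return the cloud reply, or None when communication is treated as poor."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(upload, signal)
    try:
        return future.result(timeout=preset_time_s)
    except concurrent.futures.TimeoutError:
        return None  # no reply data within the preset time
    except OSError:
        return None  # upload failed (for example, ConnectionError)
    finally:
        # Do not block on a stalled upload; the worker finishes in the background.
        executor.shutdown(wait=False)


def slow_upload(signal: bytes) -> dict:
    """Stand-in for a congested link: replies only after 0.5 s."""
    time.sleep(0.5)
    return {"reply": "late"}
```

A `None` result is the point at which, in this sketch, the device would switch to local recognition of the second voice signal, for example when `upload_or_none(slow_upload, b"...", preset_time_s=0.05)` times out.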

In a possible implementation manner, that the electronic device 100 recognizes the second voice signal, obtains corresponding semantic information, and determines the first operation based on the intent, the one or more slot information, and the semantic information includes: the electronic device 100 identifies that the semantic information matches one of the missing slots in the one or more slot information, and fills the value of that slot with the semantic information; the electronic device 100 determines the first operation based on the intent and the filled one or more slot information. This describes in detail how the electronic device continues to process the original voice service based on the second voice signal. Because the electronic device has obtained the intent and slot information corresponding to the first voice signal, it can perform slot filling on the second voice signal it receives, and thus can continue processing the original voice service.
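A minimal Python sketch of this matching and filling step is shown below. The table `SLOT_TYPES`, the shape of the semantic information (`{"type": ..., "value": ...}`), and the function names are hypothetical; the source does not specify how a slot is matched to the semantic information.

```python
from typing import Dict, Optional

# Hypothetical mapping from slot name to the entity type it accepts.
SLOT_TYPES: Dict[str, str] = {"recipient": "person", "content": "message_text"}


def fill_matching_slot(
    slots: Dict[str, Optional[str]], semantic: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Fill the first vacant slot whose accepted type matches the semantic info."""
    filled = dict(slots)
    for name, value in filled.items():
        if value is None and SLOT_TYPES.get(name) == semantic["type"]:
            filled[name] = semantic["value"]
            break
    return filled


def determine_operation(intent: str, slots: Dict[str, Optional[str]]) -> dict:
    """Ask for the next missing slot, or execute the intent if none is missing."""
    missing = [name for name, value in slots.items() if value is None]
    if missing:
        return {"type": "ask", "slot": missing[0]}
    return {"type": "execute", "intent": intent, "slots": slots}
```

In this sketch the first operation is either a follow-up question or the execution of the intent, depending on whether any slot remains vacant after filling.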

In a possible implementation manner, the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface. The second voice reply content may be, for example, the voice reply content included in Action 2 in FIG. 5A or FIG. 5B , or may be the voice “calling Xiaoming” in FIG. 6 .

Embodiments of the present application also provide a computer-readable storage medium. The methods described in the above method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media can include both computer storage media and communication media and also include any medium that can transfer a computer program from one place to another. A storage medium can be any available medium that can be accessed by a computer.

The embodiments of the present application also provide a computer program product. The methods described in the above method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the above-mentioned computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the above-mentioned method embodiments are generated. The aforementioned computers may be general purpose computers, special purpose computers, computer networks, network equipment, user equipment, or other programmable devices.

The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives), and the like.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. The aforementioned storage medium includes: read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Claims

A voice interaction processing method, characterized in that the method comprises:

The electronic device receives the first voice signal;

When the electronic device establishes a connection with the cloud server, the electronic device uploads the first voice signal to the cloud server;

The electronic device receives the first voice reply content, the intent, and one or more slot information corresponding to the intent sent by the cloud server, where the intent and the one or more slot information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more slot information;

After the electronic device outputs the first voice reply content, it receives a second voice signal;

In the case that the communication quality between the electronic device and the cloud server is poor, the electronic device recognizes the second voice signal to obtain corresponding semantic information, and determines the first operation based on the intent, the one or more slot information, and the semantic information;

The electronic device performs the first operation.
The method according to claim 1, wherein the electronic device determines the first operation based on the intent, the one or more slot information and the semantic information, comprising:

The electronic device identifies that the semantic information matches a missing slot in the one or more slot information, and fills the value of the slot with the semantic information;

The electronic device determines a first operation based on the intent and the filled one or more slot information.
The method according to claim 1 or 2, wherein the first operation comprises one or more of the following:

Play the second voice reply content;

displaying the text content of the second voice reply content;

Jump to the corresponding interface.
The method according to any one of claims 1-3, wherein the method further comprises:

The electronic device receives the first instruction sent by the cloud server;

The electronic device displays the text content of the first voice reply content based on the first instruction, and/or jumps to a corresponding interface.
The method according to any one of claims 1-4, wherein the communication quality between the electronic device and the cloud server is poor, comprising:

The electronic device fails to upload the second voice signal to the cloud server; or

After the electronic device uploads the first voice signal to the cloud server, it does not receive reply data from the cloud server within a preset time.
The method according to any one of claims 1-5, wherein the electronic device receives the first voice signal, comprising:

The electronic device receives the first voice signal through a voice assistant application.
A voice interaction processing method, characterized in that the method comprises:

The cloud server receives the first voice signal uploaded by the electronic device;

The cloud server recognizes the first voice signal, obtains the corresponding intent and one or more slot information corresponding to the intent, and determines the first voice reply content based on the intent and the one or more slot information;

The cloud server sends the first voice reply content, the intent and the one or more slot information to the electronic device.
The method according to claim 7, wherein the cloud server sends the first voice reply content, the intention and the one or more slot information to the electronic device, comprising:

In the case that at least one slot information in the one or more slot information is missing, the cloud server sends the first voice reply content, the intent, and the one or more slot information to the electronic device.
An electronic device, characterized by comprising: one or more processors and one or more memories; the one or more memories are respectively coupled with the one or more processors; the one or more memories are used for storing computer program code, the computer program code comprising computer instructions; when the computer instructions run on the processor, the electronic device is caused to execute:

receiving a first voice signal;

In the case of establishing a connection with the cloud server, uploading the first voice signal to the cloud server;

Receiving the first voice reply content, the intent, and one or more slot information corresponding to the intent sent by the cloud server, where the intent and the one or more slot information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more slot information;

After outputting the first voice reply content, receive a second voice signal;

In the case of poor communication quality with the cloud server, recognizing the second voice signal to obtain corresponding semantic information, and determining the first operation based on the intent, the one or more slot information, and the semantic information;

The first operation is performed.
The electronic device according to claim 9, wherein the determining the first operation based on the intent and the one or more slot information and the semantic information comprises:

Identifying that the semantic information matches one of the missing slots in the one or more slot information, and filling the value of the slot with the semantic information;

A first operation is determined based on the intent and the filled one or more slot information.
The electronic device according to claim 9 or 10, wherein the first operation includes one or more of the following:

Play the second voice reply content;

displaying the text content of the second voice reply content;

Jump to the corresponding interface.
The electronic device according to any one of claims 9-11, wherein the electronic device further executes:

receiving the first instruction sent by the cloud server;

Display the text content of the first voice reply content based on the first instruction, and/or jump to a corresponding interface.
The electronic device according to any one of claims 9-12, wherein the quality of the communication with the cloud server is poor, comprising:

Uploading the second voice signal to the cloud server fails; or

After the first voice signal is uploaded to the cloud server, no reply data from the cloud server is received within a preset time.
The electronic device according to any one of claims 9-13, wherein the electronic device receives the first voice signal, comprising:

The first voice signal is received through a voice assistant application.
A cloud server, characterized by comprising: one or more processors and one or more memories; the one or more memories are respectively coupled with the one or more processors; the one or more memories are used for storing computer program code, the computer program code comprising computer instructions; when the computer instructions are executed on the processor, the cloud server is caused to execute:

receiving the first voice signal uploaded by the electronic device;

Recognizing the first voice signal, obtaining the corresponding intent and one or more slot information corresponding to the intent, and determining the first voice reply content based on the intent and the one or more slot information;

Sending the first voice reply content, the intent, and the one or more slot information to the electronic device.
The cloud server according to claim 15, wherein the sending the first voice reply content, the intention and the one or more slot information to the electronic device comprises:

In the case that at least one slot information in the one or more slot information is missing, send the first voice reply content and the intention and the one or more slot information to the electronic device.
A computer-readable medium storing one or more programs, wherein the one or more programs are configured to be executed by one or more processors, the one or more programs comprising instructions, and the instructions are for performing the method of any one of claims 1-8.