WO2022143258A1 - Voice interaction processing method and related apparatus - Google Patents

Info

Publication number
WO2022143258A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
voice
cloud server
voice signal
intent
Prior art date
Application number
PCT/CN2021/139631
Other languages
French (fr)
Chinese (zh)
Inventor
黄龙
王翃宇
李勇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022143258A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a voice interaction processing method and related apparatus.
  • Voice interaction means that a user obtains a voice or text response by inputting voice or text. For example, the user says "How is the weather today", and the smart device replies by voice, "The weather is fine, 25 to 29 degrees".
  • Current voice interaction systems need network support. In some cases (such as a network interruption), the ongoing voice interaction service cannot continue to be executed, which affects the user experience.
  • Embodiments of the present application provide a voice interaction processing method and a related apparatus, so as to solve the problem of voice service interruption in multi-round dialogues and improve voice service processing capability.
  • In a first aspect, the present application provides a voice interaction processing method applied to an electronic device, including: the electronic device receives an input first voice signal; when the electronic device has established a connection with a cloud server, the electronic device uploads the first voice signal to the cloud server; the electronic device receives, from the cloud server, first voice reply content, an intent, and one or more pieces of slot information corresponding to the intent, where the intent and the one or more pieces of slot information are recognized from the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more pieces of slot information; after outputting the first voice reply content, the electronic device receives a second voice signal; when the communication quality between the electronic device and the cloud server is poor, the electronic device recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation based on the intent, the one or more pieces of slot information, and the semantic information; the electronic device executes the first operation.
  • In the embodiments of the present application, a voice service can be processed by either the cloud server or the electronic device. When the cloud server processes voice data, it sends corresponding instructions to the electronic device, instructing the electronic device to perform corresponding actions, and at the same time sends the context (intent and slot information) of the voice dialogue. In this way, when the voice service is switched from the cloud server to the electronic device for processing, the electronic device can continue to execute the original voice service based on the context of the voice dialogue and the next received voice signal, which solves the problem of voice service interruption in multi-round dialogues and improves voice service processing capability.
  • The electronic device determining the first operation based on the intent, the one or more pieces of slot information, and the semantic information includes: the electronic device identifies that the semantic information matches a missing slot among the one or more pieces of slot information, and fills in the semantic information as the value of that slot; the electronic device then determines the first operation based on the intent and the filled slot information.
  • This describes in detail how the electronic device processes the original voice service based on the second voice signal. Since the electronic device has already obtained the intent and slot information corresponding to the first voice signal, it can perform slot filling on the subsequently received second voice signal based on that intent and slot information, and thereby continue processing the original voice service.
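The on-device slot filling described above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the `DialogueContext` class, the function names, the `make_call` intent, the `contact` slot, and the convention that `None` marks a missing slot are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueContext:
    """Context received from the cloud server with the first reply (hypothetical shape)."""
    intent: str
    # Assumption: a slot whose value is None is a missing slot.
    slots: Dict[str, Optional[str]] = field(default_factory=dict)


def fill_missing_slot(ctx: DialogueContext, slot_name: str, value: str) -> bool:
    """Fill slot_name with value if that slot exists and is still missing."""
    if slot_name in ctx.slots and ctx.slots[slot_name] is None:
        ctx.slots[slot_name] = value
        return True
    return False


def determine_operation(ctx: DialogueContext) -> str:
    """Describe the operation to run, or ask for the first slot that is still missing."""
    missing = [name for name, value in ctx.slots.items() if value is None]
    if missing:
        return f"ask:{missing[0]}"
    filled = ",".join(f"{k}={v}" for k, v in sorted(ctx.slots.items()))
    return f"execute:{ctx.intent}({filled})"


# First round (handled by the cloud): the intent is known, the contact slot is missing.
ctx = DialogueContext(intent="make_call", slots={"contact": None})
before = determine_operation(ctx)

# Second round (handled locally): the recognized semantic information names a contact.
fill_missing_slot(ctx, "contact", "Alice")
operation = determine_operation(ctx)
```

Keeping the context in a small, serializable structure is one way the cloud server could hand over the dialogue state in the same message as the voice reply.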
  • the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
  • the method further includes: the electronic device receives the first instruction sent by the cloud server; the electronic device displays the text content of the first voice reply content based on the first instruction, and/or jumps to a corresponding interface.
  • The poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, reply data from the cloud server is not received within a set time.
  • This describes when the poor communication quality may occur: it may be when the electronic device uploads the second voice signal, or when the cloud server delivers the voice reply content for the second voice signal.
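The two conditions above (upload failure, or no reply within the set time) can be sketched as a fallback decision. This is a minimal sketch under stated assumptions: `UploadError`, `handle_second_signal`, the callable signatures, and the convention that the uploader returns `None` when no reply arrives in time are all hypothetical, and the stub functions stand in for real network and recognition code.

```python
from typing import Callable, Optional


class UploadError(Exception):
    """Raised by the (hypothetical) uploader when the voice signal cannot be sent."""


def handle_second_signal(
    signal: bytes,
    upload: Callable[[bytes, float], Optional[str]],
    recognize_locally: Callable[[bytes], str],
    timeout_s: float = 2.0,
) -> str:
    """Use the cloud reply when available; otherwise fall back to local recognition."""
    try:
        # Assumption: upload returns None if no reply arrives within timeout_s.
        reply = upload(signal, timeout_s)
    except UploadError:
        reply = None
    if reply is None:
        return "local:" + recognize_locally(signal)
    return "cloud:" + reply


def failing_upload(signal: bytes, timeout_s: float) -> Optional[str]:
    raise UploadError("network unreachable")


def silent_upload(signal: bytes, timeout_s: float) -> Optional[str]:
    return None  # stands in for "no reply within the set time"


def healthy_upload(signal: bytes, timeout_s: float) -> Optional[str]:
    return "reply from cloud"


def fake_local_recognizer(signal: bytes) -> str:
    return signal.decode("utf-8")


results = [
    handle_second_signal(b"tomorrow", uploader, fake_local_recognizer)
    for uploader in (failing_upload, silent_upload, healthy_upload)
]
```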
  • the electronic device receiving the first voice signal includes: the electronic device receives the first voice signal through a voice assistant application.
  • In a second aspect, the present application provides a voice interaction processing method applied to a cloud server, including: the cloud server receives a first voice signal uploaded by an electronic device; the cloud server recognizes the first voice signal to obtain a corresponding intent and one or more pieces of slot information corresponding to the intent, and determines first voice reply content based on the intent and the one or more pieces of slot information; the cloud server sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.
  • When the cloud server sends an instruction to the electronic device instructing it to perform a corresponding action, it simultaneously sends the context (intent and slot information) of the voice dialogue to the electronic device.
  • In this way, when the voice service is switched from the cloud server to the electronic device for processing, the electronic device can continue to execute the original voice service based on the context of the voice dialogue and the next received voice signal, which solves the problem of voice service interruption in multi-round dialogues and improves voice service processing capability.
  • The cloud server sending the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device includes: when at least one of the one or more pieces of slot information is missing, the cloud server sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.
  • This describes the condition under which the cloud server sends the intent and slot information. When slot information is missing, the current voice service is determined to be a multi-round dialogue service, and the cloud server sends the intent and slot information to the electronic device. If no slot information is missing, the voice service can be completed in a single round and there is no need to obtain a next voice signal. Deciding in this way whether to deliver the intent and slot information saves resources.
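The server-side condition described above can be sketched as follows. This is only an illustration: the payload keys, the function name `build_device_payload`, the example intents, and the use of `None` to mark a missing slot are assumptions, not a message format defined by the source.

```python
from typing import Any, Dict, Optional


def build_device_payload(
    reply_text: str,
    intent: str,
    slots: Dict[str, Optional[str]],
) -> Dict[str, Any]:
    """Attach the dialogue context only when at least one slot is still missing."""
    payload: Dict[str, Any] = {"reply": reply_text}
    if any(value is None for value in slots.values()):
        # Multi-round dialogue: the device may need the context to continue locally.
        payload["intent"] = intent
        payload["slots"] = dict(slots)
    return payload


# A slot is missing, so the context travels with the reply.
multi_round = build_device_payload(
    "Who would you like to call?", "make_call", {"contact": None}
)

# All slots are filled, so only the reply is sent.
single_round = build_device_payload(
    "The weather is fine, 25 to 29 degrees", "query_weather", {"date": "today"}
)
```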
  • the present application provides a voice interaction processing system, the voice interaction processing system includes an electronic device and a cloud server, wherein,
  • the electronic device is further configured to upload the first voice signal to the cloud server when the electronic device establishes a connection with the cloud server;
  • the cloud server is used to identify the first voice signal, obtain the corresponding intention and one or more slot information corresponding to the intention, and determine the content of the first voice reply based on the intention and the one or more slot information;
  • the cloud server is further configured to send the first voice reply content, intent and one or more slot information to the electronic device;
  • the electronic device is further configured to receive the second voice signal after outputting the content of the first voice reply;
  • the electronic device is also used for identifying the second voice signal in the case of poor communication quality between the electronic device and the cloud server, to obtain corresponding semantic information, and based on the intent and one or more slot information and semantic information, determine the first operation;
  • the electronic device is further configured to perform the first operation.
  • In the embodiments of the present application, a voice service can be processed by either the cloud server or the electronic device. When the cloud server processes voice data, it sends corresponding instructions to the electronic device, instructing the electronic device to perform corresponding actions, and at the same time sends the context (intent and slot information) of the voice dialogue. In this way, when the voice service is switched from the cloud server to the electronic device for processing, the electronic device can continue to execute the original voice service based on the context of the voice dialogue and the next received voice signal, which solves the problem of voice service interruption in multi-round dialogues and improves voice service processing capability.
  • The electronic device is further configured to identify that the semantic information matches a missing slot among the one or more pieces of slot information, and to fill in the semantic information as the value of that slot; the electronic device is further configured to determine the first operation based on the intent and the filled slot information.
  • This describes in detail how the electronic device processes the original voice service based on the second voice signal. Since the electronic device has already obtained the intent and slot information corresponding to the first voice signal, it can perform slot filling on the subsequently received second voice signal based on that intent and slot information, and thereby continue processing the original voice service.
  • the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
  • the electronic device is further configured to receive the first instruction sent by the cloud server; the electronic device is further configured to display the text content of the first voice reply content based on the first instruction, and/or jump to corresponding interface.
  • The poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, reply data from the cloud server is not received within a set time.
  • This describes when the poor communication quality may occur: it may be when the electronic device uploads the second voice signal, or when the cloud server delivers the voice reply content for the second voice signal.
  • the electronic device is further configured to receive the first voice signal through a voice assistant application.
  • the cloud server is further configured to send, to the electronic device, the first voice reply content, intent and one or more slot information.
  • This describes the condition under which the cloud server sends the intent and slot information. When slot information is missing, the current voice service is determined to be a multi-round dialogue service, and the cloud server sends the intent and slot information to the electronic device. If no slot information is missing, the voice service can be completed in a single round and there is no need to obtain a next voice signal. Deciding in this way whether to deliver the intent and slot information saves resources.
  • The present application provides an electronic device, comprising: one or more processors and one or more memories; the one or more memories are coupled with the one or more processors; the one or more memories are used for storing computer program code, the computer program code including computer instructions; when the computer instructions are executed on the processor, the electronic device is caused to execute the voice interaction processing method in any possible implementation manner of the first aspect.
  • The present application provides a cloud server, comprising: one or more processors and one or more memories; the one or more memories are coupled with the one or more processors; the one or more memories are used for storing computer program code, the computer program code including computer instructions; when the computer instructions are executed on the processor, the cloud server is caused to execute the voice interaction processing method in any possible implementation manner of the second aspect.
  • An embodiment of the present application provides a computer storage medium including computer instructions which, when run on an electronic device, cause the electronic device to perform the voice interaction processing method in any possible implementation manner of any of the above aspects.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a computer, enables the computer to execute the voice interaction processing method in any possible implementation manner of any one of the foregoing aspects.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a software structure of an electronic device provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the principle of a voice interaction processing method provided by an embodiment of the present application.
  • FIGS. 5A-5B are schematic diagrams of the principle of still another voice interaction processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the principle of a phone call scenario provided by an embodiment of the present application.
  • FIGS. 7A-7B are schematic diagrams of scenarios of a voice interaction processing method provided by an embodiment of the present application.
  • FIGS. 8A to 8D are schematic diagrams of a group of application interfaces provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a voice interaction processing method provided by an embodiment of the present application.
  • The terms “first” and “second” are used for descriptive purposes only, and should not be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined as “first” or “second” may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, “multiple” means two or more. The orientation or positional relationship indicated by the terms “middle”, “left”, “right”, “upper”, “lower”, etc.
  • FIG. 1 shows a schematic diagram of a scenario of a voice interaction system 10 according to an embodiment of the present invention.
  • the system 10 includes an electronic device 100 and a cloud server 200 .
  • the system 10 shown in FIG. 1 is only an example, and those skilled in the art can understand that in practical applications, the system 10 usually includes a plurality of electronic devices 100 and a cloud server 200 .
  • the numbers of electronic devices 100 and cloud servers 200 are not limited.
  • the electronic device 100 is a smart device with a voice interaction function.
  • the electronic device 100 can receive a voice instruction issued by a user and return voice or non-voice information to the user.
  • The electronic device 100 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), a virtual reality device, a portable Internet device, a data storage device, a camera, a wearable device (e.g., a wireless headset, smart watch, smart bracelet, smart glasses, head-mounted display (HMD), electronic clothing, electronic bracelet, electronic necklace, electronic accessory, electronic tattoo or smart mirror), or a smart home device (such as a smart speaker, smart refrigerator, smart desk lamp, electric light, smart TV, smart microwave oven, smart fan, air conditioner, smart robot or smart curtain), and so on.
  • An application scenario involved in the embodiments of this application is a home scenario: the electronic device 100 is placed in the user's home, and the user can send voice instructions to the electronic device 100 to implement certain functions, such as surfing the Internet, playing songs on demand, shopping, checking the weather forecast, controlling other smart home devices in the home, and more.
  • The cloud server 200, which may be, for example, a cloud server physically located at one or more locations, communicates with the electronic device 100 through a network.
  • the cloud server 200 provides a recognition service for the voice data received on the electronic device 100, so as to obtain a text representation of the voice data input by the user; the cloud server 200 also obtains the representation of the user's intention based on the text representation, and generates a response command, which is returned to the electronic device 100.
  • the electronic device 100 performs corresponding actions according to the response instruction to provide the user with corresponding services, such as setting an alarm clock, making a phone call, sending an email, broadcasting information, playing a song, a video, and the like.
  • the electronic device 100 may also output a corresponding voice response to the user according to the response instruction, or display corresponding text content, which is not limited in this embodiment of the present application.
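How a device might act on a response instruction can be sketched with a small dispatch table. The source does not define an instruction format, so the `actions` list, the action types (`speak`, `display`, `open_interface`), and the function name `perform_response` are all hypothetical; the handlers only record what they would do.

```python
from typing import Callable, Dict, List


def perform_response(instruction: dict, log: List[str]) -> None:
    """Run each action in a (hypothetical) response instruction, recording it in log."""
    handlers: Dict[str, Callable[[str], None]] = {
        "speak": lambda text: log.append(f"speak: {text}"),
        "display": lambda text: log.append(f"display: {text}"),
        "open_interface": lambda name: log.append(f"open: {name}"),
    }
    for action in instruction.get("actions", []):
        handler = handlers.get(action["type"])
        if handler is None:
            # Unknown action types are skipped rather than treated as errors.
            log.append(f"ignored: {action['type']}")
            continue
        handler(action["value"])


performed: List[str] = []
perform_response(
    {
        "actions": [
            {"type": "speak", "value": "Alarm set for 7 am"},
            {"type": "open_interface", "value": "alarm_clock"},
            {"type": "vibrate", "value": "short"},
        ]
    },
    performed,
)
```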
  • FIG. 2 shows a schematic structural diagram of an exemplary electronic device 100 provided by an embodiment of the present application.
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device 100 .
  • The electronic device 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100 .
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • The memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs the instructions or data again, it can fetch them directly from this memory, which avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
  • the processor 110 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 110 may contain multiple sets of I2C buses.
  • the processor 110 can be respectively coupled to the touch sensor 180K, the charger, the flash, the camera 193 and the like through different I2C bus interfaces.
  • the processor 110 may couple the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate with each other through the I2C bus interface, so as to realize the touch function of the electronic device 100 .
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the audio module 170 can transmit audio signals to the wireless communication module 160 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
  • the PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals.
  • the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface.
  • the audio module 170 can also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
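The three PCM steps named above (sampling, quantizing, encoding) can be illustrated in software. This sketch is not how the PCM bus interface itself works; it only shows the signal processing idea. The function name `pcm_encode`, the 8 kHz sample rate, the 16-bit signed little-endian format, and the 440 Hz test tone are all assumptions chosen for the example.

```python
import math
import struct
from typing import Callable


def pcm_encode(
    signal: Callable[[float], float],
    sample_rate_hz: int,
    duration_s: float,
) -> bytes:
    """Sample signal(t), quantize to 16-bit signed integers, and pack little-endian."""
    max_amplitude = 2 ** 15 - 1  # largest positive 16-bit signed value
    sample_count = round(sample_rate_hz * duration_s)
    samples = []
    for i in range(sample_count):
        t = i / sample_rate_hz                   # sampling: evaluate at discrete times
        x = max(-1.0, min(1.0, signal(t)))       # clip to the representable range
        samples.append(round(x * max_amplitude)) # quantizing: map to integer levels
    return struct.pack(f"<{sample_count}h", *samples)  # encoding: 2 bytes per sample


# 10 ms of a 440 Hz tone at 8 kHz gives 80 samples, i.e. 160 bytes.
tone = pcm_encode(lambda t: math.sin(2 * math.pi * 440 * t), 8000, 0.01)

# A constant full-scale input maps to the maximum quantization level.
full_scale = pcm_encode(lambda t: 1.0, 8000, 0.001)
```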
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • a UART interface is typically used to connect the processor 110 with the wireless communication module 160 .
  • the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function.
  • the audio module 170 can transmit audio signals to the wireless communication module 160 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
  • MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 110 communicates with the camera 193 through a CSI interface, so as to realize the photographing function of the electronic device 100 .
  • the processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the electronic device 100 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like.
  • the GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100 .
  • The electronic device 100 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110 , the internal memory 121 , the external memory, the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the electronic device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor, and then convert it into an electromagnetic wave for radiation through the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including UWB, wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared technology (IR).
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2 .
  • the antenna 1 of the electronic device 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • the GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the Beidou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
  • the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Mini LED, Micro LED, Micro OLED, quantum dot light-emitting diodes (QLED), and so on.
  • the electronic device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • the display screen 194 displays the interface content currently output by the system.
  • the interface content is an interface provided by an instant messaging application.
  • the electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 193 .
  • the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • a digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital video.
  • the electronic device 100 may support one or more video codecs.
  • the electronic device 100 can play or record videos in various encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the internal memory 121 may include one or more random access memories (RAM) and one or more non-volatile memories (NVM).
  • random access memory can include static random-access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM; for example, the fifth-generation DDR SDRAM is generally called DDR5 SDRAM), etc.
  • non-volatile memory may include magnetic disk storage devices and flash memory.
  • flash memory can be divided into NOR FLASH, NAND FLASH, 3D NAND FLASH, etc. according to the operating principle; can include single-level cell (SLC), multi-level cell (MLC), triple-level cell (TLC), quad-level cell (QLC), etc. according to the level of the storage cell potential; and can include universal flash storage (UFS), embedded multimedia memory card (eMMC), etc. according to the storage specification.
  • the random access memory can be directly read and written by the processor 110, can be used to store executable programs (e.g., machine instructions) of an operating system or other running programs, and can also be used to store data of users and application programs.
  • the non-volatile memory can also store executable programs and store data of user and application programs, etc., and can be loaded into the random access memory in advance for the processor 110 to directly read and write.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, saving files such as music and videos in the external memory card.
  • the electronic device 100 may implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the user can listen to music or to a hands-free call through the speaker 170A of the electronic device 100.
  • the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be heard by placing the receiver 170B close to the human ear.
  • the microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone interface 170D is used to connect wired earphones.
  • the earphone interface 170D may be the USB interface 130, or may be a 3.5mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • the pressure sensor 180A may be provided on the display screen 194 .
  • the gyro sensor 180B may be used to determine the motion attitude of the electronic device 100 .
  • the air pressure sensor 180C is used to measure air pressure.
  • the magnetic sensor 180D includes a Hall sensor.
  • the electronic device 100 can detect the opening and closing of the flip holster using the magnetic sensor 180D.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary.
  • the electronic device 100 can measure the distance through infrared or laser.
  • Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket, so as to prevent accidental touch.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking pictures with fingerprints, answering incoming calls with fingerprints, and the like.
  • the temperature sensor 180J is used to detect the temperature.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touchscreen, also called a "touch-control screen".
  • the touch sensor 180K is used to detect a touch operation acting on or near it, where the touch operation refers to an operation of a user's hand, elbow, stylus, etc. touching the display screen 194.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100 , which is different from the location where the display screen 194 is located.
  • the bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the bone that vibrates when a person speaks. The bone conduction sensor 180M can also contact the human pulse and receive the blood pressure pulse signal. In some embodiments, the bone conduction sensor 180M can also be disposed in an earphone to form a bone conduction earphone.
  • the audio module 170 can parse out the voice signal based on the vibration signal of the vibrating vocal bone obtained by the bone conduction sensor 180M, so as to realize the voice function.
  • the application processor can derive heart rate information from the blood pressure pulse signal obtained by the bone conduction sensor 180M, so as to realize the heart rate detection function.
  • the keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100 .
  • Motor 191 can generate vibrating cues.
  • the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • touch operations acting on different applications can correspond to different vibration feedback effects.
  • the motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 .
  • different application scenarios (for example, time reminders, receiving information, alarm clocks, and games) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be brought into contact with or separated from the electronic device 100 by being inserted into or pulled out of the SIM card interface 195.
  • FIG. 3 shows a block diagram of the software structure of the electronic device 100 according to the embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include, for example, applications such as camera, gallery, calendar, calling, map, navigation, WLAN, Bluetooth, music, video, games, shopping, travel, instant messaging (such as short messages).
  • the application package may also include system applications such as the home screen (i.e., the desktop), the minus-one screen, the control center, and the notification center.
  • the application layer in the embodiment of the present application includes a voice assistant and a voice processing module.
  • the voice processing module provides a voice processing capability that any application program, such as a voice assistant application, can invoke. The electronic device 100 receives a voice signal through the voice assistant application, and the voice assistant application invokes the voice processing module to process the voice signal.
  • the voice processing module includes speech recognition (automatic speech recognition, ASR), semantic understanding (natural language understanding, NLU), dialogue management (dialog management, DM), natural language generation (NLG), and speech synthesis (text to speech, TTS) capabilities, where:
  • the speech recognition module is used for recognizing the speech signal to obtain the textual representation information of the speech signal.
  • the speech recognition module can first represent the speech signal as text data, and then perform word segmentation processing on the text data to obtain the textual representation information of the speech signal, that is, convert the words in the speech signal into input readable by the electronic device 100, including, for example, binary encodings and character sequences.
  • a typical speech recognition method can be, for example, a method based on a vocal tract model and speech knowledge, or a template-matching method (comparing the similarity between the feature vector of the input speech signal and each template in the template library in turn, and taking the template with the highest similarity as the recognition result).
  • the embodiment of the present application does not limit which speech recognition method is used to perform speech recognition processing.
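  • As an illustration of the template-matching approach mentioned above, the following is a minimal sketch, not the method of the embodiments. It assumes that feature vectors are plain lists of numbers and that cosine similarity is the similarity measure; the template library contents and the names `cosine_similarity` and `match_template` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_template(features, template_library):
    """Compare the input with each template in turn and keep the most similar one."""
    best_label, best_score = None, -1.0
    for label, template in template_library.items():
        score = cosine_similarity(features, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical 3-dimensional feature vectors, for illustration only.
TEMPLATES = {
    "call": [0.9, 0.1, 0.0],
    "text": [0.1, 0.9, 0.1],
    "music": [0.0, 0.2, 0.9],
}

label, score = match_template([0.8, 0.2, 0.1], TEMPLATES)
print(label, round(score, 3))
```

  • A real recognizer would extract acoustic features from audio frames and would typically align sequences of different lengths before comparing them; the sketch only shows the "highest similarity wins" selection step.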
  • the semantic understanding module is used to convert the textual representation information of the speech signal into semantic information that the electronic device 100 can understand.
  • Semantic information includes entities, triples, intents, events, and so on. With this information, the electronic device 100 can understand the user's language and determine what the user wants to do.
  • the dialog management module is used to determine the next action to be performed by the electronic device 100 based on the semantic information. The actions to be performed include one or more of the following: playing the voice reply content (e.g., providing a result, asking for a specific restriction, or clarifying or confirming a requirement); displaying the text content of the voice reply content; jumping to the corresponding interface; and so on.
  • the dialogue management module determines the intent expressed in the semantic information, and then fills the slot corresponding to the intent according to the semantic information.
  • the intent is what the user wants to do, and the slot corresponding to the intent is the information needed to complete the intent. For example, if the intent is "call", the slot corresponding to "call" is who to call, that is, the object of the call. For another example, if the intent is "send text message", there are two slots corresponding to "send text message": the recipient of the text message and the content of the text message.
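  • The relationship between an intent and its slots can be sketched as a small data structure. This is an illustrative assumption rather than the data model of the embodiments: the intent names, the slot names, and the `DialogState` class are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical intent schema: each intent lists the slots it needs.
REQUIRED_SLOTS = {
    "call": ["callee"],
    "send_text_message": ["recipient", "content"],
}


@dataclass
class DialogState:
    intent: str
    slots: dict = field(default_factory=dict)

    def fill(self, semantic_info):
        """Copy any slot values found in the semantic information."""
        for name in REQUIRED_SLOTS[self.intent]:
            if name in semantic_info:
                self.slots[name] = semantic_info[name]

    def missing_slots(self):
        """Slots that still have to be provided before the intent can be completed."""
        return [n for n in REQUIRED_SLOTS[self.intent] if n not in self.slots]


state = DialogState(intent="send_text_message")
state.fill({"recipient": "Alice"})
print(state.missing_slots())  # ['content']
```

  • Keeping the required slots in a table separate from the dialogue state makes it easy to tell, at any point in the interaction, which information is still missing.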
  • dialogue management is a decision-making process.
  • the dialogue management module continuously determines the next action to be performed according to the current state during the voice interaction process, thereby assisting the user to complete the task of information acquisition or service acquisition. If this action requires voice interaction with the user, the natural language generation module will be triggered to generate language text that the user can understand; finally, the generated language text will be played by the speech synthesis module to the user.
  • the natural language generation module is used to convert data sets in non-linguistic formats into textual information in language formats that users can understand.
  • the natural language generation module determines what information should be included in the language text being constructed, and organizes the text in a reasonable order, combining multiple pieces of information into a single sentence. Then choose some connecting words and phrases to form a well-structured complete sentence.
  • the speech synthesis module is used to convert the textual information produced by the natural language generation module into artificial speech by mechanical and electronic means.
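  • The hand-off from dialogue management to natural language generation described above might look like the following sketch. It assumes a simple rule: ask for the first missing slot, otherwise execute the intent. The prompt strings and the function names `next_action` and `generate_text` are hypothetical.

```python
# Hypothetical prompts used when a slot is still missing.
SLOT_PROMPTS = {
    "callee": "Who would you like to call?",
    "recipient": "Who should I send the message to?",
    "content": "What should the message say?",
}


def next_action(intent, required_slots, filled_slots):
    """Decide the next action from the current dialogue state."""
    for name in required_slots:
        if name not in filled_slots:
            return {"type": "ask", "slot": name}
    return {"type": "execute", "intent": intent, "slots": dict(filled_slots)}


def generate_text(action):
    """Turn a non-linguistic action record into text for speech synthesis."""
    if action["type"] == "ask":
        return SLOT_PROMPTS[action["slot"]]
    details = ", ".join(f"{k}={v}" for k, v in action["slots"].items())
    return f"OK, running '{action['intent']}' with {details}."


action = next_action("call", ["callee"], {})
print(generate_text(action))  # Who would you like to call?
```

  • In this sketch the generated text would then be passed to the speech synthesis module; template-based generation is only one possible choice.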
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include an input manager, a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, a display manager, an activity manager, etc.
  • the application framework layer is illustrated by taking an example including an input manager, a window manager, a content provider, a view system, and an activity manager. It should be noted that any two modules in the input manager, window manager, content provider, view system, and activity manager can call each other.
  • the input manager is used to receive instructions or requests reported by lower layers such as the kernel layer and the hardware abstraction layer.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • the window manager is used to display a window including one or more shortcut controls when the electronic device 100 meets a preset trigger condition.
  • the activity manager is used to manage the activities running in the system, including process, application, service, task information and so on.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the view system is used to display a shortcut area on the display screen 194 when the electronic device 100 meets the preset trigger condition, and the shortcut area includes one or more shortcut controls added by the electronic device 100.
  • the present application does not limit the position and layout of the shortcut area, as well as the icons, positions, layout and functions of the controls in the shortcut area.
  • the display manager is used to transfer display content to the kernel layer.
  • the phone manager is used to provide the communication function of the electronic device 100, for example, the management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one is the functions that the Java language needs to call, and the other is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
  • a system library can include multiple functional modules, for example: a surface manager, a media library, a 3D graphics processing library (e.g., OpenGL ES), a 2D graphics engine (e.g., SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer at least includes a display driver, a camera driver, an audio driver, a sensor driver, a touch chip driver and an input system, and the like.
  • the kernel layer is illustrated by taking the input system, the driver of the touch chip, the display driver and the storage driver as an example.
  • the display driver and the storage driver may be jointly arranged in the driver module.
  • the structures illustrated in this application do not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown, or some components may be combined, or some components may be split, or a different arrangement of components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the cloud server 200 communicates with the electronic device 100 through a network.
  • the electronic device 100 detects that a voice signal is input, and the electronic device 100 activates the voice interaction function.
  • the electronic device may receive the first voice signal through a voice assistant application (APP).
  • the electronic device 100 detects whether the received voice signal contains a target object (for example, the target object is a preset wake-up word), and if it contains the target object, it enters the interactive state and activates the voice interaction function.
  • the target object can be preset when the electronic device 100 leaves the factory, can be preset in the voice assistant application, or can be set by the user in the process of using the electronic device 100. This application does not limit the length and content of the target object.
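  • The wake-up check described above can be sketched as follows. This is a simplified assumption: the check runs on an already-recognized transcript rather than on audio, and matching is limited to ASCII letters and digits. The function names `normalize` and `contains_wake_word` and the example wake-up word are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and keep only ASCII words, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def contains_wake_word(transcript, wake_word):
    """Return True if the transcript contains the preset wake-up word as whole words."""
    target = normalize(wake_word)
    if not target:
        return False
    # Pad with spaces so that partial-word matches are rejected.
    return f" {target} " in f" {normalize(transcript)} "


print(contains_wake_word("Hey, Assistant! Call Bob", "hey assistant"))
```

  • Only when this check succeeds would the device enter the interactive state; a production system would more likely run a dedicated keyword-spotting model directly on the audio.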
  • the electronic device 100 performs distribution control on the received voice signal based on a preset rule, and the distribution path includes path 1 and path 2.
  • the preset rule includes that when the network quality is good, the electronic device 100 uploads the received voice signal to the cloud server 200 for processing (path 1), where good network quality means that the electronic device 100 and the cloud server 200 can perform data transmission (including uplink and downlink data transmission).
  • when the network quality is poor or the network is disconnected, the electronic device 100 processes the received voice signal on the electronic device 100 (path 2), where poor network quality or disconnection means that the electronic device 100 and the cloud server 200 cannot perform data transmission (including uplink or downlink data transmission), or that the data transmission rate is lower than a threshold.
  • the preset rule can also be distributed according to the intent corresponding to the recognized voice signal.
  • the intent corresponding to the voice signal can be completed locally, such as making a call, sending a text message, opening a gallery, etc.
  • it can be The device 100 performs processing; when the intent corresponding to the voice signal needs to make the network, such as searching web pages, playing music online, etc., the processing can be performed on the cloud server 200 .
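  • As a non-limiting illustration, the preset distribution rule described above might be sketched as follows. The names `Path`, `NetworkStatus`, and `choose_path`, the threshold value, and the intent labels are assumptions made for this sketch; the application does not specify them, and combining the network rule with the intent rule in a single function is only one possible arrangement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Path(Enum):
    CLOUD = 1   # path 1: process on the cloud server 200
    DEVICE = 2  # path 2: process on the electronic device 100


# Hypothetical values chosen only for illustration.
MIN_RATE_KBPS = 64.0
LOCAL_INTENTS = {"make_call", "send_sms", "open_gallery"}


@dataclass
class NetworkStatus:
    connected: bool
    rate_kbps: float  # measured data transmission rate


def network_is_good(net: NetworkStatus) -> bool:
    """Good quality: data can be transmitted and the rate is not below the threshold."""
    return net.connected and net.rate_kbps >= MIN_RATE_KBPS


def choose_path(net: NetworkStatus, intent: Optional[str] = None) -> Path:
    """Apply the preset rule: network quality first, then the intent if it is known."""
    if not network_is_good(net):
        return Path.DEVICE
    if intent in LOCAL_INTENTS:
        return Path.DEVICE
    return Path.CLOUD
```

  • With these assumptions, `choose_path(NetworkStatus(True, 500.0))` selects path 1, while a disconnected network or a locally executable intent such as `"open_gallery"` selects path 2.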
  • Path 1: The voice signal is processed on the cloud server 200.
  • Step 1: The electronic device 100 uploads the voice signal to the cloud server 200. The cloud server 200 receives the voice signal, recognizes it through speech recognition technology (ASR), and converts it into text representation information; that is, the vocabulary in the voice signal is converted into input readable by the cloud server 200, such as binary codes or character sequences.
  • For example, the cloud server 200 may compare the feature vector of the input voice signal with each template in a template library in turn, take the template with the highest similarity as the recognition result, output text data, and then perform word segmentation on the text data to obtain the text representation information of the voice signal.
  • The cloud server 200 may also use a trained vocal tract model, neural network model, or the like to compute the text representation information corresponding to the voice signal.
  • The cloud server 200 may also preprocess the voice signal, for example by sampling, quantizing, removing audio data that contains no speech content (e.g., silence), and framing and windowing the audio data.
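  • The template comparison mentioned in step 1 can be illustrated with a minimal sketch. It assumes that each utterance has already been reduced to a fixed-length feature vector and uses cosine similarity as the similarity measure; both are assumptions for illustration, as are the function names and the toy template values. The application does not state which features or which measure are used.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(features: Sequence[float],
              templates: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Compare the input with every template in turn and keep the best match."""
    best_text, best_score = "", -1.0
    for text, template in templates.items():
        score = cosine_similarity(features, template)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


# Toy template library with made-up feature values.
TEMPLATES = {
    "open the gallery": [0.9, 0.1, 0.0],
    "make a call": [0.1, 0.9, 0.2],
}
text, score = recognize([0.2, 0.8, 0.1], TEMPLATES)
```

  • Here the input vector is closest to the "make a call" template, so that text is returned as the recognition result and would then be passed on to word segmentation.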
  • Step 2: After speech recognition, the cloud server 200 converts the text representation information into machine-understandable semantic information through semantic understanding technology (NLU).
  • The execution of semantic understanding can be summarized in the following steps.
  • The cloud server 200 segments the text representation information obtained by speech recognition into a series of units carrying semantics and grammar; the term "token" is usually used to denote a unit obtained by text segmentation.
  • A common text segmentation method is "word segmentation", that is, the text is segmented at the granularity of "words".
  • Models used for word segmentation may include first-order Markov models, hidden Markov models, conditional random fields, recurrent neural networks, and the like.
  • A text representation model, such as a word vector space model or a distributed representation model, is then used to obtain a numerical vector or matrix, which is the numerical representation of the text.
  • In this way, the cloud server 200 can understand the user's language and determine what the user wants to do.
  • Step 3: The cloud server 200 performs dialog management based on the semantic information.
  • Dialog management refers to the process in which the cloud server 200 determines the action to be executed next based on the semantic information.
  • The actions include one or more of the following: playing voice reply content (for example, providing results, asking about specific constraints, or clarifying or confirming requirements); displaying the text of the voice reply content; jumping to the corresponding interface; and so on.
  • The cloud server 200 determines the intent expressed in the semantic information, and then fills the slots corresponding to the intent according to the semantic information.
  • The intent is what the user wants to do, and the slots corresponding to the intent are the pieces of information needed to complete the intent; an intent can correspond to one or more slots.
  • The cloud server 200 fills the slots of the intent based on the semantic information. If the information for one or more slots is missing because the semantic information is insufficient, the cloud server 200 determines that the next action is to inquire further about the missing slots. If no slot information is missing, the user's intent is converted into an explicit instruction, and the electronic device 100 is instructed to perform the corresponding action.
  • For example, the cloud server 200 acquires the semantic information of the voice signal "Please help me open the gallery" and determines from it that the user's intent is to open an object, and that the slot corresponding to the intent is the object to be opened.
  • The cloud server 200 fills the slot according to the semantic information of the voice signal and determines that the object to be opened is the gallery.
  • Dialog management then determines an explicit instruction based on the semantic information, namely an instruction to open the gallery.
  • The above is an example in which the cloud server 200 determines the operation instruction based on the user's voice signal when no slot information is missing.
  • When slot information is missing, the cloud server 200 needs to save the current intent and slot information and inquire further about the missing slots. This situation is typically called a multi-turn conversation.
  • For example, the cloud server 200 obtains the semantic information of the voice signal "I want to make a call" and determines from it that the user's intent is to make a call, and that the slot corresponding to the intent is the object of the call. Because the semantic information is insufficient, this slot is missing, so dialog management determines an explicit instruction based on the semantic information, namely to play voice reply content that inquires further about the missing slot.
  • Dialog management may also determine another explicit instruction based on the semantic information of the voice signal, namely to display the text of the voice reply content.
  • When the cloud server 200 next receives a voice signal, it repeats steps 1 and 2 above, and then, based on the saved intent and slot information, fills the missing slot with the semantic information corresponding to that voice signal. If all slots are then filled, that is, no slot information is missing, the user's intent is converted into an explicit instruction, and the electronic device 100 is instructed to perform the corresponding action.
  • For example, if the intent saved by the cloud server 200 is to make a call and the missing slot is the object of the call, then when the cloud server 200 next receives the voice signal "Xiao Ming", it fills the slot "object of the call" with the semantic information of that voice signal and thereby determines the instruction, namely to call Xiao Ming.
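  • The intent and slot handling of step 3 can be illustrated with a small sketch. The intent names, slot names, the `DialogContext` class, and the shape of the returned action are hypothetical and chosen only to mirror the two examples above; they are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical intent schema: intent name -> slots needed to complete it.
INTENT_SLOTS: Dict[str, List[str]] = {
    "open_object": ["object"],
    "make_call": ["callee"],
}


@dataclass
class DialogContext:
    intent: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Names of the slots that still have no value."""
        return [name for name, value in self.slots.items() if value is None]


def new_context(intent: str, found: Dict[str, str]) -> DialogContext:
    """Create the context for an intent, filling the slots found in the utterance."""
    slots = {name: found.get(name) for name in INTENT_SLOTS[intent]}
    return DialogContext(intent, slots)


def next_action(ctx: DialogContext) -> Dict[str, Optional[str]]:
    """Ask about the first missing slot, or emit an explicit instruction."""
    missing = ctx.missing()
    if missing:
        return {"type": "ask", "slot": missing[0]}
    return {"type": "execute", "intent": ctx.intent, **ctx.slots}


# "I want to make a call": the callee slot is missing, so a question is asked.
ctx = new_context("make_call", {})
first = next_action(ctx)
# Next turn, "Xiao Ming": the saved context is completed and executed.
ctx.slots["callee"] = "Xiao Ming"
second = next_action(ctx)
```

  • In this sketch the first turn yields an "ask" action for the missing slot, and the second turn yields an "execute" action once the saved context has been completed, which corresponds to the multi-turn conversation described above.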
  • Step 4: After the cloud server 200 determines the next action, if the action requires voice interaction with the user, such as outputting voice reply content, the cloud server 200 can generate language text that the user can understand based on natural language generation technology, and then perform speech synthesis on the generated text to produce voice data.
  • The cloud server 200 determines which information the text being constructed should include and arranges it in a reasonable order, combining multiple pieces of information into a sentence; it then chooses conjunctions and phrases to join this information into a well-structured, complete sentence.
  • Step 5: The cloud server 200 sends an instruction to the electronic device 100 to instruct it to perform the action.
  • For example, the cloud server 200 sends an instruction carrying voice data (the voice reply content) to the electronic device 100, instructing the electronic device 100 to output the voice reply content.
  • As another example, the cloud server 200 sends an instruction carrying text data (the text of the voice reply content) to the electronic device 100, instructing the electronic device 100 to display the text data.
  • Step 4 is optional: if the next action determined by the cloud server 200 does not require voice reply content to be output, step 4 need not be performed. In that case, based on steps 1, 2, and 3 above, the cloud server 200 sends an instruction to the electronic device 100, for example to instruct it to perform an interface jump.
  • Path 2: The voice signal is processed on the electronic device 100.
  • The electronic device 100 receives the voice signal, recognizes it through speech recognition technology, and converts it into text representation information. The text representation information is then converted into machine-understandable semantic information through semantic understanding technology. Next, the electronic device 100 determines the next action based on the semantic information. If the action requires voice interaction with the user, such as outputting voice reply content, the electronic device 100 can generate language text that the user can understand based on natural language generation technology, perform speech synthesis on the generated text to produce voice data, and output the voice data. If the next action does not require voice reply content to be output, the electronic device 100 does not need speech synthesis and instead performs, for example, an interface jump.
  • The electronic device 100 distributes the voice service based on the network condition: if the network condition is good, the voice signal is uploaded to the cloud server 200 for processing (path 1 above); if the network is disconnected or the network quality is poor, the voice signal is processed on the electronic device 100 (path 2 above). In this case, if the voice service switches paths during processing because of the network, the original voice service cannot continue, which affects the user experience.
  • For example, when the cloud server 200 receives a voice signal for the first time, if the semantic information of the voice signal is insufficient and one or more slots are missing, the cloud server 200 needs to save the current intent and slot information and inquire further about the missing slots.
  • For a voice service that requires multiple rounds of dialogue, if the network is interrupted before the cloud server 200 receives the next voice signal, the cloud server 200 cannot receive that signal, and the electronic device 100 distributes it to itself for processing. The electronic device 100 cannot continue the original voice service based on the semantic information of the next voice signal alone, so the original voice service is interrupted, which affects the user experience.
  • The embodiments of the present application further provide a voice interaction processing method.
  • In this method, when the cloud server 200 sends an instruction to the electronic device 100 to instruct it to perform the corresponding action, it synchronously sends the context of the voice dialogue (intent and slot information) to the electronic device 100. If a network interruption during a multi-round voice service causes an end-cloud switch (a switch between the electronic device 100 and the cloud server 200, that is, between path 1 and path 2), the electronic device 100 can still continue the original voice service based on the context of the voice dialogue and the next voice signal it receives, which solves the problem of multi-round voice services being interrupted.
  • FIG. 5A shows a voice interaction process in the voice interaction system 10.
  • The electronic device 100 receives voice 1 and starts the voice interaction function.
  • The electronic device 100 performs distribution control on the received voice 1 based on preset rules. For example, if the network quality is good at this time, the electronic device 100 uploads the received voice 1 to the cloud server 200 for processing.
  • The processing includes speech recognition, semantic understanding, dialogue management, speech synthesis, and other processes.
  • Voice 1 may also be referred to as the first voice signal.
  • The principles of speech recognition, semantic understanding, dialogue management, and speech synthesis at time T1 in the embodiment shown in FIG. 5A are similar to those of path 1 in the embodiment shown in FIG. 4. Therefore, for the implementation of speech recognition, semantic understanding, dialogue management, and speech synthesis by the cloud server 200 at time T1, reference may be made to the descriptions of steps 1 to 4 of the cloud server 200 in path 1 of FIG. 4 above, and details are not repeated here.
  • The cloud server 200 determines the next action and sends an instruction to the electronic device 100 to instruct it to perform action 1. The cloud server 200 also synchronously sends the electronic device 100 the voice dialogue context, which refers to the intent and slot information that the cloud server 200 obtained by recognizing and understanding voice 1.
  • Action 1 includes one or more of the following: playing the voice reply content for voice 1 (for example, providing results, asking about specific constraints, or clarifying or confirming requirements); displaying the text of the voice reply content; jumping to the corresponding interface; and so on.
  • After receiving the instruction and the dialogue context, the electronic device 100 forwards them through the dialogue information forwarding module, executes action 1 based on the instruction, and saves the dialogue context.
  • The dialogue information forwarding module may be regarded as a node that receives and forwards the data sent by the cloud server 200.
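  • One possible shape for the data exchanged at this point is sketched below. The JSON layout, the field names, and the `DialogInfoForwarder` class are assumptions for illustration only; the application does not define a message format or the internal interface of the dialogue information forwarding module.

```python
import json
from typing import Any, Dict, List, Optional


def build_reply(action: Dict[str, Any], intent: str,
                slots: Dict[str, Optional[str]]) -> str:
    """Server side: bundle the instruction with the dialogue context."""
    return json.dumps({
        "instruction": action,
        "context": {"intent": intent, "slots": slots},
    })


class DialogInfoForwarder:
    """Device side: receives the reply and forwards its two parts."""

    def __init__(self) -> None:
        self.saved_context: Optional[Dict[str, Any]] = None
        self.executed: List[Dict[str, Any]] = []

    def on_reply(self, payload: str) -> None:
        message = json.loads(payload)
        self.executed.append(message["instruction"])  # stands in for executing action 1
        self.saved_context = message["context"]       # kept for a later turn


payload = build_reply(
    {"type": "play_voice", "text": "Who do you want to call"},
    "make_call",
    {"callee": None},
)
forwarder = DialogInfoForwarder()
forwarder.on_reply(payload)
```

  • After `on_reply` runs, the device holds both the instruction to execute and a saved copy of the intent and slot information, which is what allows a later turn to be completed locally.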
  • At time T2, the network quality is poor.
  • After the electronic device 100 outputs the voice reply content for voice 1, it receives voice 2. Because the network quality is poor at this time, data cannot be transmitted between the electronic device 100 and the cloud server 200, so the electronic device 100 fails to upload voice 2 to the cloud server 200.
  • The electronic device 100 invokes its own speech processing capability to process voice 2; the processing includes speech recognition, semantic understanding, dialogue management, speech synthesis, and other processes.
  • For the implementation of speech recognition, semantic understanding, dialogue management, and speech synthesis by the electronic device 100 at time T2, reference may be made to the description of the electronic device 100 in path 2 of FIG. 4 above, which is not repeated here.
  • Voice 2 may also be referred to as the second voice signal.
  • The electronic device 100 determines the next action based on voice 2 and the dialogue context saved at time T1.
  • The electronic device 100 fills the missing slots based on the semantic information corresponding to voice 2 together with the saved intent and slot information. If the semantic information corresponding to voice 2 is insufficient, so that one or more slots are still missing, the electronic device 100 determines that the next action is to inquire further about the missing slots. If no slot is missing, the user's intent is converted into an explicit instruction, and the electronic device 100 performs the corresponding action.
  • Time T2 falls after the electronic device 100 receives the instruction and dialogue context sent by the cloud server 200 for voice 1 and before the electronic device 100 uploads voice 2 to the cloud server 200.
  • For example, time T2 may be while the electronic device 100 is receiving voice 2, or after voice 2 is received and before it is uploaded to the cloud server 200. That is, because of poor network quality at time T2, voice 2 cannot be uploaded to the cloud server 200.
  • Time T2 may also fall after the electronic device 100 uploads voice 2 to the cloud server 200 and before the cloud server 200 sends the instruction to the electronic device 100. That is, because of poor network quality, the cloud server 200 cannot deliver the instruction generated for voice 2 to the electronic device 100.
  • For example, the electronic device 100 uploads voice 2 to the cloud server 200, and the cloud server 200 processes voice 2; the processing includes speech recognition, semantic understanding, dialogue management, and speech synthesis.
  • A network quality problem then occurs: data cannot be transmitted between the electronic device 100 and the cloud server 200, and the cloud server 200 cannot deliver the instruction generated for voice 2 to the electronic device 100.
  • The electronic device 100 then invokes its own voice processing capability to process voice 2 (for example, a backup copy of voice 2).
  • As another example, if, after uploading voice 2 to the cloud server 200 and before receiving the instruction issued by the cloud server 200 for voice 2, the electronic device 100 detects that the network connection with the cloud server 200 is disconnected, it invokes its own voice processing capability to process voice 2 (for example, the backup copy of voice 2).
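  • The fallback to a backup copy of voice 2 might look like the following sketch. The function names and the use of `ConnectionError` and `TimeoutError` to represent a failed upload or an undelivered instruction are assumptions for illustration; the application does not say how the failure is detected.

```python
from typing import Callable


def handle_voice(voice: bytes,
                 send_to_cloud: Callable[[bytes], str],
                 process_locally: Callable[[bytes], str]) -> str:
    """Try the cloud first; on a network failure, process the backup locally."""
    backup = bytes(voice)  # keep a copy before attempting the upload
    try:
        return send_to_cloud(voice)
    except (ConnectionError, TimeoutError):
        # Upload failed, or the cloud's instruction never arrived.
        return process_locally(backup)


# Hypothetical stand-ins for the two processing paths.
def cloud_unreachable(voice: bytes) -> str:
    raise ConnectionError("no connection to the cloud server")


def local_pipeline(voice: bytes) -> str:
    return "handled on device: " + voice.decode("utf-8")


result = handle_voice(b"Xiao Ming", cloud_unreachable, local_pipeline)
```

  • Because the copy is taken before the upload is attempted, the same utterance remains available for local processing whether the failure occurs during the upload or while waiting for the reply.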
  • Every time the cloud server 200 issues an instruction, it synchronously sends the dialogue context to the electronic device 100, and the electronic device 100 receives and saves it.
  • The electronic device 100 can then continue the voice service that was being processed on the cloud server 200 based on the saved dialogue context, so that the voice service is not interrupted, and both the processing efficiency of the voice service and the user experience are improved.
  • Alternatively, each time the cloud server 200 issues an instruction, it sends the dialogue context, that is, the intent and the slot information, to the electronic device 100 only if at least one of the one or more slots is missing.
  • The cloud server 200 determines, based on the semantic information corresponding to voice 1, the intent it expresses and the slot information corresponding to that intent; an intent can correspond to one or more slots.
  • The cloud server 200 fills the slot information of the intent based on the semantic information. If all slots are completely filled, that is, no slot information is missing, the user's intent is converted into an explicit instruction, and the cloud server 200 sends the instruction to the electronic device 100 to instruct it to perform the corresponding action.
  • For example, the cloud server 200 obtains the voice signal "Please help me open the gallery" and determines from its semantic information that the user's intent is to open an object and that the slot corresponding to the intent is the object to be opened. The cloud server 200 fills the slot according to the semantic information and determines that the object to be opened is the gallery. Dialog management then determines an explicit instruction, namely an instruction to open the gallery. Because the user's intent is complete at this point and the cloud server 200 determines that the intent has ended, the cloud server 200 does not need to send the dialogue context (intent and slot information) to the electronic device 100, which saves resources.
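  • The resource-saving variant, in which the context accompanies the instruction only while a slot is still missing, can be sketched as follows. The function name `reply_with_optional_context` and the dictionary layout are hypothetical.

```python
from typing import Any, Dict, Optional


def reply_with_optional_context(instruction: Dict[str, Any], intent: str,
                                slots: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Attach the dialogue context only if at least one slot is still empty."""
    reply: Dict[str, Any] = {"instruction": instruction}
    if any(value is None for value in slots.values()):
        reply["context"] = {"intent": intent, "slots": dict(slots)}
    return reply


# "Please help me open the gallery": the intent is complete, no context is sent.
done = reply_with_optional_context(
    {"type": "open", "target": "gallery"}, "open_object", {"object": "gallery"})

# "I want to make a call": the callee is missing, so the context is sent.
pending = reply_with_optional_context(
    {"type": "play_voice", "text": "Who do you want to call"},
    "make_call", {"callee": None})
```

  • In the first case the reply contains only the instruction; in the second it also carries the intent and the vacant slot so that the device can finish the dialogue if needed.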
  • If slot information is missing, the cloud server 200 needs to save the current intent and slot information, inquire further about the missing slots, and, once the next voice signal is received, fill the slots from that signal in combination with the saved intent and slot information to determine the next action.
  • The cloud server 200 generates the voice reply content for voice 1 based on speech synthesis technology, sends an instruction to the electronic device 100 to instruct it to output the voice reply content, and synchronously sends the dialogue context (the intent and slot information corresponding to voice 1); the electronic device 100 receives and saves the dialogue context.
  • In this way, the electronic device 100 can use its own voice interaction capability, in combination with the saved dialogue context, to process the next voice signal it receives, which improves the processing efficiency of the voice service and the user experience.
  • When two or more slots are missing, the cloud server 200 not only synchronously sends the intent and slot information to the electronic device 100 but can also mark the slots to indicate to the electronic device 100 the order in which they are to be filled. In this way, the electronic device 100 can accurately fill one of the slots when processing the next voice signal.
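  • The slot marking can be illustrated with a short sketch in which the context carries an explicit fill order. The `fill_order` field, the slot names, and the sample message text are assumptions made for this sketch; the application only states that the slots may be marked to indicate the filling order.

```python
from typing import Any, Dict, Optional


def fill_next_slot(context: Dict[str, Any], value: str) -> Optional[str]:
    """Put `value` into the first still-empty slot in the marked order.

    Returns the name of the slot that was filled, or None if none was empty.
    """
    for name in context["fill_order"]:
        if context["slots"][name] is None:
            context["slots"][name] = value
            return name
    return None


context = {
    "intent": "send_sms",
    "slots": {"recipient": None, "content": None},
    "fill_order": ["recipient", "content"],  # marking added by the server
}

first_filled = fill_next_slot(context, "Xiao Ming")
second_filled = fill_next_slot(context, "See you at eight")  # made-up message text
```

  • With the order marked, a bare answer such as "Xiao Ming" is assigned to the recipient slot rather than to the message content, even though the answer itself does not say which slot it belongs to.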
  • For example, the network quality is good at time T1, and the user wants to make a phone call by voice.
  • The electronic device 100 receives the voice signal "I want to make a call" input by the user through the voice assistant application (APP) and performs distribution control on the received voice signal based on preset rules. For example, when the network quality is good, the electronic device 100 uploads the received "I want to make a call" to the cloud server 200 for processing.
  • After receiving the voice signal "I want to make a call", the cloud server 200 converts the voice signal into text information using speech recognition technology (ASR), obtains the semantic information using semantic understanding technology (NLU), and recognizes that the user's intent is to make a phone call.
  • Next, the cloud server 200 determines that the slot information corresponding to the intent to make a call includes the object of the call, and fills the slot according to the semantic information. The cloud server 200 recognizes that the semantic information of "I want to make a call" does not include the object of the call; that is, the cloud server 200 determines that the information for the slot (object of the call) corresponding to the intent (make a call) is vacant.
  • The cloud server 200 determines that the next action is to inquire about the vacant slot information.
  • The cloud server 200 generates the voice reply content "Who do you want to call" using speech synthesis technology (TTS) and sends an instruction carrying the voice reply content to the electronic device 100, instructing the electronic device 100 to play it.
  • In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100, where the dialogue context includes the intent "make a call" and the slot information "object of the call (vacant)".
  • The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "Who do you want to call" based on the instruction, and saves the dialogue context.
  • The cloud server 200 may also send the electronic device 100 an instruction carrying the text data of the voice reply content, instructing the electronic device 100 to display the text data (the text "Who do you want to call").
  • After the electronic device 100 plays the voice reply content "Who do you want to call", the user inputs a further voice signal, "Call Xiao Ming". At time T2, the network quality is poor, and the electronic device 100 invokes its own voice processing capability to process the voice signal "Call Xiao Ming".
  • The electronic device 100 converts the voice signal into text information using speech recognition technology (ASR) and obtains the semantic information using semantic understanding technology (NLU). Next, the electronic device 100 fills the slot based on the saved intent "make a call" and slot information "object of the call (vacant)", together with the semantic information corresponding to "Call Xiao Ming".
  • The electronic device 100 recognizes that "Xiao Ming" in the semantic information of "Call Xiao Ming" is the object of the call; that is, the electronic device 100 determines that the information for the slot (object of the call) corresponding to the intent (make a call) is "Xiao Ming".
  • The electronic device 100 determines that the next action is to call Xiao Ming and to output the voice reply content "Calling Xiao Ming".
  • The electronic device 100 generates the voice reply content "Calling Xiao Ming" using speech synthesis technology (TTS) and plays it.
  • The electronic device 100 looks up the contact Xiao Ming in the address book and invokes the call capability to call Xiao Ming.
  • The electronic device 100 may also display the text data of the voice reply content (the text "Calling Xiao Ming").
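  • The phone call scenario can be traced end to end with the following sketch. The `Device` class, the reply layout, and the `extract_callee` helper (a toy stand-in for on-device speech recognition and semantic understanding that simply drops a leading "call ") are hypothetical and serve only to show how the saved context lets the device finish a dialogue that began on the cloud.

```python
from typing import Any, Dict, Optional


def extract_callee(utterance: str) -> str:
    """Toy stand-in for on-device ASR and NLU: drop a leading "call "."""
    text = utterance.strip()
    return text[5:] if text.lower().startswith("call ") else text


class Device:
    def __init__(self) -> None:
        self.context: Optional[Dict[str, Any]] = None  # saved dialogue context

    def on_cloud_reply(self, reply: Dict[str, Any]) -> str:
        """Save the context sent with the instruction and return the prompt to play."""
        self.context = reply.get("context")
        return reply["instruction"]["text"]

    def process_locally(self, utterance: str) -> str:
        """Continue the saved dialogue on the device when the cloud is unreachable."""
        if not self.context or self.context["intent"] != "make_call":
            return "unsupported"
        self.context["slots"]["callee"] = extract_callee(utterance)
        callee = self.context["slots"]["callee"]
        self.context = None  # intent completed, context no longer needed
        return f"Calling {callee}"


# Time T1 (good network): reply produced by the cloud for "I want to make a call".
cloud_reply = {
    "instruction": {"type": "play_voice", "text": "Who do you want to call"},
    "context": {"intent": "make_call", "slots": {"callee": None}},
}
device = Device()
prompt = device.on_cloud_reply(cloud_reply)

# Time T2 (poor network): the second utterance is handled on the device.
answer = device.process_locally("Call Xiao Ming")
```

  • Without the saved context, `process_locally` in this sketch has nothing to continue and returns "unsupported", which corresponds to the interrupted voice service that the method is intended to avoid.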
  • The above describes the voice interaction processing method in a phone call scenario.
  • The following takes a smartphone as an example of the electronic device 100 and shows some voice interaction processes in specific scenarios.
  • When the network quality changes from good to poor, the processing of the voice signal is switched from the cloud server 200 to the electronic device 100. Because the cloud server 200 delivers the dialogue context of the voice and it is saved on the electronic device 100, the electronic device 100 can keep the voice service uninterrupted even if the network is interrupted during multiple rounds of dialogue.
  • The wake-up word is set to "Xiaoyi Xiaoyi".
  • Smartphone (electronic device 100): Okay, I'm calling Xiao Ming for you.
  • The following describes, with reference to FIGS. 8A to 8D and taking the above voice dialogue as an example, how the voice interaction processing method provided by the embodiments of the present application appears on the display interface of a smartphone.
  • FIG. 8A shows a voice interaction interface 801, which may be, for example, an interface of a voice assistant application.
  • The voice interaction interface 801 includes a status bar 8011 and a function bar 8012.
  • The status bar 8011 may include one or more signal strength indicators 8013 for wireless network signals, a battery status indicator 8014, and a time indicator 8015.
  • The signal strength indicator 8013 indicates the current network quality (it may also indicate the data transmission rate between the electronic device 100 and the cloud server 200). In FIG. 8A, the signal strength indicator 8013 is full (4 bars), indicating that the current network quality is good.
  • The function bar 8012 may include one or more function controls, such as a voice input control 8016.
  • When the electronic device 100 detects a user operation on the voice input control 8016, the electronic device 100 receives a voice signal. As shown in FIG. 8A, the electronic device 100 receives the voice signal "Xiaoyi Xiaoyi, I want to make a call" and displays it on the voice interaction interface 801.
  • After receiving the voice signal "Xiaoyi Xiaoyi, I want to make a call", the electronic device 100 can upload the voice signal to the cloud server 200 for processing, play the voice reply content "Who do you want to call", and display it on the voice interaction interface 802.
  • The voice input control 8016 changes into a voice output control 8026, indicating that the electronic device 100 is currently outputting voice.
  • When the cloud server 200 returns the instruction, it synchronously returns the voice dialogue context to the electronic device 100, and the electronic device 100 receives and saves the dialogue context.
  • The user continues to input voice. As shown in FIG. 8C, the current network quality of the electronic device 100 is poor and the signal strength indicator 8033 shows only two bars, so the electronic device 100 and the cloud server 200 cannot transmit data, or the data transmission rate is too low.
  • The electronic device 100 receives the voice signal "Xiao Ming", but it cannot upload the voice signal to the cloud server 200 for processing, or the cloud server 200 cannot issue an instruction to the electronic device 100.
  • The electronic device 100 can continue to process the voice signal "Xiao Ming" based on the saved dialogue context, play the voice reply content "Okay, I'm calling Xiao Ming for you", and display it on the voice interaction interface 803.
  • The electronic device 100 then jumps to the call interface. FIG. 8D shows a call interface 804 indicating that the electronic device 100 is currently calling Xiao Ming.
  • The above is an application scenario in which the voice service is a multi-round dialogue (specifically, two rounds).
  • In this process, the network quality of the electronic device 100 changes from good to poor, and the processing of the voice signal is switched from the cloud server 200 to the electronic device 100. Because the cloud server 200 delivers the voice dialogue context and it is saved on the electronic device 100, the electronic device 100 can keep the voice service uninterrupted even if the network is interrupted during multiple rounds of dialogue, which improves the processing efficiency of the voice service.
  • The embodiments of the present application further provide an application scenario with three rounds of dialogue.
  • Taking the scenario of sending a text message as an example, the following briefly describes the voice interaction processing method of the embodiments of the present application in that scenario.
  • The electronic device 100 receives the voice signal "I want to send a text message" input by the user and performs distribution control on the received voice signal based on preset rules.
  • For example, when the network quality is good, the electronic device 100 uploads the received "I want to send a text message" to the cloud server 200 for processing.
  • The cloud server 200 recognizes that the user's intent is to send a text message. Next, the cloud server 200 determines that the slot information corresponding to the intent includes the object of the text message and the content of the text message, and fills the slots according to the semantic information. The cloud server 200 recognizes that the semantic information of "I want to send a text message" includes neither the object nor the content of the text message; that is, the cloud server 200 determines that the information for the slots (object of the text message, content of the text message) corresponding to the intent (send a text message) is vacant.
  • The cloud server 200 determines that the next action is to inquire about the vacant slot information. Since two slots are vacant, the cloud server 200 can inquire about one of them according to priority, for example asking first about the object of the text message. The cloud server 200 generates the voice reply content "Who do you want to text" using speech synthesis technology (TTS) and sends an instruction carrying the voice reply content to the electronic device 100, instructing it to play the voice reply content. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100, where the dialogue context includes the intent "send a text message" and the slot information "object of the text message (vacant), content of the text message (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "Who do you want to text" based on the instruction, and saves the dialogue context.
  • The electronic device 100 plays the voice reply content "Who do you want to text".
  • The user then inputs the voice signal "to Xiao Ming".
  • The electronic device 100 uploads the received voice signal "to Xiao Ming" to the cloud server 200 for processing.
  • The cloud server 200 fills the slots based on the stored intent "send text message", the stored slot information "recipient of the text message (vacant), content of the text message (vacant)", and the semantic information corresponding to "to Xiao Ming".
  • The cloud server 200 recognizes that "Xiao Ming" in the semantic information of "to Xiao Ming" is the recipient of the text message; that is, the cloud server 200 determines that the value of the slot (recipient of the text message) corresponding to the intent (send text message) is "Xiao Ming".
  • Since the slot "content of the text message" is still vacant at this time, the cloud server 200 saves the current intent and slot information and determines that the next action is to ask about the vacant slot (content of the text message). The cloud server 200 generates the voice reply content "What do you want to send" using text-to-speech (TTS) technology, sends an instruction carrying the voice reply content to the electronic device 100, and instructs the electronic device 100 to play it. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100. At this time, the dialogue context includes the intent (send text message) and the slot information "recipient of the text message (Xiao Ming), content of the text message (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server, plays the voice reply content "What do you want to send" based on the instruction, and saves the dialogue context.
  • The electronic device 100 plays the voice reply content "Who do you want to text". If the network quality is poor at this time, the electronic device 100 invokes its own voice processing capability to process the voice signal "to Xiao Ming".
  • The electronic device 100 converts the voice signal into text using automatic speech recognition (ASR), and obtains the semantic information using natural language understanding (NLU).
  • The electronic device 100 fills the slots based on the stored intent "send text message", the stored slot information "recipient of the text message (vacant), content of the text message (vacant)", and the semantic information corresponding to "to Xiao Ming".
  • The electronic device 100 recognizes that "Xiao Ming" in the semantic information of "to Xiao Ming" is the recipient of the text message; that is, the electronic device 100 determines that the value of the slot (recipient of the text message) corresponding to the intent (send text message) is "Xiao Ming".
  • When two or more slots are missing, the cloud server 200 not only synchronously sends the intent and slot information to the electronic device 100, but can also mark the slots to indicate to the electronic device 100 the order in which the slots are to be filled. In this way, the electronic device can accurately fill one of the slots when processing the next voice signal.
  • For example, when the cloud server 200 synchronously sends the intent and slot information to the electronic device 100, since two slots are vacant, the cloud server 200 can mark which slot is to be filled next. Then, when the electronic device 100 fills the slot, it can do so directly without judging which slot the semantic information corresponds to. That is, the electronic device 100 can directly determine that the value of the slot (recipient of the text message) corresponding to the intent (send text message) is "Xiao Ming".
  • Since the slot "content of the text message" is still vacant at this time, the electronic device 100 saves the current intent and slot information and determines that the next action is to ask about the vacant slot (content of the text message). The electronic device 100 generates the voice reply content "What do you want to send" using text-to-speech (TTS) technology and plays it.
  • the electronic device 100 determines that the intent has been executed.
  • the present application provides a voice interaction processing method, as shown in FIG. 9 , the method includes:
  • the electronic device 100 establishes a connection with the cloud server 200 .
  • Step S101 The electronic device 100 receives the first voice signal.
  • the first voice signal may be, for example, the voice 1 in the above-mentioned FIG. 5A or FIG. 5B , or the voice “I want to make a call” in FIG. 6 .
  • Step S102 the electronic device 100 uploads the first voice signal to the cloud server 200 .
  • Step S103 The cloud server 200 identifies the first voice signal, obtains the corresponding intent and one or more slot information corresponding to the intent, and determines the content of the first voice reply based on the intent and the one or more slot information.
  • Step S104 the cloud server 200 sends the first voice reply content, intent and one or more slot information to the electronic device 100 .
  • Step S105 The electronic device 100 outputs the first voice reply content, and saves the intent and one or more slot information.
  • the first voice reply content can be, for example, the voice reply content included in Action 1 in FIG. 5A , or the voice “who do you want to call” in FIG. 6 .
  • the communication quality between the electronic device 100 and the cloud server 200 is poor.
  • Step S106 The electronic device 100 receives the second voice signal.
  • The second voice signal may be, for example, the voice 2 in the above-mentioned FIG. 5A or FIG. 5B, or the voice "call Xiaoming" in FIG. 6.
  • Step S107 The electronic device 100 recognizes the second voice signal, obtains corresponding semantic information, and determines the first operation based on the intent and one or more slot information and semantic information.
  • Step S108 Execute the first operation.
  • The first operation may be, for example, Action 2 in FIG. 5A or FIG. 5B, or one or more of the following three actions in FIG. 6: playing the voice content "Calling Xiaoming", displaying the text content "Calling Xiaoming", and placing the call to Xiaoming.
  • poor communication quality between the electronic device 100 and the cloud server 200 may occur at any time period between steps S106 and S107.
  • The electronic device 100 recognizing the second voice signal, obtaining corresponding semantic information, and determining the first operation based on the intent, the one or more slot information, and the semantic information includes: the electronic device 100 identifies that the semantic information matches one of the missing slots in the one or more slot information, and fills in the semantic information as the value of that slot; the electronic device then determines the first operation based on the intent and the filled one or more slot information.
  • Here, the process by which the electronic device handles the original voice service based on the second voice signal is described in detail. Since the electronic device has obtained the intent and slot information corresponding to the first voice signal, it can use that intent and slot information to perform slot filling on the received second voice signal, and can thereby continue processing the original voice service.
  • the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
  • the second voice reply content may be, for example, the voice reply content included in Action 2 in FIG. 5A or FIG. 5B , or may be the voice “calling Xiaoming” in FIG. 6 .
  • the method further includes: the electronic device receives the first instruction sent by the cloud server; the electronic device displays the text content of the first voice reply content based on the first instruction, and/or jumps to a corresponding interface.
  • The poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, no reply data is received from the cloud server within the preset time.
  • This describes when poor communication quality can occur: the communication quality between the electronic device and the cloud server may be poor when the electronic device uploads the second voice signal, or when the cloud server delivers the voice reply content for the second voice signal.
  • the electronic device receiving the first voice signal includes: the electronic device receives the first voice signal through a voice assistant application.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the methods described in the above method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
  • Computer-readable media can include both computer storage media and communication media and also include any medium that can transfer a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a computer.
  • the embodiments of the present application also provide a computer program product.
  • the methods described in the above method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the above-mentioned computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the above-mentioned method embodiments are generated.
  • the aforementioned computers may be general purpose computers, special purpose computers, computer networks, network equipment, user equipment, or other programmable devices.
  • The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless means (e.g., infrared, radio, microwave).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that includes an integration of one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state drives), and the like.
  • All or part of the processes in the foregoing method embodiments can be completed by a computer program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium.
  • When the program is executed, it may include the processes of the foregoing method embodiments.
  • The aforementioned storage medium includes any medium that can store program code, such as read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)

Abstract

A voice interaction processing method, an electronic device, a cloud server, and a computer-readable medium, relating to a natural language processing technology in the field of artificial intelligence, especially a multi-round dialogue processing technology. The method comprises: an electronic device (100) receives a first voice signal (S101); when the electronic device (100) establishes a connection with a cloud server (200), the electronic device (100) uploads the first voice signal to the cloud server (200) (S102); the cloud server (200) recognizes the first voice signal to obtain a corresponding intent and one or more slot information corresponding to the intent, and determines first voice response content on the basis of the intent and one or more slot information (S103); the cloud server (200) sends the first voice response content, intent, and one or more slot information to the electronic device (100) (S104); the electronic device (100) outputs the first voice response content (S105), and then receives a second voice signal (S106); in the case of poor quality of communication between the electronic device (100) and the cloud server (200), the electronic device (100) recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation on the basis of the intent, one or more slot information, and semantic information (S107); and the electronic device (100) performs the first operation (S108), such that when processing of a voice service is switched from the cloud server to the electronic device due to a network connection failure during the voice service of a multi-round dialogue, the electronic device can also continue to execute the original voice service on the basis of the context of the voice dialogue and a next voice signal received, thereby solving the problem of interruption of voice services of multi-round dialogues.

Description

A voice interaction processing method and related apparatus
This application claims priority to Chinese Patent Application No. 202011636583.9, filed with the China Patent Office on December 31, 2020 and entitled "A voice interaction processing method and related apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a voice interaction processing method and related apparatus.
Background
With the gradual development of voice interaction technology, more and more smart devices have a voice interaction function. Voice interaction means that a user inputs voice or text and obtains a voice or text response. For example, the user says "How is the weather today", and the smart device answers by voice "Sunny, 25 to 29 degrees".
Current voice interaction systems require network support. In some situations (such as a network interruption), the original voice interaction service cannot continue, which affects the user experience.
Summary
Embodiments of the present application provide a voice interaction processing method and related apparatus to solve the problem of voice service interruption in multi-round dialogues and to improve voice service processing capability.
In a first aspect, the present application provides a voice interaction processing method applied to an electronic device, including: the electronic device receives an input first voice signal; when the electronic device has established a connection with a cloud server, the electronic device uploads the first voice signal to the cloud server; the electronic device receives first voice reply content, an intent, and one or more slot information corresponding to the intent sent by the cloud server, where the intent and the one or more slot information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more slot information; after outputting the first voice reply content, the electronic device receives a second voice signal; when the communication quality between the electronic device and the cloud server is poor, the electronic device recognizes the second voice signal to obtain corresponding semantic information, and determines a first operation based on the intent, the one or more slot information, and the semantic information; and the electronic device performs the first operation.
In the embodiments of the present application, a voice service can be processed either by the cloud server or by the electronic device. When the cloud server processes the voice data, it sends a corresponding instruction to the electronic device, instructing the electronic device to perform the corresponding action, and synchronously sends the context of the voice dialogue (the intent and slot information) to the electronic device. In this way, if the network is interrupted during the voice service of a multi-round dialogue, so that processing of the voice service switches from the cloud server to the electronic device, the electronic device can continue the original voice service based on the context of the voice dialogue and the next received voice signal. This solves the problem of interrupted voice services in multi-round dialogues and improves voice service processing capability.
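The cloud-to-device switch described above can be illustrated with a minimal Python sketch. It is not the implementation from the application: `DialogContext`, `FakeCloud`, `process_locally`, and `Device` are hypothetical names, the voice signal is assumed to be already transcribed to text, and the "semantic understanding" is a toy string rule for the call example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DialogContext:
    """Dialogue context delivered by the cloud: intent plus slots (None = missing)."""
    intent: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)


class CloudUnavailable(Exception):
    """Hypothetical error standing in for poor communication quality."""


Reply = Tuple[str, Optional[DialogContext]]


class FakeCloud:
    """Stand-in for the cloud server; `online` simulates the network state."""

    def __init__(self) -> None:
        self.online = True

    def process(self, text: str) -> Reply:
        if not self.online:
            raise CloudUnavailable("no reply from cloud server")
        # Hard-coded first round of the call example: the callee slot is missing.
        return "Who do you want to call?", DialogContext("call", {"callee": None})


def process_locally(text: str, context: Optional[DialogContext]) -> Reply:
    """Continue the dialogue on the device using the saved context."""
    if context is None:
        return "Sorry, this request needs a network connection.", None
    prefix = "call "
    name = text[len(prefix):] if text.startswith(prefix) else text
    context.slots["callee"] = name.strip()
    # The intent is now fully specified, so the context is cleared.
    return f"Calling {context.slots['callee']}", None


class Device:
    def __init__(self, cloud: FakeCloud) -> None:
        self.cloud = cloud
        self.context: Optional[DialogContext] = None

    def handle(self, text: str) -> str:
        try:
            reply, context = self.cloud.process(text)
        except CloudUnavailable:
            # Fall back to on-device processing with the saved context.
            reply, context = process_locally(text, self.context)
        self.context = context  # save (or clear) the dialogue context
        return reply
```

In this sketch, a first round handled by the cloud leaves the context on the device, so a second round can be completed locally after `cloud.online` is set to `False`.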
In a possible implementation, the electronic device determining the first operation based on the intent, the one or more slot information, and the semantic information includes: the electronic device identifies that the semantic information matches one of the missing slots in the one or more slot information, and fills in the semantic information as the value of that slot; the electronic device determines the first operation based on the intent and the filled one or more slot information. This describes how the electronic device handles the original voice service based on the second voice signal: since the electronic device has obtained the intent and slot information corresponding to the first voice signal, it can use that intent and slot information to perform slot filling on the received second voice signal, and can thereby continue processing the original voice service.
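As a rough illustration of this slot-filling step, the sketch below uses the text-message example. The representation is an assumption: slots are a dictionary in which `None` marks a missing slot, the semantic information is a dictionary keyed by slot name, and `PROMPTS`, `fill_slots`, and `determine_operation` are hypothetical names.

```python
from typing import Dict, Optional

Slots = Dict[str, Optional[str]]

# Hypothetical prompts used to ask about each missing slot.
PROMPTS = {
    "recipient": "Who do you want to text?",
    "content": "What do you want to send?",
}


def fill_slots(slots: Slots, semantic_info: Dict[str, str]) -> Slots:
    """Copy values from the semantic information into matching slots that are still missing."""
    filled = dict(slots)  # leave the saved context untouched
    for name, value in semantic_info.items():
        if name in filled and filled[name] is None:
            filled[name] = value
    return filled


def determine_operation(intent: str, slots: Slots) -> str:
    """Ask about the first missing slot, or execute the intent when none is missing."""
    for name, value in slots.items():
        if value is None:
            return PROMPTS[name]
    return f"execute {intent}: {slots}"
```

With the saved slots `{"recipient": None, "content": None}` and the semantic information `{"recipient": "Xiao Ming"}`, the next operation in this sketch is to ask for the content of the message.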
In a possible implementation, the first operation includes one or more of the following: playing second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
In a possible implementation, the method further includes: the electronic device receives a first instruction sent by the cloud server; the electronic device displays the text content of the first voice reply content based on the first instruction, and/or jumps to a corresponding interface.
In a possible implementation, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, no reply data is received from the cloud server within a preset time. This describes when poor communication quality can occur: the communication quality between the electronic device and the cloud server may be poor when the electronic device uploads the second voice signal, or when the cloud server delivers the voice reply content for the second voice signal.
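The two conditions above (a failed upload, or no reply within a preset time) could be checked along the following lines. This is a sketch under assumptions: `try_cloud` and the three stub upload functions are hypothetical, a failed upload is modeled as a built-in `ConnectionError`, and the preset time is enforced with a worker thread from the standard `concurrent.futures` module.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional


def try_cloud(upload: Callable[[bytes], str], signal: bytes,
              preset_time_s: float) -> Optional[str]:
    """Return the cloud reply, or None if communication quality is judged poor."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(upload, signal)
    try:
        return future.result(timeout=preset_time_s)
    except ConnectionError:
        return None  # case 1: the upload itself failed
    except FutureTimeout:
        return None  # case 2: no reply within the preset time
    finally:
        executor.shutdown(wait=False)  # do not block on a slow request


# Hypothetical stand-ins for the real upload call.
def good_upload(signal: bytes) -> str:
    return "reply"


def failing_upload(signal: bytes) -> str:
    raise ConnectionError("upload failed")


def slow_upload(signal: bytes) -> str:
    time.sleep(0.3)
    return "late reply"
```

A `None` result is the point at which a device would switch to on-device processing with the saved dialogue context.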
In a possible implementation, the electronic device receiving the first voice signal includes: the electronic device receives the first voice signal through a voice assistant application.
In a second aspect, the present application provides a voice interaction processing method applied to a cloud server, including: the cloud server receives a first voice signal uploaded by an electronic device; the cloud server recognizes the first voice signal to obtain a corresponding intent and one or more slot information corresponding to the intent, and determines first voice reply content based on the intent and the one or more slot information; the cloud server sends the first voice reply content, the intent, and the one or more slot information to the electronic device.
In the embodiments of the present application, when the cloud server sends an instruction to the electronic device instructing it to perform the corresponding action, the cloud server synchronously sends the context of the voice dialogue (the intent and slot information) to the electronic device. In this way, if the network is interrupted during the voice service of a multi-round dialogue, so that processing of the voice service switches from the cloud server to the electronic device, the electronic device can continue the original voice service based on the context of the voice dialogue and the next received voice signal. This solves the problem of interrupted voice services in multi-round dialogues and improves voice service processing capability.
In a possible implementation, the cloud server sending the first voice reply content, the intent, and the one or more slot information to the electronic device includes: when at least one of the one or more slot information is missing, the cloud server sends the first voice reply content, the intent, and the one or more slot information to the electronic device. This describes one case in which the cloud server sends the intent and slot information to the electronic device: when slot information is missing, the current voice service is judged to be a multi-round dialogue service, and only then does the cloud server send the intent and slot information to the electronic device. If no slot information is missing, the voice service can be completed in a single round and there is no need to obtain a next voice signal. Deciding whether to deliver the intent and slot information based on whether slot information is missing saves resources.
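The cloud-side decision described above can be sketched as follows. The message layout is an assumption made for illustration: `build_response` is a hypothetical helper, slots are a dictionary in which `None` marks a missing slot, and the context is attached only when at least one slot is missing.

```python
from typing import Any, Dict, Optional


def build_response(reply: str, intent: str,
                   slots: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Build the message the cloud server sends to the device for one round."""
    message: Dict[str, Any] = {"reply": reply}
    if any(value is None for value in slots.values()):
        # Multi-round dialogue: the device may need the context to continue
        # the service if the network fails before the next round.
        message["context"] = {"intent": intent, "slots": dict(slots)}
    return message
```

For a single-round request in which every slot is already filled, the message carries only the reply, so no context is transmitted or stored.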
In a third aspect, the present application provides a voice interaction processing system including an electronic device and a cloud server, wherein:
the electronic device is configured to receive a first voice signal;
the electronic device is further configured to upload the first voice signal to the cloud server when the electronic device has established a connection with the cloud server;
the cloud server is configured to recognize the first voice signal to obtain a corresponding intent and one or more slot information corresponding to the intent, and to determine first voice reply content based on the intent and the one or more slot information;
the cloud server is further configured to send the first voice reply content, the intent, and the one or more slot information to the electronic device;
the electronic device is further configured to receive a second voice signal after outputting the first voice reply content;
the electronic device is further configured to, when the communication quality between the electronic device and the cloud server is poor, recognize the second voice signal to obtain corresponding semantic information, and determine a first operation based on the intent, the one or more slot information, and the semantic information;
the electronic device is further configured to perform the first operation.
In the embodiments of the present application, a voice service can be processed either by the cloud server or by the electronic device. When the cloud server processes the voice data, it sends a corresponding instruction to the electronic device, instructing the electronic device to perform the corresponding action, and synchronously sends the context of the voice dialogue (the intent and slot information) to the electronic device. In this way, if the network is interrupted during the voice service of a multi-round dialogue, so that processing of the voice service switches from the cloud server to the electronic device, the electronic device can continue the original voice service based on the context of the voice dialogue and the next received voice signal. This solves the problem of interrupted voice services in multi-round dialogues and improves voice service processing capability.
在一种可能的实现方式中,电子设备还用于,识别出语义信息和一个或多个槽位信息中其中一个缺失的槽位匹配,将语义信息填充为该槽位的值;电子设备,还用于基于意图和填充后的一个或多个槽位信息,确定第一操作。这里,具体描述了电子设备基于第二语音信号处理原语音业务的过程,由于电子设备获取了第一语音信号对应的意图和槽位信息,则能够继续基于该意图和槽位信息,对接收到的第二语音信号进行填槽处理,实现继续处理原语音业务的能力。In a possible implementation manner, the electronic device is further configured to identify that the semantic information matches one of the missing slots in the one or more slot information, and fill the semantic information as the value of the slot; the electronic device, Also used to determine the first operation based on the intent and the filled one or more slot information. Here, the process of processing the original voice service by the electronic device based on the second voice signal is described in detail. Since the electronic device has obtained the intent and slot information corresponding to the first voice signal, it can continue to receive the corresponding information based on the intent and the slot information. The slot-filling process is performed on the second voice signal of the device, so as to realize the ability to continue processing the original voice service.
在一种可能的实现方式中,第一操作包括以下一项或多项:播放第二语音回复内容;显示第二语音回复内容的文字内容;跳转到相应的界面。In a possible implementation manner, the first operation includes one or more of the following: playing the second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface.
在一种可能的实现方式中,电子设备,还用于接收云服务器发送的第一指令;电子设备,还用于基于第一指令显示第一语音回复内容的文字内容,和/或跳转到相应的界面。In a possible implementation manner, the electronic device is further configured to receive the first instruction sent by the cloud server; the electronic device is further configured to display the text content of the first voice reply content based on the first instruction, and/or jump to corresponding interface.
在一种可能的实现方式中,电子设备与云服务器的通信质量不佳,包括:电子设备上传第二语音信号到云服务器失败;或者电子设备将第一语音信号上传到云服务器后,在预设时间内没有接收到云服务器的回复数据。这里说明了通信质量不佳发生的时机,可以是在电子设备上传第二语音信号时电子设备与云服务器的通信质量不佳,也可以是云服务器下发针对第二语音信号的语音回复内容时电子设备与云服务器的通信质量不佳。In a possible implementation manner, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or after the electronic device uploads the first voice signal to the cloud server, the The reply data from the cloud server has not been received within the set time. The timing of poor communication quality is described here. It can be when the electronic device uploads the second voice signal and the communication quality between the electronic device and the cloud server is poor, or when the cloud server delivers the voice reply content for the second voice signal. The communication quality between the electronic device and the cloud server is poor.
In a possible implementation, the electronic device is further configured to receive the first voice signal through a voice assistant application.
In a possible implementation, the cloud server is further configured to send the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device when at least one of the one or more pieces of slot information is missing. This describes one case in which the cloud server sends the intent and slot information to the electronic device: when slot information is missing, the current voice service is determined to be a multi-round dialogue service, and only then does the cloud server send the intent and slot information to the electronic device; if no slot information is missing, the voice service can be completed in a single round, and no further voice signal needs to be obtained. Deciding whether to deliver the intent and slot information based on whether any slot information is missing saves resources.
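The server-side decision described above can be sketched as follows. The function name `build_cloud_response` and the dictionary layout of the response are hypothetical; the application does not specify a message format.

```python
from typing import Dict, Optional


def build_cloud_response(reply: str, intent: str,
                         slots: Dict[str, Optional[str]]) -> dict:
    """Attach the intent and slots only when at least one slot is missing
    (None), i.e. when the service is a multi-round dialogue."""
    response = {"reply": reply}
    if any(value is None for value in slots.values()):
        response["intent"] = intent
        response["slots"] = dict(slots)
    return response


multi_round = build_cloud_response("Who do you want to call?",
                                   "make_call", {"contact": None})
single_round = build_cloud_response("Calling Alice.",
                                    "make_call", {"contact": "Alice"})
```

In the single-round case the response carries only the reply content, so nothing extra is transmitted to or cached on the electronic device.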
In a fourth aspect, the present application provides an electronic device, including one or more processors and one or more memories; the one or more memories are coupled to the one or more processors; the one or more memories are configured to store computer program code, and the computer program code includes computer instructions; when the computer instructions are run on the processor, the electronic device is caused to perform the voice interaction processing method in any possible implementation of the first aspect.
In a fifth aspect, the present application provides a cloud server, including one or more processors and one or more memories; the one or more memories are coupled to the one or more processors; the one or more memories are configured to store computer program code, and the computer program code includes computer instructions; when the computer instructions are run on the processor, the electronic device is caused to perform the voice interaction processing method in any possible implementation of the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer storage medium including computer instructions; when the computer instructions are run on an electronic device, a communication apparatus is caused to perform the voice interaction processing method in any possible implementation of any of the foregoing aspects.
In a seventh aspect, an embodiment of the present application provides a computer program product; when the computer program product is run on a computer, the computer is caused to perform the voice interaction processing method in any possible implementation of any of the foregoing aspects.
Description of drawings
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a software structure of an electronic device provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the principle of a voice interaction processing method provided by an embodiment of the present application;
FIG. 5A to FIG. 5B are schematic diagrams of the principle of another voice interaction processing method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the principle of a phone call scenario provided by an embodiment of the present application;
FIG. 7A to FIG. 7B are schematic diagrams of scenarios of a voice interaction processing method provided by an embodiment;
FIG. 8A to FIG. 8D are schematic diagrams of a group of application interfaces provided by an embodiment of the present application;
FIG. 9 is a schematic flowchart of a voice interaction processing method provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" in the text merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the embodiments of the present application, "a plurality of" means two or more.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only, and should not be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, unless otherwise specified, "a plurality of" means two or more. The orientations or positional relationships indicated by terms such as "middle", "left", "right", "upper", and "lower" are based on the orientations or positional relationships shown in the accompanying drawings; they are used only to facilitate and simplify the description of the present application, and do not indicate or imply that the referenced apparatus or element must have a particular orientation or be constructed and operated in a particular orientation. They therefore should not be construed as limiting the present application.
In an embodiment of the present application, FIG. 1 is a schematic diagram of a scenario of a voice interaction system 10 according to an embodiment of the present invention. As shown in FIG. 1, the system 10 includes an electronic device 100 and a cloud server 200. It should be noted that the system 10 shown in FIG. 1 is only an example. A person skilled in the art can understand that, in practical applications, the system 10 usually includes a plurality of electronic devices 100 and cloud servers 200; the present application does not limit the number of electronic devices 100 or cloud servers 200 included in the system 10.
The electronic device 100 is a smart device with a voice interaction function. The electronic device 100 can receive a voice instruction issued by a user and return voice or non-voice information to the user. In the embodiments of the present application, the electronic device 100 may be a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA, also known as a palmtop computer), a virtual reality device, a portable Internet device, a data storage device, a camera, a wearable device (for example, a wireless headset, a smart watch, a smart band, smart glasses, a head-mounted display (HMD), electronic clothing, an electronic bracelet, an electronic necklace, an electronic accessory, an electronic tattoo, or a smart mirror), or a smart home device (for example, a smart speaker, a smart refrigerator, a smart desk lamp, an electric light, a smart TV, a smart microwave oven, a smart fan, an air conditioner, a smart robot, or smart curtains), and so on. One application scenario involved in the embodiments of the present application is a home scenario: the electronic device 100 is placed in the user's home, and the user can issue voice instructions to the electronic device 100 to perform certain functions, such as going online, playing songs on demand, shopping, checking the weather forecast, and controlling other smart home devices in the home.
The cloud server 200 communicates with the electronic device 100 through a network, and may be, for example, a cloud server physically located at one or more sites. The cloud server 200 provides a recognition service for the voice data received on the electronic device 100 to obtain a text representation of the voice data input by the user; the cloud server 200 also obtains a representation of the user's intent based on the text representation, generates a response instruction, and returns it to the electronic device 100. The electronic device 100 performs a corresponding action according to the response instruction to provide the user with a corresponding service, such as setting an alarm, making a phone call, sending an email, broadcasting information, or playing a song or video. The electronic device 100 may also output a corresponding voice response to the user according to the response instruction, or display corresponding text content, which is not limited in the embodiments of the present application.
The electronic device 100 involved in the embodiments of the present application is introduced first.
Refer to FIG. 2, which shows a schematic structural diagram of an exemplary electronic device 100 provided by an embodiment of the present application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine some components, split some components, or use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices, or may be integrated into one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller can generate operation control signals according to instruction operation codes and timing signals, to control instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory can hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.
The I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, a charger, a flash, the camera 193, and the like through different I2C bus interfaces. For example, the processor 110 may be coupled to the touch sensor 180K through an I2C interface, so that the processor 110 and the touch sensor 180K communicate through the I2C bus interface to implement the touch function of the electronic device 100.
The I2S interface can be used for audio communication. In some embodiments, the processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 can transmit audio signals to the wireless communication module 160 through the I2S interface, to implement the function of answering calls through a Bluetooth headset.
The PCM interface can also be used for audio communication, to sample, quantize, and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 can also transmit audio signals to the wireless communication module 160 through the PCM interface, to implement the function of answering calls through a Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
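As a generic illustration of the sampling, quantizing, and encoding steps named above, pulse code modulation can be sketched in a few lines. This is a textbook-style sketch under stated assumptions (8 kHz sampling rate, 16-bit signed little-endian codes, a hypothetical `pcm_encode` helper); it does not describe the PCM interface hardware of the electronic device 100.

```python
import struct
from typing import Callable


def pcm_encode(signal: Callable[[float], float], num_samples: int,
               sample_rate: int = 8000) -> bytes:
    """Sample signal(t), quantize each sample to a 16-bit signed code,
    and encode the codes as little-endian bytes."""
    max_code = 2 ** 15 - 1  # largest 16-bit signed value, 32767
    codes = []
    for i in range(num_samples):
        x = signal(i / sample_rate)       # sampling at t = i / sample_rate
        x = max(-1.0, min(1.0, x))        # clip to the nominal range [-1, 1]
        codes.append(round(x * max_code))  # uniform quantization
    return struct.pack(f"<{num_samples}h", *codes)  # binary encoding
```

Each sample occupies two bytes in the result, so `num_samples` samples yield `2 * num_samples` bytes.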
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial and parallel form. In some embodiments, the UART interface is typically used to connect the processor 110 and the wireless communication module 160. For example, the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function. In some embodiments, the audio module 170 can transmit audio signals to the wireless communication module 160 through the UART interface, to implement the function of playing music through a Bluetooth headset.
The MIPI interface can be used to connect the processor 110 to peripheral devices such as the display screen 194 and the camera 193. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 110 and the camera 193 communicate through the CSI interface to implement the photographing function of the electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface can be configured by software. The GPIO interface can be configured to carry a control signal or a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 to the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, or the like.
The USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and peripheral devices. It can also be used to connect a headset and play audio through the headset. The interface can also be used to connect other electronic devices, such as AR devices.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic, and do not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger.
The power management module 141 is configured to connect the battery 142, the charging management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.
The wireless communication function of the electronic device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover one or more communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 can provide wireless communication solutions applied on the electronic device 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic waves radiated through the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 may be provided in the same device as at least some modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate a low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate a received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, and the like), or displays images or videos through the display screen 194. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 110, and provided in the same device as the mobile communication module 150 or other functional modules.
The wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including UWB, wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 can also receive a signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves radiated through the antenna 2.
In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time division-synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶 显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode or an active-matrix organic light-emitting diode (active-matrix organic light). emitting diode, AMOLED), flexible light-emitting diode (flex light-emitting diode, FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (quantum dot light emitting diodes, QLED) and so on. In some embodiments, the electronic device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
在本申请的一些实施例中,显示屏194中显示有系统当前输出的界面内容。例如,界面内容为即时通讯应用提供的界面。In some embodiments of the present application, the display screen 194 displays the interface content currently output by the system. For example, the interface content is an interface provided by an instant messaging application.
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。The electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter opens, light passes through the lens to the photosensitive element of the camera, the optical signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP, which converts it into an image visible to the naked eye. The ISP can also apply algorithmic optimization to the noise, brightness, and skin tone of the image, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video.
The digital signal processor is used to process digital signals. In addition to digital image signals, it can process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy.
The video codec is used to compress or decompress digital video. The electronic device 100 may support one or more video codecs, so that it can play or record video in multiple encoding formats, for example moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the way signals are passed between neurons in the human brain, it processes input information quickly and can also learn continuously. The NPU enables applications such as intelligent cognition on the electronic device 100, for example image recognition, face recognition, speech recognition, and text understanding.
The internal memory 121 may include one or more random access memories (RAM) and one or more non-volatile memories (NVM).
The random access memory may include static random-access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM; for example, the fifth-generation DDR SDRAM is generally called DDR5 SDRAM), and the like.
The non-volatile memory may include magnetic disk storage devices and flash memory.
Classified by operating principle, flash memory may include NOR FLASH, NAND FLASH, 3D NAND FLASH, and the like. Classified by the number of potential levels of the storage cell, it may include single-level cell (SLC), multi-level cell (MLC), triple-level cell (TLC), quad-level cell (QLC), and the like. Classified by storage specification, it may include universal flash storage (UFS), embedded multimedia card (eMMC), and the like.
The random access memory can be read and written directly by the processor 110. It can be used to store executable programs (for example, machine instructions) of the operating system or of other running programs, and can also be used to store user and application data.
The non-volatile memory can also store executable programs as well as user and application data, which can be loaded into the random access memory in advance so that the processor 110 can read and write them directly.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example to save music, video, and other files on the external memory card.
The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and to convert analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 100 can play music or a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 100 answers a call or plays a voice message, the user can listen by holding the receiver 170B close to the ear.
The microphone 170C, also called a "mic" or "sound transducer", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement noise reduction in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify the sound source, implement directional recording, and so on.
The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 180A is used to sense pressure signals and can convert them into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. The gyroscope sensor 180B may be used to determine the motion attitude of the electronic device 100. The air pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a Hall sensor, and the electronic device 100 can use the magnetic sensor 180D to detect the opening and closing of a flip cover. The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in each direction (generally three axes). When the electronic device 100 is stationary, it can detect the magnitude and direction of gravity. It can also be used to recognize the posture of the electronic device, for applications such as landscape/portrait switching and pedometers. The distance sensor 180F is used to measure distance; the electronic device 100 can measure distance by infrared or laser. The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The ambient light sensor 180L is used to sense ambient light brightness. The electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the sensed ambient light brightness. The ambient light sensor 180L can also be used to adjust white balance automatically when taking pictures, and can cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket, so as to prevent accidental touches. The fingerprint sensor 180H is used to collect fingerprints. The electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint-triggered photographing, fingerprint-based call answering, and the like. The temperature sensor 180J is used to detect temperature.
The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be provided on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touchscreen, also called a "touch-control screen". The touch sensor 180K is used to detect a touch operation acting on or near it, where a touch operation is an operation in which the user's hand, elbow, stylus, or the like contacts the display screen 194. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be provided on the surface of the electronic device 100 at a position different from that of the display screen 194.
The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the bone that vibrates when the human vocal part produces sound. The bone conduction sensor 180M can also be placed in contact with the human pulse to receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M can also be provided in an earphone to form a bone conduction earphone. The audio module 170 can parse a voice signal from the vibration signal of the vocal-part bone acquired by the bone conduction sensor 180M, so as to implement the voice function. The application processor can parse heart rate information from the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement the heart rate detection function.
The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key input and generate key signal input related to user settings and function control of the electronic device 100.
The motor 191 can generate vibration alerts. The motor 191 can be used for incoming-call vibration alerts and for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures or playing audio) can correspond to different vibration feedback effects. For touch operations acting on different areas of the display screen 194, the motor 191 can also produce different vibration feedback effects. Different application scenarios (for example time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also be customized.
The indicator 192 may be an indicator light, which can be used to indicate the charging state and battery level changes, and can also be used to indicate messages, missed calls, notifications, and the like.
The SIM card interface 195 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the electronic device 100 by inserting it into or removing it from the SIM card interface 195.
FIG. 3 shows a block diagram of the software structure of the electronic device 100 according to an embodiment of the present application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages. The application packages may include, for example, applications such as camera, gallery, calendar, phone, maps, navigation, WLAN, Bluetooth, music, video, games, shopping, travel, and instant messaging (such as short messages). In addition, the application packages may also include system applications such as the home screen (that is, the desktop), the minus-one screen, the control center, and the notification center.
As shown in FIG. 3, the application layer in the embodiments of the present application includes a voice assistant and a voice processing module.
The voice processing module provides a voice processing capability that any application can invoke. Taking the voice assistant application as an example, the electronic device 100 receives a voice signal through the voice assistant application, and the voice assistant application invokes the voice processing module to process the voice signal. The voice processing module includes the capabilities of speech recognition (automatic speech recognition, ASR), semantic understanding (natural language understanding, NLU), dialog management (DM), natural language generation (NLG), and speech synthesis (text to speech, TTS), among others. These are described as follows.
The speech recognition module is used to recognize the voice signal and obtain the text representation information of the voice signal. Specifically, the speech recognition module can first represent the voice signal as text data and then perform word segmentation on the text data to obtain the text representation information of the voice signal; that is, it converts the words in the voice signal into input readable by the electronic device 100, including for example binary codes and character sequences. Typical speech recognition methods include, for example, methods based on vocal tract models and speech knowledge, template matching methods (in which the feature vector of the input voice signal is compared for similarity with each template in a template library in turn, and the template with the highest similarity is output as the recognition result), and methods using neural networks. The embodiments of the present application do not limit which speech recognition method is used.
The semantic understanding module is used to convert the text representation information of the voice signal into semantic information that the electronic device 100 can understand. Semantic information includes entities, triples, intents, events, and so on. With this information, the electronic device 100 can understand the user's language and determine what the user wants to do.
The dialog management module is used to determine, based on the semantic information, the next action to be performed by the electronic device 100. The action includes one or more of the following: playing voice reply content (for example, providing a result, asking about a specific constraint, or clarifying or confirming a requirement); displaying the text of the voice reply content; jumping to the corresponding interface; and so on.
Specifically, the dialog management module determines the intent expressed in the semantic information and then fills the slots corresponding to that intent according to the semantic information. The intent is what the user wants to do, and the slots corresponding to the intent are the pieces of information needed to complete it. For example, if the intent is "make a call", the slot corresponding to "make a call" is whom to call, that is, the call recipient. As another example, if the intent is "send a text message", there are two corresponding slots: the recipient of the message and the content of the message.
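The intent and slot relationship described above can be sketched as follows. This is a minimal illustration, not the method of the present application: the intent names, slot names, and the `DialogState` class are hypothetical, and the semantic information is assumed to arrive as a plain dictionary.

```python
from dataclasses import dataclass, field

# Hypothetical intent schema: intent name -> slots needed to complete it.
REQUIRED_SLOTS = {
    "make_call": ["recipient"],
    "send_sms": ["recipient", "content"],
}


@dataclass
class DialogState:
    intent: str
    slots: dict = field(default_factory=dict)

    def fill(self, semantic_info: dict) -> None:
        """Copy every required slot value that the semantic information provides."""
        for name in REQUIRED_SLOTS[self.intent]:
            if name in semantic_info:
                self.slots[name] = semantic_info[name]

    def missing_slots(self) -> list:
        """Return the required slots that are still unfilled, in schema order."""
        return [s for s in REQUIRED_SLOTS[self.intent] if s not in self.slots]


state = DialogState(intent="send_sms")
state.fill({"recipient": "Alice"})
print(state.missing_slots())  # ['content']
```

A dialog manager built this way could ask the user for each entry returned by `missing_slots()` before executing the intent.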
In essence, dialog management is a decision-making process. During the voice interaction, the dialog management module continuously decides the next action to perform according to the current state, thereby helping the user complete the task of obtaining information or a service. If the action requires voice interaction with the user, the natural language generation module is triggered to generate language text that the user can understand; finally, the generated language text is played to the user by the speech synthesis module.
The natural language generation module is used to convert data in a non-linguistic format into text in a linguistic format that the user can understand. The natural language generation module determines which information should be included in the text being constructed, organizes it in a reasonable order, and merges multiple pieces of information into one sentence. It then selects connecting words and phrases to form a complete, well-structured sentence.
The speech synthesis module is used to convert the text produced by the natural language generation module into artificial speech by mechanical and electronic means.
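The order in which the five capabilities are chained can be sketched as below. This is only an illustrative skeleton under stated assumptions: each stage is passed in as a caller-supplied function, and the `needs_voice_reply` key is a hypothetical way for the dialog manager to signal that a spoken reply is required.

```python
from typing import Callable, Optional


def run_pipeline(
    signal: bytes,
    asr: Callable[[bytes], str],
    nlu: Callable[[str], dict],
    dm: Callable[[dict], dict],
    nlg: Callable[[dict], str],
    tts: Callable[[str], bytes],
) -> Optional[bytes]:
    """Chain ASR -> NLU -> DM, then NLG -> TTS only if a voice reply is needed."""
    text = asr(signal)        # voice signal -> text representation
    semantics = nlu(text)     # text -> semantic information (intent, entities, ...)
    action = dm(semantics)    # semantic information -> next action
    if not action.get("needs_voice_reply"):
        return None           # e.g. the action only jumps to an interface
    reply_text = nlg(action)  # action -> language text for the user
    return tts(reply_text)    # text -> synthesized speech
```

Passing the stages as parameters keeps the sketch independent of where each stage runs, on the electronic device 100 or on the cloud server 200.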
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 3, the application framework layer may include an input manager, a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, a display manager, an activity manager, and the like. For ease of description, FIG. 3 illustrates the application framework layer as including the input manager, the window manager, the content provider, the view system, and the activity manager. It should be noted that any two of the input manager, the window manager, the content provider, the view system, and the activity manager can call each other.
The input manager is used to receive instructions or requests reported by lower layers such as the kernel layer and the hardware abstraction layer.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, and so on. In the present application, the window manager is used to display a window including one or more shortcut controls when the electronic device 100 meets a preset trigger condition.
The activity manager is used to manage the activities running in the system, including process, application, service, and task information.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, the phone book, and the like.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may consist of one or more views. For example, a display interface that includes a short message notification icon may include a view for displaying text and a view for displaying pictures. In the present application, the view system is used to display a shortcut area on the display screen 103 when the electronic device 100 meets the preset trigger condition, where the shortcut area includes one or more shortcut controls added by the electronic device 100. The present application does not limit the position or layout of the shortcut area, or the icons, positions, layout, or functions of the controls in the shortcut area.
The display manager is used to transfer display content to the kernel layer.
The telephony manager is used to provide the communication functions of the electronic device 100, for example management of the call state (including connected, hung up, and so on).
The resource manager provides applications with various resources, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables applications to display notification information in the status bar. It can be used to convey informational messages that disappear automatically after a short time without user interaction. For example, the notification manager is used to announce that a download is complete or to deliver message reminders. A notification may also appear in the status bar at the top of the system in the form of a chart or scroll bar text, such as a notification from an application running in the background, or appear on the screen in the form of a dialog window. Examples include prompting text information in the status bar, playing a prompt sound, vibrating the electronic device, and flashing the indicator light.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: the functions that the Java language needs to call, and the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example a surface manager, media libraries, a three-dimensional graphics processing library (for example OpenGL ES), and a 2D graphics engine (for example SGL).
The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of many commonly used audio and video formats, as well as still image files. The media libraries can support a variety of audio, video, and image encoding formats, for example MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, a sensor driver, a touch chip driver, an input system, and the like. For ease of description, FIG. 3 illustrates the kernel layer as including the input system, the touch chip driver, the display driver, and the storage driver. The display driver and the storage driver may be provided together in a driver module.
It can be understood that the structure illustrated in the present application does not constitute a specific limitation on the electronic device 100. In other embodiments, the electronic device 100 may include more or fewer components than shown, combine some components, split some components, or use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The technical principles of the voice interaction system 10 of the embodiments of the present application are introduced below. FIG. 4 shows a voice interaction process in the voice interaction system 10, in which the cloud server 200 and the electronic device 100 communicate through a network.
First, the electronic device 100 detects a voice signal input and starts the voice interaction function. In some embodiments, the electronic device 100 may receive the voice signal through a voice assistant application (APP). The electronic device 100 detects whether the received voice signal contains a target object (for example, a preset wake-up word); if it does, the electronic device 100 enters the interactive state and starts the voice interaction function. The target object may be preset when the electronic device 100 leaves the factory, preset in the voice assistant application, or set by the user while using the electronic device 100. The present application does not limit the length or content of the target object.
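The wake-up check can be illustrated with a deliberately simplified sketch. It assumes the signal has already been transcribed to text, whereas a real detector typically works on the audio itself; the function name and the wake-up words shown are hypothetical.

```python
def contains_wake_word(transcript: str, wake_words: list) -> bool:
    """Return True if any configured wake-up word occurs in the transcript.

    Matching is case-insensitive and ignores differences in whitespace.
    """
    normalized = " ".join(transcript.lower().split())
    candidates = (" ".join(w.lower().split()) for w in wake_words if w.strip())
    return any(c in normalized for c in candidates)


# The wake-up words are configurable, matching the three setting options above.
if contains_wake_word("Hey  Assistant, call Mom", ["hey assistant"]):
    print("enter interactive state")
```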
The electronic device 100 controls the distribution of the received voice signal based on a preset rule, where the distribution paths include path 1 and path 2. Under the preset rule, when the network quality is good, the electronic device 100 uploads the received voice signal to the cloud server 200 for processing (path 1); good network quality means that the electronic device 100 and the cloud server 200 can transmit data to each other (both uplink and downlink). When the network quality is poor or the network is disconnected, the electronic device 100 processes the received voice signal locally (path 2); poor network quality or disconnection means that the electronic device 100 and the cloud server 200 cannot transmit data (uplink or downlink), or that the data transmission rate is below a threshold. The preset rule may also distribute the voice signal according to the intent recognized from it. Put simply, when the intent can be completed locally, for example making a call, sending a text message, or opening the gallery, the voice signal can be processed on the electronic device 100; when the intent requires the network, for example searching web pages or playing music online, the voice signal can be processed on the cloud server 200.
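A minimal sketch of such a distribution rule is shown below. The text presents the network-based rule and the intent-based rule as alternatives; combining them in one function, the threshold value, and the intent names are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Path(Enum):
    CLOUD = 1   # path 1: process on the cloud server 200
    DEVICE = 2  # path 2: process on the electronic device 100


# Hypothetical values: the source gives neither a threshold nor intent names.
MIN_RATE_KBPS = 64.0
LOCAL_INTENTS = {"make_call", "send_sms", "open_gallery"}


def choose_path(connected: bool, rate_kbps: float, intent: Optional[str] = None) -> Path:
    """Pick the processing path from network state and, if known, the intent."""
    # Poor or missing network: the signal must be handled locally.
    if not connected or rate_kbps < MIN_RATE_KBPS:
        return Path.DEVICE
    # Intents that need no network can stay on the device.
    if intent in LOCAL_INTENTS:
        return Path.DEVICE
    return Path.CLOUD
```

Checking the network first means a network-dependent intent still falls back to the device when the cloud server cannot be reached.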
Processing of the voice signal on the cloud server 200 and on the electronic device 100 is described separately below.
Path 1: the voice signal is processed on the cloud server 200.
Step ①: the electronic device 100 uploads the voice signal to the cloud server 200. The cloud server 200 receives the voice signal, recognizes it using automatic speech recognition (ASR), and converts it into text representation information; that is, the words in the voice signal are converted into input readable by the cloud server 200, such as binary codes or character sequences.
In some embodiments, the cloud server 200 may compare the feature vector of the input voice signal with each template in a template library in turn, take the template with the highest similarity as the recognition result, and output text data. It then performs word segmentation on the text data to obtain the text representation information of the voice signal. Optionally, the cloud server 200 may instead use a trained vocal tract model, neural network model, or the like to compute the text representation information corresponding to the voice signal.
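The template comparison described above can be sketched as follows. The source does not say which similarity measure is used; cosine similarity is assumed here purely for illustration, and the function names and the toy three-dimensional feature vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(feature, templates):
    """Return the text whose template vector is most similar to `feature`.

    `templates` maps recognized text to a reference feature vector.
    """
    return max(templates,
               key=lambda text: cosine_similarity(feature, templates[text]))


# Toy template library; real feature vectors would come from acoustic analysis.
TEMPLATES = {
    "open gallery": [1.0, 0.0, 0.2],
    "make a call": [0.0, 1.0, 0.1],
}
```

Calling `recognize([0.9, 0.1, 0.2], TEMPLATES)` picks "open gallery", because that template lies closest in direction to the input vector.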
It should be noted that, when the cloud server 200 recognizes the voice signal, it may also perform preprocessing operations on the signal, such as sampling, quantization, removal of voice data that contains no speech content (for example, silence), framing, and windowing.
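A minimal sketch of the framing, windowing, and silence-removal steps mentioned above is given below. The frame length, hop size, energy floor, and the choice of a Hamming window are assumptions made for illustration; the source lists the operations without parameters.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, stepping by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window coefficients of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]


def preprocess(samples, frame_len=4, hop=2, energy_floor=1e-3):
    """Frame the signal, drop near-silent frames, and window the rest."""
    window = hamming(frame_len)
    kept = []
    for frame in frame_signal(samples, frame_len, hop):
        energy = sum(x * x for x in frame) / frame_len
        if energy < energy_floor:
            continue  # treat as silence: no speech content
        kept.append([x * w for x, w in zip(frame, window)])
    return kept
```

The tiny default frame length only keeps the example readable; practical systems typically use frames of tens of milliseconds.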
Step ②: after speech recognition, the cloud server 200 uses natural language understanding (NLU) to convert the text representation information into semantic information that a machine can understand.
In some embodiments, semantic understanding can be summarized as the following steps. First, the cloud server 200 splits the text representation information obtained from speech recognition into a series of semantically and grammatically meaningful units; the word "token" is commonly used for such a unit. A common way to split text is word segmentation, in which the text is split at word granularity. Models used for word segmentation include first-order Markov models, hidden Markov models, conditional random fields, recurrent neural networks, and so on.
Then, from the token sequence, a text representation model such as a word vector space model or a distributed representation model produces a numerical vector or matrix, which is the numerical representation of the text. Next, classification algorithms, sequence labeling methods, and the like are applied to this numerical representation to extract the "key information" (that is, the semantic information), such as entities, triples, intents, and events. With this information, the cloud server 200 can understand the user's language and determine what the user wants to do.
Step ③: the cloud server 200 performs dialog management based on the semantic information. Dialog management is the process in which the cloud server 200 determines the next action to perform based on the semantic information. The action includes one or more of the following: playing voice reply content (for example, providing a result, asking about a specific constraint, or clarifying or confirming a request); displaying the text of the voice reply content; jumping to a corresponding interface; and so on.
In some embodiments, the cloud server 200 determines the intent expressed in the semantic information and then fills the slots corresponding to that intent from the semantic information. The intent is what the user wants to do; the slots are the pieces of information needed to complete that intent. One intent may correspond to one or more slots. If the semantic information is insufficient and one or more slots remain empty, the cloud server 200 determines that the next action is to ask a follow-up question about the missing slots. If no slot is missing, the user's intent is converted into an explicit instruction that directs the electronic device 100 to perform the corresponding action.
For example, the cloud server 200 obtains the semantic information of the voice signal "Please help me open the gallery" and determines from it that the user's intent is to open an object, so the slot corresponding to the intent is the object to open. The cloud server 200 fills the slot from the semantic information and determines that the object to open is the gallery. Dialog management then determines an explicit instruction based on the semantic information, namely the instruction to open the gallery.
The above is an example in which the cloud server 200 determines an operation instruction from the user's voice signal when no slot information is missing. When insufficient semantic information leaves one or more slots empty, the cloud server 200 needs to save the current intent and slot information and ask a follow-up question about the missing slots. This situation is usually called a multi-turn dialog.
For example, the cloud server 200 obtains the semantic information of the voice signal "I want to make a call" and determines from it that the user's intent is to make a call, so the slot corresponding to the intent is the call recipient. Because the semantic information is insufficient, this slot is empty. Dialog management therefore determines an explicit instruction based on the semantic information, namely to play voice reply content that asks about the missing slot.
Optionally, dialog management may determine a further explicit instruction based on the semantic information of the voice signal, namely to display the text of the voice reply content.
The next time the cloud server 200 receives a voice signal, it repeats steps ① and ② above and then, based on the saved intent and slot information, uses the semantic information of the new voice signal to fill the missing slots. If all slots are now filled, that is, no slot information is missing, the user's intent is converted into an explicit instruction that directs the electronic device 100 to perform the corresponding action. In the example above, the intent saved by the cloud server 200 is to make a call and the missing slot is the call recipient. When the cloud server 200 next receives the voice signal "Xiao Ming", it fills the "call recipient" slot from the semantic information of that signal and thereby determines the instruction: call Xiao Ming.
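The slot-filling logic of step ③, including the multi-turn case, can be sketched as below. The slot schema in `INTENT_SLOTS`, the `DialogContext` class, and the action dictionaries are hypothetical names introduced for illustration; the source describes the behavior but not a data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical schema: which slots each intent requires.
INTENT_SLOTS = {"open": ["target"], "call": ["callee"]}


@dataclass
class DialogContext:
    intent: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Names of slots that are still empty."""
        return [name for name, value in self.slots.items() if value is None]


def new_context(intent: str, found: Dict[str, str]) -> DialogContext:
    """Create a context, filling slots from values found by NLU."""
    return DialogContext(
        intent, {name: found.get(name) for name in INTENT_SLOTS[intent]})


def next_action(ctx: DialogContext) -> dict:
    """Ask about a missing slot, or emit an explicit instruction."""
    missing = ctx.missing()
    if missing:
        return {"action": "ask", "slot": missing[0]}
    return {"action": "execute", "intent": ctx.intent,
            "slots": dict(ctx.slots)}


def fill_next(ctx: DialogContext, value: str) -> DialogContext:
    """Fill the first missing slot with a value from a follow-up utterance."""
    missing = ctx.missing()
    if missing:
        ctx.slots[missing[0]] = value
    return ctx
```

With this sketch, "open the gallery" yields an `execute` action immediately, while "I want to make a call" yields an `ask` action for `callee` until a follow-up such as "Xiao Ming" fills it.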
Step ④: after the cloud server 200 determines the next action, if that action requires voice interaction with the user, such as outputting voice reply content, the cloud server 200 may use natural language generation to produce language text the user can understand and then apply speech synthesis to that text to generate voice data.
In some embodiments, the cloud server 200 determines which pieces of information should appear in the language text being constructed, arranges them in a sensible order, and merges multiple pieces of information into one sentence. It then selects connectives and phrases to turn this information into a complete, well-formed sentence.
Step ⑤: the cloud server 200 sends an instruction to the electronic device 100, indicating the action the electronic device 100 is to perform.
In some embodiments, following steps ①②③④ above, the cloud server 200 sends an instruction carrying the voice data (the voice reply content) to the electronic device 100, instructing the electronic device 100 to output the voice reply content.
Optionally, the cloud server 200 sends an instruction carrying text data (the text of the voice reply content) to the electronic device 100, instructing the electronic device 100 to display the text data.
In some embodiments, step ④ is optional: if the next action determined by the cloud server 200 does not require voice reply content, step ④ is skipped. In that case, following steps ①②③ above, the cloud server 200 sends an instruction to the electronic device 100 instructing it to jump to the corresponding interface.
Path 2: the voice signal is processed on the electronic device 100.
The electronic device 100 receives the voice signal, recognizes it using speech recognition, and converts it into text representation information. It then uses semantic understanding to convert the text representation information into machine-understandable semantic information, and determines the next action based on that semantic information. If the action requires voice interaction with the user, such as outputting voice reply content, the electronic device 100 may use natural language generation to produce language text the user can understand, apply speech synthesis to that text to generate voice data, and output the voice data. If the next action does not require voice reply content, speech synthesis is not needed and the electronic device 100 performs the interface jump.
It should be noted that, based on the same inventive concept, speech recognition, semantic understanding, dialog management, and speech synthesis in path 2 of the embodiment shown in FIG. 4 work on principles similar to those in path 1. For how the electronic device 100 implements them in path 2, refer to the descriptions of steps ①②③④⑤ performed by the cloud server 200 in path 1; they are not repeated here.
In summary, the embodiment shown in FIG. 4 explains how the voice interaction system works. In some cases, the electronic device 100 dispatches the voice service according to network conditions: if the network is in good condition, the voice signal is uploaded to the cloud server 200 for processing (path 1 above); if the network is disconnected or its quality is poor, the signal is processed on the electronic device 100 (path 2 above). In this case, if a network problem forces the voice service to switch paths while it is being processed, the original voice service cannot continue, which degrades the user experience.
For example, the cloud server 200 receives a first voice signal. If the semantic information of that signal is insufficient and one or more slots are missing, the cloud server 200 needs to save the current intent and slot information and ask a follow-up question about the missing slots. For a voice service that requires such a multi-turn dialog, if the network is interrupted before the cloud server 200 receives the next voice signal, the cloud server 200 cannot receive it, and the electronic device 100 dispatches the next voice signal to itself for processing. Based on the semantic information of that next voice signal alone, the electronic device 100 cannot continue the original voice service; the service is interrupted, which degrades the user experience.
In combination with the voice interaction system 10 of the embodiments of this application, the embodiments further provide a voice interaction processing method. When the cloud server 200 sends an instruction to the electronic device 100 to perform a corresponding action, it also sends the context of the voice dialog (the intent and slot information) to the electronic device 100 at the same time. If a network interruption during a multi-turn voice service causes a device-cloud switch (a switch between the electronic device 100 and the cloud server 200, that is, between path 1 and path 2), the electronic device 100 can continue the original voice service based on the dialog context and the next voice signal it receives, which solves the problem of multi-turn voice services being interrupted.
The steps of a voice interaction processing method provided by this application are described below. FIG. 5A shows a voice interaction process in the voice interaction system 10.
At time T1 the network quality is good. The electronic device 100 receives voice 1 and activates the voice interaction function. The electronic device 100 dispatches the received voice 1 according to the preset rule; because the network quality is good at this time, it uploads voice 1 to the cloud server 200 for processing, which includes speech recognition, semantic understanding, dialog management, and speech synthesis. In the embodiments of this application, voice 1 may also be called the first voice signal.
Based on the same inventive concept, speech recognition, semantic understanding, dialog management, and speech synthesis at time T1 in the embodiment shown in FIG. 5A work on principles similar to path 1 in the embodiment shown in FIG. 4. For how the cloud server 200 implements them at time T1, refer to the descriptions of steps ①②③④ performed by the cloud server 200 in path 1 of FIG. 4; they are not repeated here.
The cloud server 200 determines the next action and sends an instruction to the electronic device 100 to perform action 1. At the same time, the cloud server 200 sends the voice dialog context to the electronic device 100; the dialog context is the intent and slot information that the cloud server 200 obtained by recognizing and understanding voice 1. Action 1 includes one or more of the following: playing voice reply content for voice 1 (for example, providing a result, asking about a specific constraint, or clarifying or confirming a request); displaying the text of the voice reply content; jumping to a corresponding interface; and so on.
After receiving the instruction and the dialog context, the electronic device 100 forwards them through a dialog information forwarding module, performs action 1 based on the instruction, and saves the dialog context. The dialog information forwarding module can be regarded as a node that receives data sent by the cloud server 200 and forwards it.
At time T2 the network quality is poor. After outputting the voice reply content for voice 1, the electronic device 100 receives voice 2. Because the network quality is poor, data cannot be transferred between the electronic device 100 and the cloud server 200, so the electronic device 100 cannot upload voice 2 to the cloud server 200. The electronic device 100 uses its own voice processing capability to process voice 2, including speech recognition, semantic understanding, dialog management, and speech synthesis. For how the electronic device 100 implements these at time T2, refer to the description of path 2 in FIG. 4; it is not repeated here. In the embodiments of this application, voice 2 may also be called the second voice signal.
Note that, unlike path 2 above, in the dialog management part of this embodiment the electronic device 100 determines the next action based on both voice 2 and the dialog context saved at time T1. The electronic device 100 fills the missing slots using the semantic information of voice 2 together with the saved intent and slot information. If the semantic information of voice 2 is insufficient and one or more slots are still missing, the electronic device 100 determines that the next action is to ask a follow-up question about the missing slots. If no slot information is missing, the user's intent is converted into an explicit instruction that directs the electronic device 100 to perform the corresponding action.
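The handoff described above, in which the cloud sends the dialog context alongside the instruction and the device later resumes from the saved context, can be sketched as follows. The JSON message layout, the `Device` class, and the action names are assumptions for illustration only; the source does not define a wire format.

```python
import json


def cloud_reply(intent, slots, prompt):
    """Cloud side: bundle the instruction and the dialog context together."""
    return json.dumps({
        "instruction": {"action": "play_reply", "text": prompt},
        "context": {"intent": intent, "slots": slots},
    })


class Device:
    def __init__(self):
        self.saved_context = None  # dialog context received at time T1

    def on_cloud_message(self, raw):
        """Save the dialog context and return the instruction to perform."""
        msg = json.loads(raw)
        self.saved_context = msg["context"]
        return msg["instruction"]

    def handle_locally(self, slot_value):
        """At time T2 (network down): fill a missing slot from saved context."""
        ctx = self.saved_context
        if ctx is None:
            return {"action": "start_new_dialog"}
        for name, value in ctx["slots"].items():
            if value is None:
                ctx["slots"][name] = slot_value
                break
        missing = [n for n, v in ctx["slots"].items() if v is None]
        if missing:
            return {"action": "ask", "slot": missing[0]}
        self.saved_context = None  # intent complete; context no longer needed
        return {"action": "execute", "intent": ctx["intent"],
                "slots": ctx["slots"]}
```

In the phone-call scenario, the device would first receive a message built by `cloud_reply("call", {"callee": None}, "Who do you want to call?")` and later resolve the call locally with `handle_locally("Xiao Ming")`.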
Time T2 falls in the period after the electronic device 100 receives the instruction and dialog context sent by the cloud server 200 for voice 1 and before the electronic device 100 uploads voice 2 to the cloud server 200. For example, T2 may be before the electronic device 100 receives voice 2, or after it receives voice 2 but before the upload to the cloud server 200. In other words, poor network quality at time T2 prevents voice 2 from being uploaded to the cloud server 200.
In some embodiments, time T2 may instead be after the electronic device 100 uploads voice 2 to the cloud server 200 and before the cloud server 200 delivers its instruction to the electronic device 100. That is, poor network quality prevents the cloud server 200 from delivering the instruction generated for voice 2. As shown in FIG. 5B, the electronic device 100 uploads voice 2 to the cloud server 200, and the cloud server 200 processes it, including speech recognition, semantic understanding, dialog management, and speech synthesis. A network quality problem then occurs, data cannot be transferred between the electronic device 100 and the cloud server 200, and the cloud server 200 cannot deliver the instruction generated for voice 2 to the electronic device 100.
Optionally, if the electronic device 100 does not receive the instruction for voice 2 from the cloud server 200 within a preset time after uploading voice 2, the electronic device 100 uses its own voice processing capability to process voice 2 (for example, a backed-up copy of voice 2).
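The preset-time fallback can be sketched with a worker thread and a timeout. This is one possible realization, not the method claimed in the source; `process_with_fallback` and the demo functions are hypothetical, and the timeout values are arbitrary.

```python
import concurrent.futures
import time


def process_with_fallback(voice, cloud_fn, local_fn, timeout_s):
    """Try the cloud first; fall back to local processing on timeout.

    The device keeps its own copy of `voice` so it can reprocess it locally.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud_fn, voice)
    try:
        return ("cloud", future.result(timeout=timeout_s))
    except (concurrent.futures.TimeoutError, ConnectionError):
        # No instruction within the preset time, or the link dropped.
        return ("device", local_fn(voice))
    finally:
        pool.shutdown(wait=False)  # do not block on the stalled request


# Demo stand-ins for cloud and on-device processing (illustrative only).
def demo_slow_cloud(voice):
    time.sleep(0.3)
    return "cloud:" + voice


def demo_fast_cloud(voice):
    return "cloud:" + voice


def demo_local(voice):
    return "local:" + voice
```

With a 0.05 s limit, `process_with_fallback("v2", demo_slow_cloud, demo_local, 0.05)` falls back to the device, whereas a responsive cloud function returns the cloud result.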
Optionally, if, after uploading voice 2 to the cloud server 200 and before receiving the instruction for voice 2, the electronic device 100 detects that its network connection to the cloud server 200 has been lost, the electronic device 100 uses its own voice processing capability to process voice 2 (for example, a backed-up copy of voice 2).
For the processing itself, refer to the description of how the electronic device 100 handles voice 2 at time T2 in FIG. 5A; it is not repeated here.
In this way, during voice interaction, each time the cloud server 200 delivers an instruction it also sends the dialog context to the electronic device 100, which receives and saves it. If the network is interrupted, the electronic device 100 can continue a voice service that was originally being processed on the cloud server 200 by using the saved dialog context. The voice service is not interrupted, processing efficiency improves, and the user experience is better.
In some embodiments, when the cloud server 200 delivers an instruction, it sends the dialog context (the intent and slot information) to the electronic device 100 only if at least one of the slots is missing information.
Specifically, with reference to FIG. 5A, at time T1, while processing voice 1, the cloud server 200 determines from the semantic information of voice 1 the intent expressed and the slot information corresponding to that intent. One intent may correspond to one or more slots. The cloud server 200 fills the slots from the semantic information. If all slots are filled, that is, no slot information is missing, the user's intent is converted into an explicit instruction, and the cloud server 200 sends it to the electronic device 100 to perform the corresponding action. For example, the cloud server 200 obtains the voice signal "Please help me open the gallery" and determines from its semantic information that the user's intent is to open an object, so the slot corresponding to the intent is the object to open. The cloud server 200 fills the slot from the semantic information and determines that the object to open is the gallery. Dialog management then determines an explicit instruction, namely the instruction to open the gallery. Because the user's intent has been fulfilled at this point, the cloud server 200 determines that the intent has ended, so it does not need to send the dialog context (intent and slot information) to the electronic device 100, which saves resources.
When insufficient semantic information leaves one or more slots missing, the cloud server 200 needs to save the current intent and slot information, ask a follow-up question about the missing slots, and wait for the next voice signal. It then fills the slots from that next voice signal in combination with the saved intent and slot information and determines the next action. In the embodiments of this application, the cloud server 200 uses speech synthesis to generate voice reply content for voice 1, sends an instruction to the electronic device 100 to output that content, and at the same time sends the dialog context (the intent and slot information for voice 1) to the electronic device 100, which receives and saves it. As a result, even if the network is interrupted when the electronic device 100 receives the next voice signal, the electronic device 100 can process that signal using its own voice interaction capability together with the saved dialog context, which improves processing efficiency and the user experience.
In some embodiments, when two or more slots are missing, the cloud server 200 sends more than the intent and slot information: it may also annotate the slots to tell the electronic device 100 the order in which they should be filled. This lets the electronic device fill the correct slot when it processes the next voice signal.
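The two optimizations just described (send context only when a slot is missing, and annotate the fill order when several are missing) can be sketched in one response builder. The message fields, including `fill_order`, and the `book_ticket` intent are hypothetical; the source does not specify how the annotation is encoded.

```python
def build_response(intent, slots, instruction):
    """Cloud side: attach dialog context only when some slot is still empty."""
    missing = [name for name, value in slots.items() if value is None]
    msg = {"instruction": instruction}
    if missing:
        msg["context"] = {"intent": intent, "slots": slots}
        if len(missing) >= 2:
            # Tell the device which empty slot to fill first, second, ...
            msg["context"]["fill_order"] = missing
    return msg
```

A completed intent such as opening the gallery produces a message with no `context` key, while an intent with two empty slots carries both the context and the order in which to fill them.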
Next, the voice interaction processing method of the embodiments of this application is described in detail using a phone-call scenario as an example.
As shown in FIG. 6, the network quality is good at time T1. A user who wants to make a phone call by voice can start the voice assistant application (APP) and say "I want to make a call". The electronic device 100 receives this voice signal through the voice assistant application and dispatches it according to the preset rule; because the network quality is good at this time, the electronic device 100 uploads "I want to make a call" to the cloud server 200 for processing.
After receiving the voice signal "I want to make a call", the cloud server 200 converts it into text information using speech recognition (ASR), obtains the semantic information using natural language understanding (NLU), and recognizes that the user's intent is to make a call. Next, the cloud server 200 determines that the slot information for this intent includes the call recipient and fills the slot from the semantic information. Because the semantic information of "I want to make a call" does not include a call recipient, the cloud server 200 determines that the slot (call recipient) for the intent (make a call) is empty.
The cloud server 200 determines that the next action is to ask about the vacant slot. It generates the voice reply content "Who do you want to call" using text-to-speech (TTS) and sends the electronic device 100 an instruction carrying that voice reply content, instructing the electronic device 100 to play it. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100; the dialogue context includes the intent "make a call" and the slot information "callee (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "Who do you want to call" based on the instruction, and saves the dialogue context.
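One way to picture the synchronized dialogue context is as a small serializable record holding the intent and its slots. The sketch below is illustrative only: the class name `DialogContext`, the field names, the slot key `callee`, and the JSON wire format are assumptions, not something the application specifies.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class DialogContext:
    """Hypothetical dialogue context: an intent plus its slots (None = vacant)."""
    intent: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_slots(self) -> List[str]:
        return [name for name, value in self.slots.items() if value is None]


# Cloud side: bundle the reply text together with the context it was derived from.
cloud_context = DialogContext(intent="make_call", slots={"callee": None})
payload = json.dumps({
    "reply_text": "Who do you want to call",
    "context": asdict(cloud_context),
})

# Device side: play the reply and keep a local copy of the context.
message = json.loads(payload)
saved_context = DialogContext(**message["context"])
print(message["reply_text"], saved_context.missing_slots())
```

Sending the context in the same message as the reply means the device always holds the state that matches the question it has just played.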
Optionally, the cloud server 200 may also send the electronic device 100 an instruction carrying the text data of the voice reply content, instructing the electronic device 100 to display that text data (the text "Who do you want to call").
After the electronic device 100 plays the voice reply content "Who do you want to call", the user inputs another voice signal, "Call Xiaoming". At time T2 the network quality is poor, so the electronic device 100 invokes its own voice processing capability to process the voice signal "Call Xiaoming". The electronic device 100 converts the voice signal into text using ASR and obtains semantic information using NLU. Next, the electronic device 100 fills slots based on the saved intent "make a call", the saved slot information "callee (vacant)", and the semantic information corresponding to "Call Xiaoming". The electronic device 100 recognizes that "Xiaoming" in the semantic information of "Call Xiaoming" is the callee; that is, the electronic device 100 determines that the slot (callee) corresponding to the intent (make a call) has the value "Xiaoming".
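The on-device step can be sketched as filling vacant slots from the new utterance's semantic information and then deciding the next action. This is a rough illustration: the function name `fill_and_decide`, the slot keys, and the action strings are hypothetical, and the ASR/NLU stages are assumed to have already produced the `semantics` dictionary.

```python
from typing import Dict, Optional, Tuple

Slots = Dict[str, Optional[str]]


def fill_and_decide(intent: str, saved_slots: Slots,
                    semantics: Dict[str, str]) -> Tuple[Slots, str]:
    """Fill vacant slots from the semantics, then pick the next action."""
    slots = dict(saved_slots)  # work on a copy of the saved context
    for name in slots:
        if slots[name] is None and name in semantics:
            slots[name] = semantics[name]
    missing = [name for name, value in slots.items() if value is None]
    # Ask about the first vacant slot, or execute the intent once none remain.
    action = f"ask:{missing[0]}" if missing else f"execute:{intent}"
    return slots, action


# Saved context from the cloud: intent "make_call", callee vacant.
# Assumed NLU output for "Call Xiaoming": {"callee": "Xiaoming"}.
slots, action = fill_and_decide("make_call", {"callee": None}, {"callee": "Xiaoming"})
print(slots, action)
```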
The electronic device 100 determines that the next action is to call Xiaoming and to output the voice reply content "Calling Xiaoming". It generates the voice reply content "Calling Xiaoming" using TTS and plays it. The electronic device 100 also looks up the contact Xiaoming in the address book and invokes its call capability to call Xiaoming. Optionally, the electronic device 100 may also display the text data of the voice reply content (the text "Calling Xiaoming").
The voice interaction processing method in the phone call scenario is described above. Below, taking a smartphone as an example of the electronic device 100, some voice interaction processes are illustrated with specific scenarios. During voice interaction, if the network quality of the electronic device 100 changes from good to poor, processing of the voice signal switches from the cloud server 200 to the electronic device 100. Because the cloud server 200 delivers the dialogue context and the electronic device 100 saves it, the electronic device 100 can keep the voice service uninterrupted even if the network is interrupted during a multi-turn dialogue. As shown in FIG. 7A and FIG. 7B, the wake-up word is set to "Xiaoyi Xiaoyi".
User: Xiaoyi Xiaoyi, I want to make a call.
Smartphone (electronic device 100): Who do you want to call?
User: Xiaoming.
Smartphone (electronic device 100): OK, calling Xiaoming for you.
With reference to FIG. 8A to FIG. 8D, and taking the above voice dialogue as an example, the following describes how the voice interaction processing method provided by the embodiments of this application appears on the display interface of a smartphone.
FIG. 8A shows a voice interaction interface 801, which may be, for example, an interface of the voice assistant application. The voice interaction interface 801 includes a status bar 8011 and a function bar 8012.
The status bar 8011 may include one or more signal strength indicators 8013 for the wireless network signal, a battery status indicator 8014, and a time indicator 8015. The signal strength indicator 8013 indicates the current network quality (it may also indicate the data transmission rate between the electronic device 100 and the cloud server 200). In FIG. 8A, the signal strength indicator 8013 shows full signal (4 bars), indicating that the current network quality is good.
The function bar 8012 may include one or more function controls, such as a voice input control 8016. When the electronic device 100 detects a user operation on the voice input control 8016, the electronic device 100 receives a voice signal. As shown in FIG. 8A, the electronic device 100 receives the voice signal "Xiaoyi Xiaoyi, I want to make a call" and displays it on the voice interaction interface 801.
As shown in FIG. 8B, after receiving the voice signal "Xiaoyi Xiaoyi, I want to make a call", the electronic device 100 can upload the voice signal to the cloud server 200 for processing, play the voice reply content "Who do you want to call" based on the instruction returned by the cloud server 200, and display it on the voice interaction interface 802. The voice input control 8016 changes into a voice output control 8026, indicating that the electronic device 100 is currently outputting voice. In the embodiments of this application, when the cloud server 200 returns the instruction, it also synchronously returns the voice dialogue context to the electronic device 100, and the electronic device 100 receives and saves that dialogue context.
The user continues to input voice. As shown in FIG. 8C, the current network quality of the electronic device 100 is poor and the signal strength indicator 8033 shows only two bars, so the electronic device 100 and the cloud server 200 cannot exchange data, or the data transmission rate is too low. When the electronic device 100 receives the voice signal "Xiaoming", it cannot upload the voice signal to the cloud server 200 for processing, or the cloud server 200 cannot deliver an instruction to the electronic device 100. In this case, the electronic device 100 can continue to process the voice signal "Xiaoming" based on the saved dialogue context, play the voice reply content "OK, calling Xiaoming for you", and display it on the voice interaction interface 803. The electronic device 100 also performs the call action and jumps to the call interface. FIG. 8D shows a call interface 804, which indicates that the electronic device 100 is currently calling Xiaoming.
The above is an application scenario in which the voice service is a multi-turn dialogue (specifically, two turns). During the dialogue, the network quality of the electronic device 100 changes from good to poor, and processing of the voice signal switches from the cloud server 200 to the electronic device 100. Because the cloud server 200 delivers the dialogue context and the electronic device 100 saves it, the electronic device 100 can keep the voice service uninterrupted even if the network is interrupted during the multi-turn dialogue, which improves the processing efficiency of the voice service.
Next, the embodiments of this application provide a three-turn dialogue scenario. Taking sending a text message as the example, the voice interaction processing method in this scenario is described briefly.
When the network quality is good, the electronic device 100 receives the voice signal "I want to send a text message" input by the user and performs distribution control on it based on preset rules. For example, because the network quality is good at this time, the electronic device 100 uploads the received "I want to send a text message" to the cloud server 200 for processing.
The cloud server 200 recognizes that the user's intent is to send a text message. Next, the cloud server 200 determines that the slot information corresponding to this intent includes the recipient of the text message and the content of the text message, and fills slots according to the semantic information. Because the semantic information of "I want to send a text message" includes neither the recipient nor the content, the cloud server 200 determines that the slots (recipient, content) corresponding to the intent (send a text message) are vacant.
The cloud server 200 determines that the next action is to ask about the vacant slot information. Because two slots are vacant, the cloud server 200 can ask about one of them according to priority, for example the recipient first. The cloud server 200 generates the voice reply content "Who do you want to text" using TTS and sends the electronic device 100 an instruction carrying that voice reply content, instructing the electronic device 100 to play it. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100; the dialogue context includes the intent "send a text message" and the slot information "recipient (vacant), content (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "Who do you want to text" based on the instruction, and saves the dialogue context.
Next, after the electronic device 100 plays the voice reply content "Who do you want to text", the user inputs another voice signal, "To Xiaoming". If the network quality is good at this time, the electronic device 100 uploads the received "To Xiaoming" to the cloud server 200 for processing. The cloud server 200 fills slots based on the saved intent "send a text message", the saved slot information "recipient (vacant), content (vacant)", and the semantic information corresponding to "To Xiaoming". The cloud server 200 recognizes that "Xiaoming" in the semantic information of "To Xiaoming" is the recipient; that is, the cloud server 200 determines that the slot (recipient) corresponding to the intent (send a text message) has the value "Xiaoming".
Because the slot "content" is still vacant at this point, the cloud server 200 saves the current intent and slot information and determines that the next action is to ask again about the vacant slot (content). The cloud server 200 generates the voice reply content "What do you want to send" using TTS and sends the electronic device 100 an instruction carrying that voice reply content, instructing the electronic device 100 to play it. In addition, the cloud server 200 synchronously sends the dialogue context to the electronic device 100; at this point, the dialogue context includes the intent (send a text message) and the slot information "recipient (Xiaoming), content (vacant)". The electronic device 100 receives the instruction and the dialogue context sent by the cloud server 200, plays the voice reply content "What do you want to send" based on the instruction, and saves the dialogue context.
In some embodiments, after the electronic device 100 plays the voice reply content "Who do you want to text", the user inputs another voice signal, "To Xiaoming". If the network quality is poor at this time, the electronic device 100 invokes its own voice processing capability to process the voice signal "To Xiaoming". The electronic device 100 converts the voice signal into text using ASR and obtains semantic information using NLU. Next, the electronic device 100 fills slots based on the saved intent "send a text message", the saved slot information "recipient (vacant), content (vacant)", and the semantic information corresponding to "To Xiaoming". The electronic device 100 recognizes that "Xiaoming" in the semantic information of "To Xiaoming" is the recipient; that is, the electronic device 100 determines that the slot (recipient) corresponding to the intent (send a text message) has the value "Xiaoming".
In some embodiments, when two or more slots are missing, the cloud server 200 synchronously sends the electronic device 100 more than the intent and slot information: it can also annotate the slots to indicate the order in which the electronic device 100 should fill them. In this way, the electronic device 100 can fill the correct slot when it processes the next voice signal.
That is, in the above example, because two slots are vacant when the cloud server 200 synchronously sends the intent and slot information to the electronic device 100, the cloud server 200 can annotate the slots to specify which slot is to be filled next. When the electronic device 100 then fills the slot, it does not need to judge which slot the semantic information corresponds to and can fill the annotated slot directly. That is, the electronic device 100 can directly determine that the slot (recipient) corresponding to the intent (send a text message) has the value "Xiaoming".
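The slot annotation can be pictured as an extra field in the synchronized context that names the slot to fill next. The sketch below is only one possible encoding: the keys `fill_order` and `next_slot`, the slot names, and the function `fill_annotated_slot` are hypothetical.

```python
from typing import Any, Dict


def fill_annotated_slot(context: Dict[str, Any], value: str) -> Dict[str, Any]:
    """Put the value into the slot annotated as next, then advance the annotation."""
    slots = dict(context["slots"])
    # No matching step is needed: the annotation already names the target slot.
    slots[context["next_slot"]] = value
    remaining = [name for name in context["fill_order"] if slots[name] is None]
    return {**context, "slots": slots, "next_slot": remaining[0] if remaining else None}


# Context as the cloud server might send it after "I want to send a text message".
context = {
    "intent": "send_sms",
    "slots": {"recipient": None, "content": None},
    "fill_order": ["recipient", "content"],  # annotation: recipient first
    "next_slot": "recipient",
}
updated = fill_annotated_slot(context, "Xiaoming")
print(updated["slots"], updated["next_slot"])
```

The trade-off in this sketch is that the device trusts the annotation instead of checking the utterance against each slot, which keeps on-device processing light.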
Because the slot "content" is still vacant at this point, the electronic device 100 saves the current intent and slot information and determines that the next action is to ask again about the vacant slot (content). The electronic device 100 generates the voice reply content "What do you want to send" using TTS and plays it.
The electronic device 100 processes the next received voice signal and fills slots again, repeating this until the slot information is completely filled and an instruction to execute the intent is generated, at which point the electronic device 100 determines that the intent has been executed.
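The repeat-until-filled behaviour can be sketched as a loop that asks about the first vacant slot and fills it from the next user reply. This is a simplified illustration: the function name `run_until_filled`, the prompt table, and the assumption that each reply fills exactly the slot just asked about are not taken from the application.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def run_until_filled(slot_order: List[str], prompts: Dict[str, str],
                     user_values: Iterable[str]) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Ask for each vacant slot in turn until every slot has a value."""
    slots: Dict[str, Optional[str]] = {name: None for name in slot_order}
    spoken: List[str] = []
    values = iter(user_values)
    while True:
        missing = [name for name in slot_order if slots[name] is None]
        if not missing:
            return slots, spoken  # all slots filled: the intent can be executed
        spoken.append(prompts[missing[0]])  # ask about the first vacant slot
        slots[missing[0]] = next(values)    # raises StopIteration if replies run out


slots, spoken = run_until_filled(
    ["recipient", "content"],
    {"recipient": "Who do you want to text", "content": "What do you want to send"},
    ["Xiaoming", "See you at eight"],
)
print(slots, spoken)
```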
This application provides a voice interaction processing method. As shown in FIG. 9, the method includes the following steps.
The electronic device 100 establishes a connection with the cloud server 200.
Step S101: The electronic device 100 receives a first voice signal.
The first voice signal may be, for example, voice 1 in FIG. 5A or FIG. 5B above, or the voice "I want to make a call" in FIG. 6.
Step S102: The electronic device 100 uploads the first voice signal to the cloud server 200.
Step S103: The cloud server 200 recognizes the first voice signal, obtains the corresponding intent and one or more pieces of slot information corresponding to the intent, and determines first voice reply content based on the intent and the one or more pieces of slot information.
Step S104: The cloud server 200 sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device 100.
Step S105: The electronic device 100 outputs the first voice reply content and saves the intent and the one or more pieces of slot information.
The first voice reply content may be, for example, the voice reply content included in action 1 in FIG. 5A above, or the voice "Who do you want to call" in FIG. 6.
The communication quality between the electronic device 100 and the cloud server 200 is poor.
Step S106: The electronic device 100 receives a second voice signal.
The second voice signal may be, for example, voice 2 in FIG. 5A or FIG. 5B above, or the voice "Call Xiaoming" in FIG. 6.
Step S107: The electronic device 100 recognizes the second voice signal, obtains the corresponding semantic information, and determines a first operation based on the intent, the one or more pieces of slot information, and the semantic information.
Step S108: The electronic device 100 performs the first operation.
The first operation may be, for example, action 2 in FIG. 5A or FIG. 5B, or one or more of the following three actions in FIG. 6: playing the voice content "Calling Xiaoming", displaying the text content "Calling Xiaoming", and calling Xiaoming.
In some embodiments, the poor communication quality between the electronic device 100 and the cloud server 200 may occur at any time between step S106 and step S107.
In a possible implementation, the electronic device 100 recognizing the second voice signal, obtaining the corresponding semantic information, and determining the first operation based on the intent, the one or more pieces of slot information, and the semantic information includes: the electronic device 100 recognizes that the semantic information matches one of the missing slots in the one or more pieces of slot information and fills the semantic information in as the value of that slot; the electronic device 100 then determines the first operation based on the intent and the filled slot information. This describes how the electronic device continues the original voice service based on the second voice signal: because the electronic device has obtained the intent and slot information corresponding to the first voice signal, it can perform slot filling on the received second voice signal based on that intent and slot information, and so continue processing the original voice service.
In a possible implementation, the first operation includes one or more of the following: playing second voice reply content; displaying the text content of the second voice reply content; jumping to a corresponding interface. The second voice reply content may be, for example, the voice reply content included in action 2 in FIG. 5A or FIG. 5B above, or the voice "Calling Xiaoming" in FIG. 6.
In a possible implementation, the method further includes: the electronic device receives a first instruction sent by the cloud server; the electronic device displays the text content of the first voice reply content based on the first instruction, and/or jumps to a corresponding interface.
In a possible implementation, the poor communication quality between the electronic device and the cloud server includes: the electronic device fails to upload the second voice signal to the cloud server; or, after the electronic device uploads the first voice signal to the cloud server, it does not receive reply data from the cloud server within a preset time. This describes when the poor communication quality occurs: it may occur when the electronic device uploads the second voice signal, or when the cloud server delivers the voice reply content for the second voice signal.
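Both conditions (a failed upload, or no reply within a preset time) can be treated as triggers for falling back to on-device processing. The following is a minimal sketch under assumptions: `handle_voice_signal` and the two callables are hypothetical, and the cloud client is assumed to signal a failed upload with `ConnectionError` and a missed deadline with `TimeoutError`.

```python
from typing import Callable, Tuple


def handle_voice_signal(signal: bytes,
                        upload_to_cloud: Callable[[bytes, float], str],
                        process_locally: Callable[[bytes], str],
                        timeout_s: float = 2.0) -> Tuple[str, str]:
    """Try the cloud first; fall back to the device on failure or timeout."""
    try:
        return "cloud", upload_to_cloud(signal, timeout_s)
    except (ConnectionError, TimeoutError):
        # Upload failed, or no reply arrived within the preset time.
        return "device", process_locally(signal)


def flaky_cloud(signal: bytes, timeout_s: float) -> str:
    # Stand-in for a cloud call that gets no reply in time.
    raise TimeoutError("no reply within preset time")


def local_engine(signal: bytes) -> str:
    # Stand-in for on-device ASR, NLU, and slot filling.
    return "Calling Xiaoming"


where, reply = handle_voice_signal(b"\x00\x01", flaky_cloud, local_engine)
print(where, reply)
```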
In a possible implementation, the electronic device receiving the first voice signal includes: the electronic device receives the first voice signal through a voice assistant application.
The embodiments of this application also provide a computer-readable storage medium. The methods described in the above method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media may include computer storage media and communication media, and may also include any medium that can transfer a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
The embodiments of this application also provide a computer program product. The methods described in the above method embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the above method embodiments are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, wireless, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive).
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and when executed it may include the processes of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (17)

  1. A voice interaction processing method, characterized in that the method comprises:
    an electronic device receiving a first voice signal;
    when the electronic device has established a connection with a cloud server, the electronic device uploading the first voice signal to the cloud server;
    the electronic device receiving first voice reply content, an intent, and one or more pieces of slot information corresponding to the intent sent by the cloud server, wherein the intent and the one or more pieces of slot information are obtained by the cloud server recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more pieces of slot information;
    the electronic device receiving a second voice signal after outputting the first voice reply content;
    when the communication quality between the electronic device and the cloud server is poor, the electronic device recognizing the second voice signal to obtain corresponding semantic information, and determining a first operation based on the intent, the one or more pieces of slot information, and the semantic information;
    the electronic device performing the first operation.
  2. The method according to claim 1, characterized in that the electronic device determining the first operation based on the intent, the one or more pieces of slot information, and the semantic information comprises:
    the electronic device recognizing that the semantic information matches one of the missing slots in the one or more pieces of slot information, and filling the semantic information in as the value of that slot;
    the electronic device determining the first operation based on the intent and the filled one or more pieces of slot information.
  3. The method according to claim 1 or 2, wherein the first operation comprises one or more of the following:
    Playing second voice reply content;
    Displaying text content of the second voice reply content;
    Jumping to a corresponding interface.
  4. The method according to any one of claims 1-3, wherein the method further comprises:
    The electronic device receives a first instruction sent by the cloud server;
    The electronic device, based on the first instruction, displays text content of the first voice reply content and/or jumps to a corresponding interface.
  5. The method according to any one of claims 1-4, wherein the communication quality between the electronic device and the cloud server being poor comprises:
    The electronic device fails to upload the second voice signal to the cloud server; or
    After the electronic device uploads the first voice signal to the cloud server, it does not receive reply data from the cloud server within a preset time.
  6. The method according to any one of claims 1-5, wherein the electronic device receiving the first voice signal comprises:
    The electronic device receives the first voice signal through a voice assistant application.
  7. A voice interaction processing method, wherein the method comprises:
    A cloud server receives a first voice signal uploaded by an electronic device;
    The cloud server recognizes the first voice signal to obtain a corresponding intent and one or more pieces of slot information corresponding to the intent, and determines first voice reply content based on the intent and the one or more pieces of slot information;
    The cloud server sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.
  8. The method according to claim 7, wherein the cloud server sending the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device comprises:
    When at least one of the one or more pieces of slot information is missing, the cloud server sends the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.
  9. An electronic device, comprising: one or more processors and one or more memories; the one or more memories are respectively coupled to the one or more processors; the one or more memories are configured to store computer program code, the computer program code comprising computer instructions; when the computer instructions run on the processor, the electronic device is caused to perform:
    Receiving a first voice signal;
    When a connection with a cloud server has been established, uploading the first voice signal to the cloud server;
    Receiving first voice reply content, an intent, and one or more pieces of slot information corresponding to the intent, all sent by the cloud server, wherein the intent and the one or more pieces of slot information are obtained by the cloud server by recognizing the first voice signal, and the first voice reply content is determined by the cloud server based on the intent and the one or more pieces of slot information;
    After outputting the first voice reply content, receiving a second voice signal;
    When the communication quality with the cloud server is poor, recognizing the second voice signal to obtain corresponding semantic information, and determining a first operation based on the intent, the one or more pieces of slot information, and the semantic information;
    Performing the first operation.
  10. The electronic device according to claim 9, wherein the determining the first operation based on the intent, the one or more pieces of slot information, and the semantic information comprises:
    Identifying that the semantic information matches a missing slot among the one or more pieces of slot information, and filling in the semantic information as the value of that slot;
    Determining the first operation based on the intent and the filled one or more pieces of slot information.
  11. The electronic device according to claim 9 or 10, wherein the first operation comprises one or more of the following:
    Playing second voice reply content;
    Displaying text content of the second voice reply content;
    Jumping to a corresponding interface.
  12. The electronic device according to any one of claims 9-11, wherein the electronic device further performs:
    Receiving a first instruction sent by the cloud server;
    Based on the first instruction, displaying text content of the first voice reply content and/or jumping to a corresponding interface.
  13. The electronic device according to any one of claims 9-12, wherein the communication quality with the cloud server being poor comprises:
    Uploading the second voice signal to the cloud server fails; or
    After the first voice signal is uploaded to the cloud server, no reply data from the cloud server is received within a preset time.
  14. The electronic device according to any one of claims 9-13, wherein the electronic device receiving the first voice signal comprises:
    Receiving the first voice signal through a voice assistant application.
  15. A cloud server, comprising: one or more processors and one or more memories; the one or more memories are respectively coupled to the one or more processors; the one or more memories are configured to store computer program code, the computer program code comprising computer instructions; when the computer instructions run on the processor, the cloud server is caused to perform:
    Receiving a first voice signal uploaded by an electronic device;
    Recognizing the first voice signal to obtain a corresponding intent and one or more pieces of slot information corresponding to the intent, and determining first voice reply content based on the intent and the one or more pieces of slot information;
    Sending the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.
  16. The cloud server according to claim 15, wherein the sending the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device comprises:
    When at least one of the one or more pieces of slot information is missing, sending the first voice reply content, the intent, and the one or more pieces of slot information to the electronic device.
  17. A computer-readable medium for storing one or more programs, wherein the one or more programs are configured to be executed by the one or more processors, and the one or more programs comprise instructions for performing the method according to claims 1-8.
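
The device-side fallback recited in claims 1, 2 and 5 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `DialogState`, `CloudUnavailable`, `fill_missing_slot`, `handle_second_utterance`, the `matchers` table and the returned dictionaries are all hypothetical, and the sketch assumes that on-device recognition has already turned the second voice signal into a text string (the "semantic information").

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class CloudUnavailable(Exception):
    """Hypothetical signal: the upload failed or no reply arrived within the preset time."""


@dataclass
class DialogState:
    """Intent and slots cached from the cloud server's first reply."""
    intent: str
    # A value of None marks a slot that the first utterance left missing.
    slots: Dict[str, Optional[str]] = field(default_factory=dict)


def fill_missing_slot(
    state: DialogState,
    semantic_info: str,
    matchers: Dict[str, Callable[[str], bool]],
) -> Optional[str]:
    """Fill the first missing slot whose matcher accepts the semantic information."""
    for name, value in state.slots.items():
        matcher = matchers.get(name)
        if value is None and matcher is not None and matcher(semantic_info):
            state.slots[name] = semantic_info
            return name
    return None


def handle_second_utterance(
    state: DialogState,
    semantic_info: str,
    upload: Callable[[str], dict],
    matchers: Dict[str, Callable[[str], bool]],
) -> dict:
    """Try the cloud first; if it is unavailable, decide the operation locally."""
    try:
        return upload(semantic_info)
    except CloudUnavailable:
        filled = fill_missing_slot(state, semantic_info, matchers)
        still_missing = any(v is None for v in state.slots.values())
        if filled is None or still_missing:
            # Assumed behaviour: ask the user again when the slots are incomplete.
            return {"operation": "ask_again", "intent": state.intent}
        return {
            "operation": "execute",
            "intent": state.intent,
            "slots": dict(state.slots),
        }
```

For example, with a cached intent `book_ticket`, a filled `destination` slot, a missing `date` slot, and an `upload` callable that raises `CloudUnavailable`, passing the semantic information `"tomorrow"` fills `date` locally and yields an `execute` operation without any reply from the cloud server.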
PCT/CN2021/139631 2020-12-31 2021-12-20 Voice interaction processing method and related apparatus WO2022143258A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011636583.9A CN114694646A (en) 2020-12-31 2020-12-31 Voice interaction processing method and related device
CN202011636583.9 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022143258A1 (en) 2022-07-07

Family

ID=82134513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/139631 WO2022143258A1 (en) 2020-12-31 2021-12-20 Voice interaction processing method and related apparatus

Country Status (2)

Country Link
CN (1) CN114694646A (en)
WO (1) WO2022143258A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115910035A (en) * 2023-03-01 2023-04-04 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662555B (en) * 2023-07-28 2023-10-20 成都赛力斯科技有限公司 Request text processing method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886948A (en) * 2017-11-16 2018-04-06 百度在线网络技术(北京)有限公司 Voice interactive method and device, terminal, server and readable storage medium storing program for executing
US20180143802A1 (en) * 2016-11-24 2018-05-24 Samsung Electronics Co., Ltd. Method for processing various inputs, and electronic device and server for the same
US20190304456A1 (en) * 2018-03-30 2019-10-03 Fujitsu Limited Storage medium, spoken language understanding apparatus, and spoken language understanding method
CN110444206A (en) * 2019-07-31 2019-11-12 北京百度网讯科技有限公司 Voice interactive method and device, computer equipment and readable medium
CN111104495A (en) * 2019-11-19 2020-05-05 深圳追一科技有限公司 Information interaction method, device, equipment and storage medium based on intention recognition
CN111144128A (en) * 2019-12-26 2020-05-12 北京百度网讯科技有限公司 Semantic parsing method and device
CN111341311A (en) * 2020-02-21 2020-06-26 深圳前海微众银行股份有限公司 Voice conversation method and device
CN111477225A (en) * 2020-03-26 2020-07-31 北京声智科技有限公司 Voice control method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114694646A (en) 2022-07-01

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21913999

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21913999

Country of ref document: EP

Kind code of ref document: A1