WO2023231936A1 - Voice interaction method and terminal - Google Patents

Voice interaction method and terminal

Info

Publication number
WO2023231936A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
signal
intention
voice signal
Prior art date
Application number
PCT/CN2023/096683
Other languages
English (en)
French (fr)
Inventor
陈开济
陈家胜
史舒婷
Original Assignee
华为技术有限公司
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023231936A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present application relates to the field of human-computer interaction, and in particular, to a voice interaction method and terminal.
  • The terminal picks up the user's voice, uses automatic speech recognition (ASR) technology to convert the user's voice into text, then uses natural language understanding (NLU) technology to analyze the converted text and identify the intention, executes the skill corresponding to the intention, and replies to the user with the execution result.
  • the terminal does not recognize the user's intention; or the terminal has recognized the user's intention, but the terminal does not support executing the skills corresponding to the intention.
  • When the terminal fails to execute the skill corresponding to the user's intention, current terminals all give a uniform "can't understand" reply, which makes the user feel that the voice reply is inaccurate, unnatural, and unintelligent, causes frustration, and results in a poor voice interaction experience.
  • The voice interaction method and terminal provided by this application can distinguish more scenarios and provide different response methods based on different scenarios, improving the accuracy of the addressee recognition results and making the voice system's responses more natural and intelligent.
  • A first aspect provides a voice interaction method, which includes: detecting a voice signal; converting the voice signal into text and performing intent recognition on the text to obtain an intent recognition result; determining an addressee recognition result based on one or more of the voice signal, the text, and the intent recognition result, where the addressee recognition result includes the source, object, and subject of the voice signal; and determining the response mode of the voice signal based on the addressee recognition result and the intention execution result.
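  • As a non-authoritative illustration of the claimed flow, the following minimal Python sketch strings the steps together; every helper name and data shape is an assumption made for this example, not wording taken from the application.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data shape for the addressee recognition result of the first
# aspect; the field values are example labels, not the application's wording.
@dataclass
class AddresseeResult:
    source: str   # e.g. "user", "speaker", "environment"
    obj: str      # e.g. "voice_system", "user", "environment"
    subject: str  # e.g. "task", "meaningless"

def handle_voice_signal(voice_signal: bytes,
                        asr: Callable[[bytes], str],
                        nlu: Callable[[str], dict],
                        detect_addressee: Callable[..., AddresseeResult],
                        execute_intent: Callable[[dict], str],
                        choose_response: Callable[[AddresseeResult, str], str]) -> str:
    text = asr(voice_signal)                        # voice signal -> text
    intent_result = nlu(text)                       # intent recognition
    addressee = detect_addressee(voice_signal, text, intent_result)
    outcome = execute_intent(intent_result)         # "success" or "failure"
    return choose_response(addressee, outcome)      # response mode
```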
  • the source of the voice signal includes one of the user, speaker or electronic device, and environment;
  • the object of the voice signal includes one of the voice system, the user, and the environment;
  • The subject of the voice signal includes a task or a meaningless category.
  • the tasks also include: one or more of execution tasks, chat tasks, encyclopedia tasks, and dialect tasks.
  • tasks can also include dialect tasks.
  • tasks can also be divided into tasks corresponding to different emotions according to the user's emotions. For example, excited emotions correspond to the task of playing cheerful music; nervous emotions correspond to the task of playing soothing light music, etc.
  • the intention execution results include successful execution of the intention and unsuccessful execution of the intention.
  • The intent execution result here may be the result obtained after the terminal itself executes the intent in the intent recognition result; or, after the terminal requests other devices to execute the intent, the execution result fed back by the other devices to the terminal. That is to say, the intention has already been executed before the terminal determines how to respond to the voice signal.
  • Alternatively, the intent execution result here may be the result of the terminal judging, based on the intention, whether it or other devices (such as servers) support executing the skill corresponding to the intention; the judgment result is the intention execution result. That is to say, the intention has not yet been executed when the terminal determines how to respond to the voice signal.
  • The addressee recognition results help distinguish different speech scenarios and help improve the recognition rate of rejection scenarios (such as multiple people talking, electronic devices playing sounds, users talking to themselves, and other scenarios).
  • The voice system can provide different response methods and play different content according to different playback templates. For example, when the intention is not successfully executed, the voice system can distinguish the specific situation through the addressee recognition results and provide more information to the user through different playback content, improving the intelligence of the voice system's interaction and making human-computer interaction more natural.
  • In one implementation, the response mode of the voice signal is determined based on the addressee recognition result and the intention execution result, including: when the source of the voice signal is the user, the object of the voice signal is the voice system, the subject of the voice signal is a task, and the intention execution result is that the intention is not successfully executed, a first prompt is issued; the first prompt is used to prompt that the voice system does not support executing the subject of the voice signal, and the first prompt includes the source of the voice signal, the object of the voice signal, and the subject of the voice signal. Or, when the source of the voice signal is the user, the object of the voice signal is the voice system, the subject of the voice signal is meaningless, and the intention execution result is unsuccessful execution of the intention, a second prompt is issued; the second prompt is used to request clarification from the user, and the second prompt includes the source of the voice signal, the object of the voice signal, and the subject of the voice signal. Or, when the source of the voice signal is a non-user, or the object of the voice signal is not the voice system, it is determined not to respond to the voice signal.
  • In one implementation, the response mode of the voice signal is determined based on the addressee recognition result and the intention execution result, including: when the source of the voice signal is a user, the object of the voice signal is another user, and the subject of the voice signal is a chat task, a third prompt is issued, and the third prompt is used to ask whether to perform the first skill associated with the voice signal; or, when the source of the voice signal is the user, the object of the voice signal is air, and the subject of the voice signal is a chat task, a fourth prompt is issued, which is used to ask whether to perform the second skill associated with the voice signal; the second skill is the same as or different from the first skill.
  • In this way, the voice system can join the conversation between two users to realize intelligent interaction among human (user 1), human (user 2), and machine (voice system) and improve the user's voice interaction experience.
  • the voice system can also perform related skills based on the conversation content between User 1 and User 2. For example, if User 1 and User 2 discuss visiting a certain tourist attraction, the voice system can ask whether it is necessary to inquire about the weather, tickets, travel strategies and other information of the tourist attraction.
  • the voice system can also interject in the user's self-talk scenes. For example, when the voice system receives voice from user 1, but the object of the voice is air, the subject of the voice signal is a chat task, and the intention execution result is unsuccessful execution of the intention, the voice system can also interrupt. Alternatively, the voice system can also ask whether to perform related skills based on the content of user 1's chat.
  • the voice system can provide richer functions and improve the intelligence of human-computer interaction of the voice system.
  • In one implementation, the response mode of the voice signal is determined based on the addressee recognition result and the intention execution result, including: based on preset rules, querying the response mode corresponding to the addressee recognition result and the intention execution result, where in the rules, when the addressee recognition result or the intention execution result is different, the corresponding response mode is different; or, inputting the addressee recognition result and the intention execution result into a pre-trained response model for inference to obtain the response mode of the speech signal.
  • This provides two specific methods to implement different response modes based on different addressee recognition results and intention execution results, as sketched below.
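  • The following is a minimal sketch of the first (rule-based) option, assuming a lookup table keyed by (source, object, subject, intention execution result); the labels and prompt names are illustrative only. The second option would replace the table lookup with inference by a pre-trained response model.

```python
# Illustrative preset rules: (source, object, subject, intent_outcome) -> response mode.
# Keys and prompt names are assumptions for the example, not the application's rule set.
RESPONSE_RULES = {
    ("user", "voice_system", "task", "failure"):        "first_prompt",   # subject not supported
    ("user", "voice_system", "meaningless", "failure"): "second_prompt",  # request clarification
    ("user", "user", "chat_task", "failure"):           "third_prompt",   # offer an associated skill
    ("user", "air", "chat_task", "failure"):            "fourth_prompt",  # offer an associated skill
}

def choose_response(source: str, obj: str, subject: str, intent_outcome: str) -> str:
    # Any combination not covered by the rules is treated here as a rejection object.
    return RESPONSE_RULES.get((source, obj, subject, intent_outcome), "no_response")
```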
  • In one implementation, determining the addressee recognition result based on one or more of the speech signal, the text, and the intent recognition result includes: inputting the speech signal into a speech recognition model for inference to obtain a dialogue classification corresponding to the speech signal, where the dialogue classification includes one of person-to-person dialogue, human-machine dialogue, electronic sound, noise, and unknown sound; inputting the text into a text recognition model for inference to obtain an initial value of the source of the speech signal, an initial value of the object of the speech signal, and an initial value of the subject of the speech signal; and inputting the dialogue classification corresponding to the speech signal, the initial value of the source of the speech signal, the initial value of the object of the speech signal, and the initial value of the subject of the speech signal into a first ensemble learning model for inference to obtain the source of the speech signal, the object of the speech signal, and the subject of the speech signal.
  • In another implementation, determining the addressee recognition result based on one or more of the speech signal, the text, and the intent recognition result includes: inputting the speech signal into the speech recognition model for inference to obtain the dialogue classification corresponding to the speech signal, which includes person-to-person dialogue, human-machine dialogue, electronic sound, noise, and unknown sound; inputting the text into the text recognition model for inference to obtain the initial value of the source of the speech signal, the initial value of the object of the speech signal, and the initial value of the subject of the speech signal; mapping the probability distribution over the intentions corresponding to the text in the intent recognition result to a probability that the text has an intention and a probability that the text has no intention; and inputting the probability that the text has an intention, the probability that the text has no intention, the dialogue classification corresponding to the speech signal, the initial value of the source of the speech signal, the initial value of the object of the speech signal, and the initial value of the subject of the speech signal into a second ensemble learning model for inference to obtain the source of the speech signal, the object of the speech signal, and the subject of the speech signal. A sketch of this feature fusion follows.
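  • The sketch below illustrates, under assumed encodings, how the second variant might assemble the fused feature vector (dialogue classification, the three initial values, and the intention/no-intention probabilities) before handing it to a pre-trained ensemble model; the encodings and names are not specified by this application.

```python
# Assumed dialogue classes output by the speech recognition model.
DIALOGUE_CLASSES = ["person_to_person", "human_machine", "electronic", "noise", "unknown"]

def intent_probabilities(intent_distribution: dict, no_intent_label: str = "no_intent"):
    """Map the per-intent probability distribution to (P(intent), P(no intent))."""
    p_no_intent = intent_distribution.get(no_intent_label, 0.0)
    p_intent = sum(p for label, p in intent_distribution.items() if label != no_intent_label)
    return p_intent, p_no_intent

def build_features(dialogue_class: str, source_init: float, object_init: float,
                   subject_init: float, intent_distribution: dict) -> list:
    p_intent, p_no_intent = intent_probabilities(intent_distribution)
    one_hot = [1.0 if dialogue_class == c else 0.0 for c in DIALOGUE_CLASSES]
    # The resulting vector would be fed to the second ensemble learning model,
    # which outputs the source, object, and subject of the speech signal.
    return one_hot + [source_init, object_init, subject_init, p_intent, p_no_intent]
```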
  • a terminal including: a processor, a memory and a touch screen.
  • the memory, the touch screen are coupled to the processor.
  • the memory is used to store computer program code.
  • The computer program code includes computer instructions; when the processor reads the computer instructions from the memory, the terminal is caused to execute the method described in the above aspect and any possible implementation manner thereof.
  • a third aspect is to provide a device, which is included in a terminal and has the function of realizing the terminal behavior in any of the above aspects and possible implementation methods.
  • This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
  • the hardware or software includes at least one module or unit corresponding to the above functions. For example, a receiving module or unit, a display module or unit, a processing module or unit, etc.
  • A fourth aspect provides a computer-readable storage medium, which includes computer instructions; when the computer instructions are run on the terminal, the terminal is caused to perform the method described in the above aspect and any possible implementation manner.
  • A fifth aspect provides a voice interaction system. The voice system includes one or more processing units; when the one or more processing units execute instructions, the one or more processing units perform the method described in the above aspect and any possible implementation manner thereof.
  • A sixth aspect provides a computer program product; when the computer program product is run on a computer, it causes the computer to execute the method described in the above aspects and any of the possible implementations.
  • A seventh aspect provides a chip system, including a processor; when the processor executes instructions, the processor performs the method described in the above aspects and any of the possible implementations.
  • For the technical effects of the terminal provided by the second aspect, the device provided by the third aspect, the computer-readable storage medium provided by the fourth aspect, the voice interaction system provided by the fifth aspect, the computer program product provided by the sixth aspect, and the chip system provided by the seventh aspect, please refer to the description of the technical effects in the first aspect and any of its possible implementation methods, which will not be repeated here.
  • Figure 1 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of a voice interaction method provided by an embodiment of the present application.
  • Figure 3 is a schematic structural diagram of a voice system provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a natural language generation module provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of some addressee recognition modules provided by embodiments of the present application.
  • Figure 6 is a schematic structural diagram of some other addressee recognition modules provided by embodiments of the present application.
  • Figure 7 is a schematic structural diagram of some other addressee recognition modules provided by embodiments of the present application.
  • Figure 8 is a schematic structural diagram of some other addressee recognition modules provided by embodiments of the present application.
  • FIG. 9 is a schematic structural diagram of a chip system provided by an embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the embodiments of this application, unless otherwise specified, "plurality" means two or more.
  • the voice interaction method provided by the embodiments of the present application can be applied to terminals with voice interaction capabilities.
  • the terminal can install the voice interaction application for providing voice interaction capabilities, such as a voice assistant or virtual assistant on a mobile phone, a voice system on a vehicle terminal, etc.
  • the technical solutions provided by the embodiments of this application can be applied to voice interaction scenarios of continuous dialogue, wake-up-free voice interaction scenarios, and full-duplex voice interaction scenarios.
  • the voice interaction scenario of continuous dialogue means that after waking up the voice interaction application, the user can continuously send multiple voice commands to the voice interaction application within a preset time period, and the voice interaction application can complete multiple voice commands.
  • the wake-up-free voice interaction scenario means that the user does not need to say the wake-up word, and the voice interaction application automatically wakes up, automatically picks up the user's voice instructions, and completes the user's voice instructions.
  • The voice interaction scenario of full-duplex dialogue is different from the single-round or multi-round continuous speech recognition scenario. Full-duplex dialogue can predict what the user is about to say in real time, generate responses in real time, and control the rhythm of the conversation, thereby achieving long-range voice interaction.
  • the application scenarios are no longer specifically limited in the embodiments of this application.
  • The terminal may be a mobile phone, a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a smart watch, a netbook, a wearable terminal, an augmented reality (AR) device, a virtual reality (VR) device, a vehicle-mounted device, a smart screen, a smart car, a smart speaker, a robot, etc.
  • Figure 1 shows a schematic structural diagram of the terminal 100.
  • the terminal 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, Mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone interface 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193, display screen 194, and user Identification module (subscriber identification module, SIM) card interface 195, etc.
  • The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal 100.
  • the terminal 100 may include more or fewer components than shown in the figures, or some components may be combined, or some components may be separated, or may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • The processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have been recently used or recycled by processor 110 . If the processor 110 needs to use the instructions or data again, it can be called directly from the memory. Repeated access is avoided and the waiting time of the processor 110 is reduced, thus improving the efficiency of the system.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the terminal 100 . While charging the battery 142, the charging management module 140 can also provide power to the terminal through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like.
  • the power management module 141 can also be used to monitor battery capacity, battery cycle times, battery health status (leakage, impedance) and other parameters.
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the terminal 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in terminal 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • Antenna 1 can be reused as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to the terminal 100.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the application processor outputs voice signals through audio devices (not limited to speaker 170A, receiver 170B, etc.), or displays images or videos through display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • The wireless communication module 160 can provide wireless communication solutions applied to the terminal 100, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and other wireless communication solutions.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation
  • the antenna 1 of the terminal 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the terminal 100 can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a Beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
  • the terminal 100 implements the display function through the GPU, the display screen 194, and the application processor.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • The display panel can use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the terminal 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the terminal 100 can implement the shooting function through the ISP, camera 193, video codec, GPU, display screen 194, application processor, etc.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the lens, the optical signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other format image signals.
  • the terminal 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the terminal 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital video.
  • Terminal 100 may support one or more video codecs.
  • the terminal 100 can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural network (NN) computing processor.
  • the NPU can realize intelligent cognitive applications of the terminal 100, such as image recognition, face recognition, speech recognition, text understanding, etc.
  • The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100.
  • The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a program storage area and a data storage area.
  • The program storage area can store an operating system and at least one application program required for a function (such as a sound playback function or an image playback function).
  • The data storage area may store data created during use of the terminal 100 (such as audio data, a phone book, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the processor 110 executes various functional applications and data processing of the terminal 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the terminal 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • The speaker 170A, also called a "horn", is used to convert audio electrical signals into voice signals.
  • the terminal 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into voice signals.
  • the terminal 100 answers a call or a voice message, the voice can be heard by bringing the receiver 170B close to the human ear.
  • The microphone 170C, also called a "mike" or "mic", is used to convert voice signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the voice signal into the microphone 170C.
  • the terminal 100 may be provided with at least one microphone 170C. In other embodiments, the terminal 100 may be provided with two microphones 170C, which in addition to collecting voice signals, may also implement a noise reduction function. In other embodiments, the terminal 100 can also be equipped with three, four or more microphones 170C to collect voice signals, reduce noise, identify sound sources, and implement directional recording functions, etc.
  • the headphone interface 170D is used to connect wired headphones.
  • the headphone interface 170D can be a USB interface 130, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • OMTP open mobile terminal platform
  • CTIA cellular telecommunications industry association of the USA
  • the buttons 190 include a power button, a volume button, etc.
  • Key 190 may be a mechanical key. It can also be a touch button.
  • the terminal 100 may receive key input and generate key signal input related to user settings and function control of the terminal 100.
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for vibration prompts for incoming calls and can also be used for touch vibration feedback.
  • touch operations for different applications can correspond to different vibration feedback effects.
  • the motor 191 can also respond to different vibration feedback effects for touch operations in different areas of the display screen 194 .
  • Different application scenarios such as time reminders, receiving information, alarm clocks, games, etc.
  • the touch vibration feedback effect can also be customized.
  • The indicator 192 can be an indicator light, which can be used to indicate the charging status and changes in battery power, or to indicate messages, missed calls, notifications, and more.
  • FIG. 2 it is a schematic flow chart of a voice interaction method provided by an embodiment of the present application.
  • the process includes:
  • the terminal detects a voice signal.
  • the terminal converts the speech signal into text, performs intent recognition based on the text, and obtains the intent recognition result of the text.
  • the terminal may install a voice interaction application for providing voice interaction capabilities.
  • applications such as voice assistants or virtual assistants on mobile phones can provide users with corresponding services (also called skills in voice interaction) based on the picked-up user voices.
  • This skill can operate functions on the mobile phone, or request related services from the server (third-party skill provider).
  • the voice system of a vehicle-mounted terminal can pick up the voice of the driver or passengers, provide the driver or passengers with car control functions, in-car audio and video entertainment playback functions, and request related services from the server (third-party skills provider).
  • the voice software of a smart speaker can pick up the voice commands of users in the room and execute the user's voice commands, such as playing related audio and video resources, controlling other smart home devices through the smart speaker, etc.
  • Figure 3 shows a software structure diagram of a voice system; the following is explained in conjunction with the software structure of the voice system shown in Figure 3.
  • After receiving the wake-up word spoken by the user, the vehicle-mounted terminal turns on the sound pickup device (such as a microphone) to pick up the sounds in the car. Alternatively, if the voice system of the vehicle-mounted terminal supports wake-up-free operation, the vehicle-mounted terminal always keeps the sound pickup device (such as a microphone) on to pick up the sounds in the car.
  • the picked up sound is input to the sound preprocessing module for preprocessing, including, for example, speech signal sampling, anti-aliasing filtering, speech enhancement, etc.
  • the processed speech signal is input into the automatic speech recognition module, and the automatic speech recognition module converts the speech signal into text.
  • the text is input into the natural language understanding module for intent recognition, and the intent recognition result is obtained.
  • The intent recognition result includes the intent and slots identified for the text. It can be understood that the intention recognition process is essentially a classification process, so the intention recognition result also includes the probability distribution of the text over the candidate intentions.
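  • One possible (assumed) shape for such an intent recognition result, combining the recognized intent, the filled slots, and the probability distribution over candidate intents, is shown below.

```python
# Field names and values are illustrative assumptions, not the application's format.
intent_result = {
    "intent": "navigate_to",                      # recognized intent, or None if unrecognized
    "slots": {"destination": "airport"},          # slot values extracted from the text
    "distribution": {                             # classification probabilities per intent
        "navigate_to": 0.82,
        "play_music": 0.11,
        "no_intent": 0.07,
    },
}
```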
  • the terminal may pick up the voices of people around, sounds played by other electronic devices in the car, environmental noise, etc.
  • the terminal may pick up the voice input of non-target users.
  • the voice input of non-target users will cause interference to the subsequent recognition of user intentions and affect the accuracy of the voice system in executing user instructions.
  • After the terminal recognizes the speech signal as text, it also performs rejection processing on invalid text, that is, it inputs the recognized text into the addressee detection (AD) module (also known as the rejection recognition module, or the rejection module for short).
  • the addressee recognition module outputs a binary classification result, that is, whether the text is a rejection object of the speech system.
  • If the text is a rejection object, the voice system of the vehicle-mounted terminal will not respond to the intention corresponding to the text; if the text is not a rejection object, the voice system of the vehicle-mounted terminal will execute the intention corresponding to the text, and so on. Therefore, the rejection processing of text helps reduce the probability of misrecognition by the speech system and improves the processing efficiency and accuracy of the speech system. A minimal sketch of this binary rejection decision is given below.
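  • A minimal sketch of the baseline binary rejection decision, assuming the classifier itself is supplied elsewhere:

```python
def is_rejected(text: str, rejection_classifier) -> bool:
    """Return True if the recognized text is a rejection object of the speech system."""
    # rejection_classifier is any assumed binary model over the ASR text.
    return bool(rejection_classifier(text))
```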
  • However, this technical solution performs rejection processing on the basis of the text that the speech system converts from the speech signal; therefore, the speech recognition capability of the speech system directly affects the accuracy rate of the rejection processing.
  • the embodiment of the present application also provides a technical solution.
  • In this technical solution, the identified addressee recognition result is not a simple binary classification result (that is, whether the speech is a rejection object), but contains multiple speech signal features, including but not limited to the source of the speech signal, the object of the speech signal, and the subject of the speech signal. It is understandable that multiple speech signal features can improve the accuracy of speech recognition. That is, the following step S203 and subsequent steps are performed.
  • the terminal determines the addressee recognition result based on one or more of the voice signal, text, and intention recognition result.
  • the addressee recognition result includes multiple features of the speech signal, such as the source of the speech signal (from), the object of the speech signal (to), and the subject of the speech signal (subject).
  • Examples of sources of voice signals include, but are not limited to, users, speakers, and the environment. When the source of the voice signal is the user, the voice can be confirmed to be a human voice. When the source of the voice signal is a speaker, the voice can be confirmed to be sound emitted by an electronic device, that is, a non-human voice. When the source of the speech signal is the environment, the speech can be identified as noise. It is understandable that identifying the source of the voice signal helps distinguish whether the voice was issued by the user, and helps distinguish whether the voice is a rejection object of the voice system. In some examples, the user source may be further divided into the driver, the co-driver, rear-row user 1, and rear-row user 2. The source of the voice signal then also helps distinguish the specific user who issued the voice, so that different response methods can be applied for different users.
  • the objects of voice signals include but are not limited to voice systems, users, and environments.
  • When the object of the voice signal is the voice system, the speech can be considered as human-computer interaction content. When the object of the voice signal is a user, the voice can be considered as a conversation between users and is a rejection object of the voice system. Of course, if the voice system supports the interjection function, then in this scenario the voice will not be rejected by the voice system. When the object of the speech signal is the environment, the speech can be considered to be the user's self-talk or singing, etc., and is a rejection object of the speech system. It is understandable that identifying the object of the voice helps distinguish whether the addressee of the voice is the voice system, and also helps distinguish whether the voice is a rejection object of the voice system.
  • The subject of the speech signal includes task and meaningless categories.
  • the task means that the voice contains the skills that the user wants the voice system to perform.
  • the meaningless category means that the voice does not contain the skills that the user wants the voice system to perform, that is, the user does not need the voice system to perform the skills.
  • tasks can be further divided into execution tasks, chat tasks, and encyclopedia tasks according to the type of tasks.
  • tasks can also include dialect tasks.
  • tasks can also be divided into tasks corresponding to different emotions according to the user's emotions. For example, excited emotions correspond to the task of playing cheerful music; nervous emotions correspond to the task of playing soothing light music, etc.
  • It should be noted that the subject feature of the speech signal extracted here can be identified and extracted based on the meaning of the speech itself. It does not rely on the text recognized by the automatic speech recognition module of the speech system, nor on the intent recognition of the text by the natural language understanding module. Therefore, the ability to extract the subject of the speech signal does not depend on the recognition capabilities of the automatic speech recognition module and the natural language understanding module of the speech system.
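  • Collected together, the feature values discussed above could be enumerated roughly as follows; the concrete label sets are assumptions made for illustration.

```python
# Possible sources of the speech signal; user sources may be subdivided by seat.
SOURCES  = ["driver", "co_driver", "rear_user_1", "rear_user_2",
            "speaker_or_electronic_device", "environment"]
# Possible objects (addressees) of the speech signal.
OBJECTS  = ["voice_system", "user", "environment"]
# Possible subjects of the speech signal; tasks may be further subdivided.
SUBJECTS = ["execution_task", "chat_task", "encyclopedia_task",
            "dialect_task", "meaningless"]
```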
  • the speech signal output by the sound preprocessing module is also input to the addressee recognition module.
  • the addressee recognition module is used to identify the characteristics of the speech, including the source of the speech signal, the object of the speech signal, and the subject of the speech signal.
  • the text recognized by the automatic speech recognition module can also be input to the addressee recognition module, and can also be used to recognize the characteristics of the speech.
  • the intention recognition result after the natural language understanding module performs intention recognition on the text can also be input to the addressee recognition module through the dialogue management module to identify the characteristics of the speech.
  • the dialogue management module can also input the context of the speech into the addressee recognition module to identify the characteristics of the speech.
  • the voice system can also start the camera to collect the user's image.
  • the image is input to the addressee recognition module after the image preprocessing module.
  • The addressee recognition module can also identify the characteristics of the voice based on image information, where the image information includes but is not limited to the number of passengers in the car, face orientation, character movements, etc. It can be understood that, based on the number of passengers in the car, face orientation, character movements, etc., the voice system can identify the current speaker, whether the speaker is talking to other people, whether the speaker is on the phone, whether the speaker is operating an electronic device, and so on, which are used to identify the characteristics of the speech.
  • the addressee recognition module can also identify the characteristics of the voice based on the data collected by the sensor (such as the number of passengers, vehicle speed, etc.).
  • In summary, the embodiments of the present application provide an addressee recognition module that recognizes input multi-modal data (such as voice, text, intent recognition results, conversation context, image data, and sensor data) and recognizes the voice features to improve recognition accuracy.
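  • As a sketch of such multi-modal fusion, the addressee recognition module could accept whichever inputs are available; the function signature and field names below are assumptions, and the real fusion model is omitted.

```python
def detect_addressee(voice=None, text=None, intent_result=None,
                     context=None, image_info=None, sensor_data=None) -> dict:
    """Fuse available modalities into one addressee recognition result."""
    inputs = {
        "voice": voice,              # preprocessed speech signal
        "text": text,                # text from the automatic speech recognition module
        "intent": intent_result,     # intent recognition result via dialogue management
        "context": context,          # dialogue context
        "image": image_info,         # e.g. passenger count, face orientation, movements
        "sensors": sensor_data,      # e.g. number of passengers, vehicle speed
    }
    available = {k: v for k, v in inputs.items() if v is not None}
    # A real module would run its recognition models here; return a placeholder.
    return {"source": None, "object": None, "subject": None,
            "used_inputs": sorted(available)}
```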
  • the terminal requests to execute the intention in the intention recognition result.
  • The embodiment of the present application does not limit the execution order of the above-mentioned steps S202 to S204. It can be understood that steps S202 to S204 can be executed sequentially or in parallel, or some steps can be executed sequentially and some steps can be executed in parallel. For example, while performing the conversion of the voice signal into text in step S202, the terminal may also perform the step of performing addressee recognition on the voice signal in step S203. For another example, after performing the intent recognition based on the text in step S202, the terminal may also perform the step of determining the addressee recognition result based on the text in step S203.
  • For another example, after the terminal executes step S202 and obtains the intention recognition result, it executes the step of performing addressee recognition according to the intention recognition result in step S203; at the same time, the terminal may also execute the intention in the intention recognition result in step S204. In short, provided the above steps do not conflict, the execution order of steps S202 to S204 can be changed.
  • the terminal determines the response mode of the voice signal based on the addressee recognition result and the intention execution result.
  • the intention execution results include successful execution of the intention and unsuccessful execution of the intention.
  • The intent execution result here may be the result obtained after the terminal itself executes the intent in the intent recognition result; or, after the terminal requests other devices to execute the intent, the execution result fed back by the other devices to the terminal. That is to say, the intention has already been executed before the terminal determines how to respond to the voice signal.
  • Alternatively, the intent execution result here may be the result of the terminal judging, based on the intention, whether it or other devices (such as servers) support executing the skill corresponding to the intention; the judgment result is the intention execution result. That is to say, the intention has not yet been executed when the terminal determines how to respond to the voice signal.
  • If the intention recognition result of the text includes an intention, that is, the speech system recognizes the intention, the speech system searches for the skill corresponding to the specific intention from the skills it supports.
  • the voice system requests the execution of the skill, or the voice system requests other systems in the vehicle-mounted terminal to execute the skill.
  • the voice system can also request the execution of the skill from other devices (such as servers) outside the vehicle-mounted terminal.
  • The voice system feeds back the execution result of the skill to the user.
  • the intention execution results include successful execution of the intention and unsuccessful execution of the intention. It is understandable that there may be various situations that cause the voice system to feedback "unsuccessful execution of intent" to the user.
  • For example, the natural language understanding module of the speech system recognizes the user's intention, but the speech system does not support executing the skill corresponding to the intention; or the natural language understanding module of the speech system recognizes the user's intention and requests the server to execute the skill corresponding to the intention, but the server does not respond, or an error occurs when the server executes the skill (such as missing slot information). If the intention recognition result of the text includes no intention, that is, the speech system does not recognize an intention, the voice system also reports back to the user that the intent was not successfully executed. It is understandable that various situations may cause the voice system to feed back "unsuccessful execution of intention" to the user; for example, the natural language understanding module of the voice system does not recognize the user's intention, or the voice collected by the voice system does not include the user's intention, and so on.
  • If the voice system does not distinguish between specific application scenarios and feeds back a uniform "unsuccessful execution of intention" (i.e., "no result") to the user, the user will feel that the voice responses are inaccurate, unnatural, and unintelligent, resulting in a poor voice interaction experience.
  • To this end, the embodiment of the present application also provides another response method of the voice system, which combines the identified addressee recognition result (such as the source of the voice signal, the object of the voice signal, and the subject of the voice signal) and the intention execution result to determine different response methods and feed back different response results to the user. It is understandable that when the speech system recognizes multiple features of the speech signal, it helps the speech system identify more finely divided application scenarios and give different response results according to those scenarios, thereby improving the intelligence of human-computer interaction and making voice responses natural and smooth, enhancing the voice interaction experience.
  • the natural language understanding module outputs an intention recognition result, where the intention recognition result includes the recognized intention or no intention.
  • the intention recognition result is input to the dialogue management module, and the dialogue management module outputs the execution result of the intention.
  • the intention recognition result includes an intention
  • the dialogue management module executes the intention and feeds back the intention execution result to the user, where the intention execution result includes successful execution of the intention and unsuccessful execution of the intention.
  • the intention execution result is an unsuccessful execution of the intention.
  • the dialogue management module inputs the determined intention execution result to the natural language generation module, and the addressee identification module inputs the addressee identification result to the natural language generation module.
  • The natural language generation module determines the final response mode of the voice based on the intent execution result and the multiple voice signal features in the addressee recognition result.
  • the vehicle-mounted terminal adopts a method based on rules and corpus templates to implement different response methods in different application scenarios.
  • the corpus template is used by the speech system to play the speech execution results to the user, which can also be called the playback template.
  • the corpus template is used by the speech system to present the speech execution results to the user in the form of a graphical interface (including text content).
  • When the natural language generation module obtains the addressee recognition result (including the source of the speech signal, the object of the speech signal, and the subject of the speech signal) and the intention execution result, it can use the addressee recognition result and the intention execution result as keywords to query the corresponding response method.
  • the voice system performs relevant operations according to the found response method, and the voice system parses the playback template corresponding to the scene.
  • The playback template includes placeholders and text; the placeholders are filled in according to the current voice and the context of this voice, and the filled-in content and the original text in the playback template are combined into the final playback text.
  • the speech system plays the playback text.
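  • A minimal sketch of the playback-template idea using Python's string.Template; the template wording and placeholder names are illustrative, not the application's corpus.

```python
from string import Template

PLAYBACK_TEMPLATES = {
    # Placeholders (${user}, ${subject}) are filled from the current voice and its context.
    "failure_prompt": Template("${user}, the voice system does not yet support ${subject}."),
    "clarify_prompt": Template("${user}, the voice system received your request, "
                               "please say it again in a different way."),
}

def render_playback(template_key: str, **fields) -> str:
    """Combine the filled placeholders with the template's fixed text."""
    return PLAYBACK_TEMPLATES[template_key].substitute(**fields)

# Example: render_playback("failure_prompt", user="User 1", subject="this task")
```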
  • The voice system can further determine whether the task corresponds to a skill that the voice system can handle. When it is determined that the task does not correspond to any skill that the voice system can handle, the voice system adopts a non-committal response method.
  • When it is determined that the task corresponds to a certain skill that the voice system can handle, the voice system responds with a failure prompt, such as "User 1, the voice system did not successfully execute your task request."
  • If the voice is meaningless speech sent by user 1 to the voice system, and the intention execution result is that the intention is not executed successfully, the response method of requesting clarification is adopted, and "User 1, the voice system has received your request, please say it again in a different way!" is played.
  • If the voice system executes the corresponding intention and the intent execution result is successful execution of the intent, "User 1, the voice system has completed the task!" is played.
  • When the source of the voice is a non-user, or the object of the voice is not the voice system, the voice is not responded to.
  • For example, when the voice received by the voice system comes from user 1, the object of the voice signal is user 2, the subject of the voice signal is a task, and the intention execution result is unsuccessful execution of the intention, the voice system does not respond. When the voice system receives voice from user 1 but the object of the voice is air, the subject of the voice signal is a task, and the intention execution result is unsuccessful execution of the intention, the voice system does not respond. Likewise, when the speech system receives speech from an electronic device, the object of the speech is air, the subject of the speech signal is a task, and the intention execution result is unsuccessful execution of the intention, the voice system does not respond.
  • the addressee recognition result is helpful in distinguishing different speech scenarios, and in improving the recognition rate of rejection scenarios (such as people talking to each other, electronic devices playing sounds, users talking to themselves, etc.).
  • the voice system can provide different response methods and play different content according to different playback templates. For example, when the intention is not successfully executed, the voice system can distinguish the specific situation through the addressee recognition result and provide more information to the user through different playback content, improving the intelligence of the voice system's interaction and making human-computer interaction more natural.
  • different speech signal themes can also be set based on the needs of different scenarios.
  • tasks are further divided into execution tasks, chat tasks, and encyclopedia tasks.
  • Table 2 shows another example of the correspondence between the addressee recognition result (including the source of the speech signal, the object of the speech signal, and the subject of the speech signal), the intention execution result, the response method, and the playback template.
  • the voice is an execution task requested by user 1 from the voice system
  • the intent execution result is successful execution of the intent
  • "User 1 the voice system has completed your task request”
  • the voice is an execution task requested by user 1 from the voice system
  • if the intention execution result is that the intention was not successfully executed, a fallback response method is used, playing "User 1, the voice system cannot complete your task request yet, please give me time to learn!"
  • when the voice comes from user 1, the object of the voice is user 2, the subject of the voice signal is a chat task, and the intention execution result is unsuccessful execution of the intention, the voice system can interject.
  • the voice system can join the conversation between user 1 and user 2 to realize intelligent human (user 1)-human (user 2)-machine (voice system) interaction and improve the user's voice interaction experience.
  • the voice system can also perform related skills based on the conversation content between User 1 and User 2. For example, if User 1 and User 2 discuss visiting a certain tourist attraction, the voice system can ask whether it is necessary to inquire about the weather, tickets, travel strategies and other information of the tourist attraction.
  • when the voice system receives the voice from user 1, but the object of the voice is air, the subject of the voice signal is a chat task, and the intention execution result is unsuccessful execution of the intention, the voice system can also interject. Alternatively, the voice system can ask whether to perform related skills based on the content of user 1's chat.
  • the voice system can provide richer functions and improve the intelligence of human-computer interaction of the voice system.
  • the vehicle-mounted terminal can also implement different response methods in different application scenarios based on machine learning methods. That is, machine learning methods are used to train the model of the natural language generation module.
  • a pre-trained language model is used as the encoder, for example, a Bidirectional Encoder Representation from Transformers (BERT) model based on transformers is used as the encoder.
  • BERT: Bidirectional Encoder Representation from Transformers
  • a large number of training samples are input into the encoder for training, for example, using an autoregressive method for training to obtain a natural language generation module.
  • the training samples can be corpus-response pairs written by developers or generated by a machine according to certain rules.
  • the corpus is speech, and the developer annotates each speech.
  • the annotated content includes the source, object and subject of the speech, the text converted from the speech, the recognized intention, the slots, the intention execution result, and the response method (see the training-sample serialization sketch after this list). Developers can annotate manually, or input the speech into the model shown in Figure 3.
  • the addressee recognition module identifies the source, object, and subject characteristics of the speech; the automatic speech recognition module converts the speech into text;
  • the natural language understanding module identifies the intention and slot of the text; the dialogue management module outputs the intention execution result; and determines the desired response method and playback content, and annotates the speech.
  • the natural language generation module obtained by training on the above training samples can achieve different response methods (and playback content) in different application scenarios. Figure 4 shows an example of the natural language generation module obtained after training.
  • after parameters such as the source, object and subject of the voice, the text converted from the voice, the recognized intention, the slots, and the intention execution result are input into the natural language generation module, the module runs and outputs the response method and the playback content of the response. It can be seen that using the capabilities of a pre-trained language model can achieve more diverse and flexible response methods and playback content, and improve the intelligence of human-computer interaction.
  • the addressee recognition module includes a voice addressee recognition (Sound-based Addressee Detection, SAD) model. That is to say, the voice addressee recognition model receives the speech signal processed by the sound preprocessing module, recognizes the speech signal, and identifies multiple features of the speech signal, such as the source of the speech signal, the object of the speech signal, and the subject of the speech signal.
  • the speech recipient recognition model includes a speech recognition model.
  • the speech recognition model is, for example, a Transformer speech recognition model or, more specifically, a Convolution-augmented Transformer for Speech Recognition (Conformer) model.
  • training samples can be input into the pre-trained model for training.
  • the training samples include speech, as well as the source, object and theme of the speech annotation.
  • annotators can annotate the source, object and theme of the speech based on the meaning of the speech itself.
  • the trained speech recognition model can reason about the input speech and deduce the source, object, theme and other characteristics of the speech. It is understandable that the characteristics of the subject of the speech signal extracted here are identified and extracted based on the meaning of the speech itself. It does not rely on the text recognized by the automatic speech recognition module of the speech system, nor does it rely on the natural language understanding module's intention recognition of the text. Therefore, the ability to extract the subject of the speech signal here does not rely on the recognition capabilities of the automatic speech recognition module and natural language understanding module of the speech system.
  • the speech recipient recognition model may include a speech recognition model (such as a Transformer speech recognition model) and an integrated learning model.
  • the training samples can be input into the pre-trained model.
  • the training samples include speech, and dialogue classification of speech annotations.
  • the dialogue classification includes, for example, human-to-human dialogue, human-computer dialogue, electronic sound (that is, sound played by electronic equipment), noise, and unknown sounds. It should be noted that annotators can classify conversations based on the meaning of the speech itself. Subsequently, the trained speech recognition model can infer the input speech and deduce the probability distribution of each dialogue category corresponding to the speech.
  • training samples can be input into the pre-trained ensemble learning model.
  • the training samples include the probability distribution of each dialogue category corresponding to the speech, as well as the source, object and theme of the labeled speech.
  • the trained ensemble learning model can infer the probability distribution of the dialogue classification of the input speech, and infer the source, object, and topic of the speech. That is to say, when the speech is input into the speech recipient recognition model, through the reasoning of the speech recognition model and the integrated learning model, the source, object, subject and other characteristics of the speech can be obtained.
  • the addressee recognition module includes a Text-to-Speech Addressee Detection (TAD) model. That is to say, as shown in (1) in Figure 6, the text addressee recognition model receives the text converted by the automatic speech recognition module, and by recognizing the text, it identifies multiple features of the speech signal, such as the source of the speech signal, the object of the speech signal and the subject of the speech signal.
  • the text addressee recognition model includes a text recognition model, which includes a splicing module, a BERT encoder, and a decoder. Among them, the splicing module is used to splice the speech-converted text and the preset template.
  • the preset template includes multiple prompts, one prompt corresponding to one voice feature.
  • the prompts correspond to the source, object, and subject of the speech signal. For example, the spliced content is "[Source of voice signal] says [Subject of voice signal] to [Object of voice signal]: text after voice conversion" (see the splicing sketch after this list).
  • training samples can be input into the pre-trained model for training. Among them, the training samples include the text after speech conversion, as well as the source, object and theme of the speech annotated in the text.
  • the trained text recognition model can reason about the speech-converted text and deduce the source, object, theme and other characteristics of the speech.
  • the addressee recognition model includes a voice addressee recognition model, a text addressee recognition model, and an integrated learning model.
  • the voice addressee recognition model can refer to the speech recognition model shown in (1) in Figure 5 above
  • the text addressee recognition model can refer to the text recognition model shown in (1) and (2) in Figure 6 above, and will not be described again here.
  • the ensemble learning model can integrate the recognition results of the speech addressee recognition model and the text addressee recognition model, and finally output features such as the source, object, and topic of the speech.
  • voice is streaming data, which is a set of data sequences that arrive sequentially, in large quantities, quickly, and continuously.
  • Text is non-streaming data. Therefore, when fusing the recognition results of the speech addressee recognition model and the recognition results of the text addressee recognition model, the Voice Activity Detection (VAD) method can be used to cut the speech stream into multiple speech segments. Each speech segment is input into the speech addressee recognition model for recognition, and the text corresponding to that speech segment is input into the text addressee recognition model for recognition, so that the speech and text are aligned and the two corresponding recognition results can be fused (see the segmentation-and-fusion sketch after this list).
  • VAD: Voice Activity Detection
  • the voice system includes a voice activation detection module.
  • the voice activation detection module cuts the audio stream into multiple voice segments.
  • each speech segment is input into the speech addressee recognition model for recognition, and the probability of conversation classification corresponding to each speech segment is output.
  • the converted text of each speech segment is input into the text addressee recognition model for recognition, and the source, object, and theme of the speech signal corresponding to each speech segment are output.
  • the speech segment processed by the speech addressee recognition model is aligned with the text processed by the text addressee recognition model.
  • the recognition results output by the two models are input into the integrated learning model for reasoning, and the source, object, and theme of the fused speech are obtained.
  • the voice addressee recognition model includes a voice activation detection module. Specifically, when the voice addressee recognition model receives the voice stream, the voice activation detection module cuts the audio stream into multiple speech segments and sends the sentence break points to the text addressee recognition model. The break points are used to trigger the text addressee recognition model to recognize the corresponding text content. At this point, the speech segments processed by the voice addressee recognition model are aligned with the text processed by the text addressee recognition model. Then, the recognition results output by the two models are input into the integrated learning model for reasoning, and the source, object, and theme of the fused speech are obtained.
  • the addressee identification model further includes an intended addressee identification model.
  • the intended addressee recognition model includes an intent mapping module, which is used to map the intent probability distribution of the speech output by the natural language understanding module into a probability of having an intention and a probability of no intention. For example, after speech 1 is input into the natural language understanding module, the intent probability distribution of speech 1 is obtained: the probability of intention 1 is probability 1, the probability of intention 2 is probability 2, the probability of intention 3 is probability 3, and the probability of no intention is probability 4. Then, after mapping by the intent mapping module, the probability of having an intention is probability 1 + probability 2 + probability 3, and the probability of no intention is probability 4 (see the intent-mapping sketch after this list).
  • the intended addressee recognition model is beneficial to improving the recognition of the subject of the speech signal.
  • the ensemble learning model can integrate the recognition results of the voice addressee recognition model, the text addressee recognition model, and the intended addressee recognition model, and finally output the source, object, topic and other features of the speech (see the feature-vector sketch after this list).
  • the dialogue management module can also input the context of the speech into the intended addressee recognition model to assist in outputting features such as the source, object, and subject of the speech.
  • the speech activation detection module can still be used to align the text processed by the intended addressee recognition model, the speech processed by the speech addressee recognition model, and the text processed by the text addressee recognition model.
  • alignment methods please refer to the speech alignment method processed by the speech addressee recognition model and the text alignment method processed by the text addressee recognition model mentioned above, which will not be described again here.
  • the addressee recognition module may also include an image addressee recognition module, which is used to identify the characteristics of the voice corresponding to the image through the user's image.
  • the addressee recognition module can also include any combination of the above sub-models (speech recognition model, text recognition model, intended addressee recognition model, image addressee recognition model, etc.).
  • the addressee recognition module may include a voice addressee recognition model and an intended addressee recognition model, or a text addressee recognition model and an intended addressee recognition model, etc.
  • the embodiments of the present application use the addressee recognition module to recognize the input multi-modal data (such as voice, text, intent recognition results, conversation context, image data, sensor data, etc.), and the recognition result includes multiple features of the speech signal. It is understandable that the multiple features of the speech signal are conducive to improving the accuracy of addressee recognition and to distinguishing more application scenarios, making it easier for the speech system to give different response results according to more subdivided application scenarios, thereby improving the intelligence of human-computer interaction and the naturalness and fluency of voice responses, and enhancing the voice interaction experience.
  • the chip system includes at least one processor 1101 and at least one interface circuit 1102.
  • the processor 1101 and the interface circuit 1102 may be interconnected by wires.
  • interface circuitry 1102 may be used to receive signals from other devices, such as the memory of terminal 100.
  • interface circuit 1102 may be used to send signals to other devices (eg, processor 1101).
  • the interface circuit 1102 can read instructions stored in the memory and send the instructions to the processor 1101.
  • when the instructions are executed by the processor 1101, the terminal can be caused to perform the various steps performed by the terminal 100 (such as a mobile phone) in the above embodiments.
  • the chip system may also include other discrete devices, which are not specifically limited in the embodiments of this application.
  • An embodiment of the present application also provides a device, which is included in a terminal and has the function of realizing the terminal behavior in any of the methods in the above embodiments.
  • This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
  • the hardware or software includes at least one module or unit corresponding to the above functions. For example, detection module or unit, display module or unit, determination module or unit, and calculation module or unit, etc.
  • Embodiments of the present application also provide a computer storage medium that includes computer instructions.
  • the computer instructions When the computer instructions are run on a terminal, the terminal is caused to perform any of the methods in the above embodiments.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product When the computer program product is run on a computer, it causes the computer to perform any of the methods in the above embodiments.
  • the above-mentioned terminals include hardware structures and/or software modules corresponding to each function.
  • Persons skilled in the art should easily realize that, in conjunction with the units and algorithm steps of each example described in the embodiments disclosed herein, the embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Professionals and technicians may use different methods to implement the described functions for each specific application, but such implementations should not be considered to be beyond the scope of the embodiments of the present invention.
  • Embodiments of the present application can divide the above terminals into functional modules according to the above method examples.
  • each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
  • the above integrated modules can be implemented in the form of hardware or software function modules. It should be noted that the division of modules in the embodiment of the present invention is schematic and is only a logical function division. In actual implementation, there may be other division methods.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods described in various embodiments of this application.
  • the aforementioned storage media include: flash memory, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk and other media that can store program codes.
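
The rule-and-corpus-template mechanism described above (looking up a response method and playback template by the addressee recognition result and the intention execution result, then filling the template's placeholders) can be illustrated with a minimal sketch. Everything below, including the rule keys, the template strings, and the field names, is an assumption made for illustration rather than the actual tables or wording of this application.

```python
# Minimal sketch of a Table-1-style lookup:
# (source, object, subject, intent result) -> (response method, playback template).
RESPONSE_RULES = {
    ("user", "voice_system", "task", "failure"): (
        "fallback",
        "{user}, the voice system cannot complete your task request yet, "
        "please give me time to learn!",
    ),
    ("user", "voice_system", "meaningless", "failure"): (
        "request_clarification",
        "{user}, the voice system has received your request, "
        "please say it again in a different way!",
    ),
    ("user", "voice_system", "task", "success"): (
        "report_result",
        "{user}, the voice system has completed your task request.",
    ),
}

def decide_response(addressee_result: dict, intent_result: str):
    """Look up the response method and fill the playback template's placeholders."""
    key = (
        addressee_result["source"],
        addressee_result["object"],
        addressee_result["subject"],
        intent_result,
    )
    rule = RESPONSE_RULES.get(key)
    if rule is None:
        return None  # e.g. non-user source or non-system object: do not respond
    method, template = rule
    playback_text = template.format(user=addressee_result.get("speaker_name", "User 1"))
    return method, playback_text

if __name__ == "__main__":
    print(decide_response(
        {"source": "user", "object": "voice_system", "subject": "task",
         "speaker_name": "User 1"},
        "failure",
    ))
```

A missing rule entry plays the role of the no-response cases in the description (voice from a non-user source, or voice not addressed to the voice system).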
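
The further check for the "intention not successfully executed" case, which separates a task with no matching skill at all (fallback reply) from a known skill whose execution failed (failure prompt), could look like the sketch below. The skill registry and the reply strings are assumptions.

```python
# Sketch of the skill-support check; SUPPORTED_SKILLS and the replies are illustrative.
SUPPORTED_SKILLS = {"control_sunroof", "play_music", "navigate"}

def failure_reply(intent: str | None, user: str = "User 1") -> str:
    if intent is None or intent not in SUPPORTED_SKILLS:
        # no skill the voice system can handle -> fallback (catch-all) response
        return (f"{user}, the voice system cannot complete your task request yet, "
                f"please give me time to learn!")
    # the skill exists but its execution did not succeed -> failure prompt
    return f"{user}, the voice system did not successfully execute your task request."

print(failure_reply("control_sunroof"))  # failure prompt
print(failure_reply("order_pizza"))      # fallback response
```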
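
For the machine-learning variant of the natural language generation module, one way to turn an annotated utterance (source, object, subject, converted text, intention, slots, intention execution result, response method and playback text) into a training pair for a BERT-encoder / autoregressive-decoder model is sketched below. The field names, the separator format, and the example values are assumptions, not the actual annotation schema of this application.

```python
# Sketch: serialize one annotated utterance into an (input, target) training pair.
def build_nlg_sample(annotation: dict) -> tuple[str, str]:
    source_text = (
        f"source: {annotation['source']} | "
        f"object: {annotation['object']} | "
        f"subject: {annotation['subject']} | "
        f"text: {annotation['asr_text']} | "
        f"intent: {annotation['intent']} | "
        f"slots: {annotation['slots']} | "
        f"intent_result: {annotation['intent_result']}"
    )
    target_text = f"{annotation['response_method']} || {annotation['playback_text']}"
    return source_text, target_text

sample = build_nlg_sample({
    "source": "user1", "object": "voice_system", "subject": "execution_task",
    "asr_text": "open the sunroof", "intent": "control_sunroof",
    "slots": {"action": "open"}, "intent_result": "failure",
    "response_method": "fallback",
    "playback_text": "User 1, the voice system cannot complete your task request yet.",
})
print(sample[0])
print(sample[1])
```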
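
The splicing module of the text addressee recognition model concatenates the ASR text with a preset template whose prompts correspond to the source, object and subject of the speech signal. A minimal sketch of that step, assuming a mask-token convention for the unfilled slots, is given below; the exact template wording and mask convention are assumptions.

```python
# Sketch of the splicing step that prepares the BERT encoder input.
PROMPT_TEMPLATE = "[{source}] says [{subject}] to [{object}]: {text}"

def splice_for_tad(asr_text: str, mask_token: str = "[MASK]") -> str:
    """Build the encoder input; the three bracketed slots start as mask tokens
    that the decoder is trained to fill with the predicted features."""
    return PROMPT_TEMPLATE.format(
        source=mask_token, subject=mask_token, object=mask_token, text=asr_text
    )

print(splice_for_tad("please turn on the air conditioner"))
# [[MASK]] says [[MASK]] to [[MASK]]: please turn on the air conditioner
```

The BERT encoder and decoder would then be trained to fill the three slots with the source, subject and object; that part is omitted here.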
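
Because speech is streaming data and text is not, the description aligns them by cutting the audio at voice-activity-detection break points and fusing the per-segment results. The sketch below uses stand-in callables for the speech-side model, the text-side model and the ensemble model; those stubs and their output formats are assumptions, not real APIs.

```python
# Sketch of per-segment alignment and fusion of SAD and TAD results.
from dataclasses import dataclass

@dataclass
class Segment:
    start_ms: int
    end_ms: int
    audio: bytes
    text: str  # ASR output aligned to this segment

def fuse_per_segment(segments, sad_model, tad_model, ensemble):
    results = []
    for seg in segments:
        dialogue_probs = sad_model(seg.audio)   # e.g. {"human_machine": 0.7, ...}
        text_features = tad_model(seg.text)     # e.g. {"source": "user", ...}
        results.append(ensemble(dialogue_probs, text_features))
    return results

if __name__ == "__main__":
    segs = [Segment(0, 900, b"...", "turn on the air conditioner")]
    sad = lambda audio: {"human_machine": 0.7, "human_human": 0.2, "noise": 0.1}
    tad = lambda text: {"source": "user", "object": "voice_system", "subject": "task"}
    fuse = lambda probs, feats: {**feats, "confidence": probs.get("human_machine", 0.0)}
    print(fuse_per_segment(segs, sad, tad, fuse))
```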
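
The intent mapping module collapses the NLU intent probability distribution into a "has intent" probability (the sum of all concrete intents) and a "no intent" probability, as in the probability 1 + probability 2 + probability 3 example above. A minimal sketch, with illustrative key names:

```python
# Sketch of the intent mapping step; dictionary key names are assumptions.
def map_intent_probs(intent_probs: dict[str, float], no_intent_key: str = "no_intent"):
    p_none = intent_probs.get(no_intent_key, 0.0)
    p_intent = sum(p for k, p in intent_probs.items() if k != no_intent_key)
    return {"has_intent": p_intent, "no_intent": p_none}

print(map_intent_probs({"play_music": 0.5, "navigate": 0.2, "ask_weather": 0.1,
                        "no_intent": 0.2}))
# {'has_intent': 0.8, 'no_intent': 0.2}
```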
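
One simple way the outputs of the three sub-models (the dialogue-class probabilities from the voice model, the initial source/object/subject values from the text model, and the mapped intent probabilities) could be flattened into a single feature vector for the ensemble learning model is sketched below; the category lists and the one-hot encoding are assumptions for illustration.

```python
# Sketch of feature construction for the ensemble ("integrated learning") model.
DIALOGUE_CLASSES = ["human_human", "human_machine", "electronic", "noise", "unknown"]
SOURCES = ["user", "speaker", "environment"]
OBJECTS = ["voice_system", "user", "environment"]
SUBJECTS = ["task", "meaningless"]

def one_hot(value: str, vocabulary: list[str]) -> list[float]:
    return [1.0 if value == v else 0.0 for v in vocabulary]

def build_feature_vector(dialogue_probs: dict, tad_initial: dict, intent_probs: dict):
    vec = [dialogue_probs.get(c, 0.0) for c in DIALOGUE_CLASSES]
    vec += one_hot(tad_initial["source"], SOURCES)
    vec += one_hot(tad_initial["object"], OBJECTS)
    vec += one_hot(tad_initial["subject"], SUBJECTS)
    vec += [intent_probs["has_intent"], intent_probs["no_intent"]]
    return vec

features = build_feature_vector(
    {"human_machine": 0.7, "human_human": 0.2, "noise": 0.1},
    {"source": "user", "object": "voice_system", "subject": "task"},
    {"has_intent": 0.8, "no_intent": 0.2},
)
print(len(features), features)
```

A gradient-boosting or similar classifier trained on such vectors would then output the final source, object and subject of the speech signal.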

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

一种语音交互方法及终端,涉及人机交互领域,可以区分更多的场景,并基于不同的场景给出不同的响应方式,提升受话人识别结果的准确性,以及使得语音系统回复更加自然和智能,该方法包括:在语音交互过程中,根据检测到的语音信号、语音信号转换后的文本、针对文本进行意图识别的结果中的一项或多项,确定受话人识别结果,其中受话人识别结果包括语音信号的来源、对象和主题;根据受话人识别结果,以及意图执行结果,确定语音信号的响应方式;当受话人识别结果和意图执行结果不同时,语音信号的响应方式不同。还提供了一种计算机可读存储介质、芯片系统及语音交互系统。

Description

一种语音交互方法及终端
本申请要求于2022年6月1日提交国家知识产权局、申请号为202210629293.4、申请名称为“一种语音交互方法及终端”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人机交互领域,尤其涉及一种语音交互方法及终端。
背景技术
在语音交互过程中,终端拾取用户的语音,采用自动语音识别(Automatic Speech Recognition,ASR)技术将用户的语音转换为文字,然后采用自然语言理解(Natural Language Understanding,NLU)技术对转换后的文字进行意图识别,再执行该意图对应的技能,并向用户回复执行结果。
可以理解的是,在终端真实的处理过程中,可能存在多种情况造成终端最终没有执行相应的技能。例如:终端未识别出用户的意图;或者,终端已识别出用户的意图,但终端并不支持执行该意图对应的技能等。然而,针对终端最终没有执行用户的意图对应的技能的情况,目前终端全部给出统一的“听不懂”回复,会给用户造成语音回复不准确、不自然、不智能的感受,造成用户的语音交互体验不佳。
发明内容
本申请提供的一种语音交互方法及终端,可以区分更多的场景,并基于不同的场景给出不同的响应方式,提升受话人识别结果的准确性,以及使得语音系统回复更加自然和智能。
为了实现上述目的,本申请实施例提供了以下技术方案:
第一方面、提供一种语音交互方法,该方法包括:检测到语音信号;将语音信号转换为文本,并对文本进行意图识别,得到意图识别结果;根据语音信号、文本、意图识别结果中的一项或多项,确定受话人识别结果,受话人识别结果包括语音信号的来源、对象和主题;根据受话人识别结果,以及意图执行结果,确定语音信号的响应方式。
其中,语音信号的来源包括用户、扬声器或电子设备、环境中一项;语音信号的对象包括语音系统、用户、环境中一项;语音信号的主题包括任务或无意义。可选的,任务还包括:执行类任务、闲聊任务、百科类任务、方言类任务中一项或多项。
可选的,任务还可以包括方言类任务。另一些示例中,还可以根据用户的情感将任务划分为不同情感对应的任务,例如,兴奋的情感对应的播放欢快类音乐的任务;紧张的情感对应播放舒缓类的轻音乐的任务等。
其中,意图执行结果包括成功执行意图和未成功执行意图。需要说明的是,一些示例中,这里的意图执行结果可以是终端请求执行意图识别结果中的意图后,终端自身已执行该意图的结果;或者,终端请求其他设备执行该意图后,其他设备向该终端反馈的执行结果。也就是说,在终端确定语音信号的响应方式之前,该意图已经执行。 另一些示例中,这里的意图执行结果也可以终端根据意图判断自身或者其他设备(如服务器)是否支持执行该意图对应的技能,该判断结果即为意图执行结果。也就是说,在终端确定语音信号的响应方式之前,该意图没有被执行。
由此可见,受话人识别结果(语音的来源,对象和主题)有利于区分语音的不同场景,有利于提升拒绝识别场景(例如人人对话,电子设备播放声音、用户的自言自语等场景)的识别率。另外,基于不同场景,语音系统可以提供不同的响应方式,按照不同的播放模板播放不同的内容。例如,针对未成功执行意图的情况,语音系统可以通过受话人识别结果区分具体的情况,通过不同的播放内容向用户提供更多的信息,提升语音系统的交互的智能化,使得人机交互更加自然。
一种可能的实现方式中,根据受话人识别结果,以及意图执行结果,确定语音信号的响应方式,包括:当语音信号的来源为用户,语音信号的对象为语音系统,语音信号的主题为任务,意图识别执行结果为未成功执行意图时,发出第一提示,第一提示用于提示语音系统不支持执行语音信号的主题;第一提示包括语音信号的来源、语音信号的对象、以及语音信号的主题;或者,当语音信号的来源为用户,语音信号的对象为语音系统,语音信号的主题为无意义,意图识别执行结果为未成功执行意图时,发出第二提示;第二提示用于请求用户澄清,第一提示包括语音信号的来源、语音信号的对象、以及语音信号的主题;或者,当语音信号的来源为非用户,或者,语音信号的对象为非语音系统,确定不响应语音信号。
由此可见,提供几种不同场景下不同响应方式的具体实现。
一种可能的实现方式中,根据受话人识别结果,以及意图执行结果,确定语音信号的响应方式,包括:当语音信号的来源为用户,且语音信号的对象为另一个用户,语音信号的主题为闲聊任务时,发出第三提示,第三提示用于询问是否执行与语音信号关联的第一技能;或者,当语音信号的来源为用户,且语音信号的对象为空气,语音信号的主题为闲聊任务时,发出第四提示,第四提示用于询问是否执行与语音信号关联的第二技能;第二技能与第一技能相同或不同。
也就是说,语音系统可以加入两个用户的对话,实现人(用户1)-人(用户2)-机(语音系统)的智能交互,提升用户的语音交互。或者,语音系统还可以根据用户1和用户2的交谈内容,执行相关的技能。比如,用户1和用户2商量去某个旅游景点游玩,则语音系统可以询问是否需要查询该旅游景点的天气、车票、旅游攻略等信息。
语音系统还可以在用户的自言自语的场景中进行插话。例如,当语音系统接收到的语音来自用户1,但语音的对象为空气,语音信号的主题为闲聊任务,意图执行结果为未成功执行意图,则语音系统也可以进行插话。或者,语音系统还可以根据用户1闲聊的内容,询问是否执行相关的技能。
由此可见,当基于受话人识别结果(语音的来源,对象和主题)区分出细分的场景后,语音系统可以提供更加丰富的功能,提升了语音系统的人机交互的智能化。
一种可能的实现方式中,根据受话人识别结果,以及意图执行结果,确定语音信号的响应方式,包括:基于预设规则,查询受话人识别结果以及意图执行结果对应的响应方式;规则中受话人识别结果或意图执行结果不同时,对应的响应方式不同;或 者,将受话人识别结果以及意图执行结果输入到预先训练好的响应模型中进行推理,得到语音信号的响应方式。
由此提供了两种实现基于不同受话人结果和意图执行结果,实现不同响应方式的具体方法。
一种可能的实现方式中,根据语音信号、文本、意图识别结果中的一项或多项,确定受话人识别结果,包括:将语音信号输入到语音识别模型中进行推理,得到语音信号对应的对话分类,对话分类包括人人对话、人机对话、电子音、噪声和未知声音中的一项;将文本输入到文本识别模型中进行推理,得到语音信号的来源初值、语音信号的对象初值、以及语音信号的主题初值;将语音信号对应的对话分类、语音信号的来源初值、语音信号的对象初值、以及语音信号的主题初值输入到第一集成学习模型中进行推理,得到语音信号的来源、语音信号的对象、以及语音信号的主题。由此提供了一种受话人识别方法的具体实现。
一种可能的实现方式中,根据语音信号、文本、意图识别结果中的一项或多项,确定受话人识别结果,还包括:将语音信号输入到语音识别模型中进行推理,得到语音信号对应的对话分类,对话分类包括人人对话、人机对话、电子音、噪声和未知声音中的多项;将文本输入到文本识别模型中进行推理,得到语音信号的来源初值、语音信号的对象初值、以及语音信号的主题初值;根据意图执行结果中文本对应的各个意图的概率分布,映射为文本的有意图的概率和无意图的概率;将文本的有意图的概率和无意图的概率,语音信号对应的对话分类、语音信号的来源初值、语音信号的对象初值、以及语音信号的主题初值输入到第二集成学习模型中进行推理,得到语音信号的来源、语音信号的对象、以及语音信号的主题。由此提供了又一种受话人识别方法的具体实现。
第二方面、提供一种终端,包括:处理器、存储器和触摸屏,所述存储器、所述触摸屏与所述处理器耦合,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述处理器从所述存储器中读取所述计算机指令,使得终端执行如上述方面及其中任一种可能的实现方式中所述的方法。
第三方面、提供一种装置,该装置包含在终端中,该装置具有实现上述方面及可能的实现方式中任一方法中终端行为的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。硬件或软件包括至少一个与上述功能相对应的模块或单元。例如,接收模块或单元、显示模块或单元、以及处理模块或单元等。
第四方面、提供一种计算机可读存储介质,包括计算机指令,当计算机指令在终端上运行时,使得终端执行如上述方面及其中任一种可能的实现方式中所述的方法。
第五方面、提供一种语音交互系统,所述语音系统包括一个或多个处理单元,当所述一个或多个处理单元执行指令时,所述一个或多个处理单元执行如上述方面及其中任一种可能的实现方式中所述的方法。
第六方面、提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行如上述方面中及其中任一种可能的实现方式中所述的方法。
第七方面、提供一种芯片系统,包括处理器,当处理器执行指令时,处理器执行如上述方面中及其中任一种可能的实现方式中所述的方法。
上述第二方面提供的终端、第三方面提供的装置、第四方面提供的计算机可读存储介质、第五方面提供的语音交互系统、第六方面提供的计算机程序产品以及第七方面提供的芯片系统所能达到的技术效果可以参考第一方面以及其中任一种可能的实现方式中关于技术效果的描述,这里不再赘述。
附图说明
图1为本申请实施例提供的一种终端的结构示意图;
图2为本申请实施例提供的一种语音交互方法的流程示意图;
图3为本申请实施例提供的一种语言系统的结构示意图;
图4为本申请实施例提供的一种自然语言生成模块的结构示意图;
图5为本申请实施例提供的一些受话人识别模块的结构示意图;
图6为本申请实施例提供的又一些受话人识别模块的结构示意图;
图7为本申请实施例提供的又一些受话人识别模块的结构示意图;
图8为本申请实施例提供的又一些受话人识别模块的结构示意图;
图9为本申请实施例提供的一种芯片系统的结构示意图。
具体实施方式
在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
本申请实施例提供的语音交互方法可应用于具备语音交互能力的终端。一些示例中,终端可以安装用于提供语音交互能力的该语音交互类应用,例如手机上的语音助手或虚拟助理等,车载终端上的语音系统等。本申请实施例提供的技术方案可应用于连续对话的语音交互场景、免唤醒的语音交互场景、以及全双工语音交互场景中。其中,连续对话的语音交互场景是指用户在唤醒语音交互类应用后,在预设时长内可以连续向语音交互类应用发送多条语音指令,语音交互类应用可以完成多条语音指令。免唤醒语音交互场景是指用户无需说出唤醒词,语音交互类应用自动唤醒,自动拾取用户的语音指令,并完成用户的语音指令。其中,全双工对话的语音交互场景,与单轮或者多轮连续语音识别场景不同,全双工对话可实时预测用户即将说出的内容,实时生成回应并控制对话节奏,从而实现长程语音交互。本申请实施例对应用场景不再具体限定。
示例性的,本申请实施例中终端例如可以为手机、平板电脑、个人计算机(personal  computer,PC)、个人数字助理(personal digital assistant,PDA)、智能手表、上网本、可穿戴终端、增强现实技术(augmented reality,AR)设备、虚拟现实(virtual reality,VR)设备、车载设备、智慧屏、智能汽车、智能音响、机器人等,本申请对该终端的具体形式不做特殊限制。
图1示出了终端100的结构示意图。
终端100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本发明实施例示意的结构并不构成对终端100的具体限定。在本申请另一些实施例中,终端100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过终端100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为终端供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。在其他一些实施 例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。
终端100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。终端100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在终端100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出语音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在终端100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,终端100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得终端100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球 导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
终端100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端100可以包括1个或N个显示屏194,N为大于1的正整数。
终端100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当终端100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。终端100可以支持一种或多种视频编解码器。这样,终端100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现终端100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展终端 100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储终端100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行终端100的各种功能应用以及数据处理。
终端100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为语音信号。终端100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成语音信号。当终端100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将语音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将语音信号输入到麦克风170C。终端100可以设置至少一个麦克风170C。在另一些实施例中,终端100可以设置两个麦克风170C,除了采集语音信号,还可以实现降噪功能。在另一些实施例中,终端100还可以设置三个,四个或更多麦克风170C,实现采集语音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动终端平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。终端100可以接收按键输入,产生与终端100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触摸操作,马达191也可对应不同的振动反馈效果。不同的应用场景(例如:时间提醒,接收信息,闹钟,游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示 消息,未接来电,通知等。
以下实施例中所涉及的技术方案均可以在具有上述架构的终端100中实现。
下面结合附图对本申请实施例提供的技术方案进行详细说明。
如图2所示,为本申请实施例提供的一种语音交互方法的流程示意图,该流程包括:
S201、终端检测到语音信号。
S202、终端将语音信号转换为文本,并根据文本进行意图识别,得到文本的意图识别结果。
在步骤S201-步骤S202中,在本申请的一些示例中,终端可以安装用于提供语音交互能力的语音交互类应用。例如,手机上的语音助手或虚拟助理等应用,可以根据拾取的用户语音为用户提供相应的服务(在语音交互中也称为技能)。该技能可以操作手机上的功能,或者向服务器(第三方技能提供商)请求相关的服务等。又例如,车载终端的语音系统可以拾取驾驶员或乘客的语音,为驾驶员或乘客提供汽车的控制功能,车内影音娱乐播放功能,向服务器(第三方技能提供商)请求相关的服务等。又例如,智能音箱的语音软件可以拾取房间内用户的语音指令,执行用户语音指令,例如播放相关的音视频资源,通过智能音箱控制其他的智能家居设备等。
下文以车载终端安装的语音系统为例进行说明。示例性的,图3示出了一种语音系统的软件结构图。这里结合图3所示的语音系统的软件结构进行说明。
当接收到用户说出的唤醒词后,车载终端开启拾音装置(如麦克风)拾取车内的声音。或者,车载终端的语音系统支持免唤醒,那么车载终端一直开启拾音装置(如麦克风)拾取车内的声音。拾取的声音输入声音预处理模块进行预处理,例如包括语音信号的采样、反混叠滤波、语音增强等。处理后的语音信号输入自动语音识别模块,自动语音识别模块将语音信号转换为文本。文本输入到自然语言理解模块进行意图识别,得到意图识别结果,意图识别结果包括针对文本识别出的意图和槽位。可以理解的是,意图识别过程实质上是一个分类结果,因此意图识别结果还包括文本对应各个意图的概率分布。
可以理解的,在终端拾取用户(例如驾驶员)的语音的过程中,终端可能拾取到周围人的说话声、车内其他电子设备播放的声音、环境的噪声等。也就是说,终端可能拾取到非目标用户的语音输入,非目标用户的语音输入会对后续识别出的用户意图造成干扰,影响语音系统执行用户指令的准确性。为此,在一些技术方案中,终端在将语音信号识别为文本后,还会对文本进行无效文本的拒识处理,即将识别的文本输入到受话人识别(Addressee Detection,AD)模块(也称为拒绝识别模块,简称拒识模块)。受话人识别模块输出二分类结果,即文本是否为语音系统的拒识对象。当识别出文本为拒识对象(即语音信号的受话人不是车载终端的语音系统)时,车载终端的语音系统将不相应该文本的意图。当识别出文本不为拒识对象(即语音信号的受话人是车载终端的语音系统)时,车载终端的语音系统才执行该文本的意图等。由此,针对文本进行拒识处理后,有利于降低语音系统误识别的概率,提升语音系统的处理效率以及正确性。但可以注意到,该技术方案是在语音系统对语音信号转换后的文本的基础上进行拒识处理的,因此语音系统的语音识别的能力直接影响拒识处理的准确 率。
因此,本申请实施例还提供了一种技术方案,识别出的受话人识别结果不是简单的二分类结果(即是否是拒识对象),而是包含多个语音信号的特征,包括但不限于语音信号的来源、语音信号的对象和语音信号的主题。可以理解的,多个语音信号的特征能够提升语音识别的准确率。即,执行下述步骤S203以及后续步骤。
S203、终端根据语音信号、文本、意图识别结果中的一项或多项,确定受话人识别结果。其中受话人识别结果包括多个语音信号的特征,例如语音信号的来源(from)、语音信号的对象(to)和语音信号的主题(subject)。
示例性的,语音信号的来源包括但不限于用户,扬声器,环境。其中,当语音信号的来源为用户时,该语音可以被确认为是人声。当语音信号的来源为扬声器时,该语音可以被确认为电子设备发出的声音,为非人声。当语音信号的来源为环境时,该语音可以被确认为噪音。可以理解的是,识别出语音信号的来源有利于区分语音是否是用户发出的,有利于区分语音是否是语音系统的拒识对象。在一些示例中,用户的用户来源还可以包括驾驶员、副驾、后排1的用户、后排2的用户。那么,语音信号的来源还有利于区分发出语音的具体用户,后续可以针对不同的用户执行不同的响应方式等。
语音信号的对象包括但不限于语音系统、用户、环境。当语音信号的对象为语音系统时,该语音可以认为是人机交互的内容。当语音信号的对象为用户时,该语音可认为是用户间的交谈,为语音系统的拒识对象。或者,在其他一些示例中,语音系统如支持插话功能,则在该场景中,该语音不为语音系统的拒识对象。当语音信号的对象为环境时,该语音可认为是用户的自言自语或吟唱等,为语音系统的拒识对象。可以理解的是,识别出语音对话有利于区分语音的受话人是否为语音系统,也有利于区分语音是否是语音系统的拒识对象。
语音信号的主题包括任务和无意义类。其中,任务,是指语音包含了用户希望语音系统执行技能。无意义类,是指语音未包含用户希望语音系统执行技能,即用户无需语音系统执行技能。一些示例中,还可以根据任务的类型将任务进一步划分为执行类任务、闲聊任务、百科类任务。可选的,任务还可以包括方言类任务。另一些示例中,还可以根据用户的情感将任务划分为不同情感对应的任务,例如,兴奋的情感对应的播放欢快类音乐的任务;紧张的情感对应播放舒缓类的轻音乐的任务等。
需要说明的是,这里提取语音信号的主题的特征,可以根据语音本身的含义进行识别和提取的,不依赖语音系统的自动语音识别模块识别的文本,也不依赖自然语言理解模块对文本的意图识别,因此,这里提取语音信号的主题的能力,不依赖语音系统的自动语音识别模块和自然语言理解模块的识别能力。
示例性的,这里继续结合图3所示的语音系统的软件结构进行说明。声音预处理模块输出的语音信号在输入到自动语音识别模块外,还输入到受话人识别模块。受话人识别模块用于识别出语音的特征,包含语音信号的来源、语音信号的对象以及语音信号的主题。可选的,自动语音识别模块识别出的文本也可以输入到受话人识别模块,也用于识别出语音的特征。可选的,自然语言理解模块对文本进行意图识别后的意图识别结果,也可以通过对话管理模块输入到受话人识别模块,用于识别出语音的特征。 可选的,对话管理模块还可以将该语音的上下文也输入到受话人识别模块中,用于识别出语音的特征。可选的,语音系统还可以启动摄像头采集用户的图像,图像经过图像预处理模块后输入到受话人识别模块,受话人识别模块还可以基于图像的信息,识别出该语音的特征。其中,图像的信息包括但不限于车内乘客的数量、人脸朝向,人物动作等,可以理解,语音系统可以基于车内乘客的数量、人脸朝向、人物动作等识别当前说话的人,是否和其他人交谈,是否正在打电话,是否播放电子设备等,用于识别语音的特征。可选的,受话人识别模块还可以基于传感器采集的数据(如乘客的数量,车速等)识别语音的特征。
由此可见,本申请实施例提供了受话人识别模块对输入的多模态的数据(例如:语音、文本、意图识别结果、对话上下文、图像数据、传感器数据等)进行识别,识别出语音的特征,提升识别准确率。
S204、终端请求执行意图识别结果中的意图。
需要说明的是,本申请实施例中并不限制上述步骤S202至步骤S204的执行顺序,可以理解的是,上述步骤S202至步骤S204可以顺序执行,也可以并行执行,或者部分步骤顺序执行,部分步骤并行执行。例如,终端在执行步骤S202中的语音信号转换为文本的同时,也可以同时执行步骤S203中对语音信号执行受话人识别的步骤。又例如,终端在执行完步骤S202中根据文本进行意图识别的同时,也可以执行步骤S203中对文本确定受话人识别的步骤。又例如,终端在执行完步骤S202,得到意图识别结果后,执行步骤S203中的根据意图识别结果执行受话人识别的步骤,同时终端也可以执行步骤S204的执行意图识别结果中的意图。总而言之,在上述步骤不矛盾的前提下,上述步骤S202至步骤S204的执行顺序可以进行变换。
S205、终端根据受话人识别结果,以及意图执行结果,确定语音信号的响应方式。
其中,意图执行结果包括成功执行意图和未成功执行意图。需要说明的是,一些示例中,这里的意图执行结果可以是终端请求执行意图识别结果中的意图后,终端自身已执行该意图的结果;或者,终端请求其他设备执行该意图后,其他设备向该终端反馈的执行结果。也就是说,在终端确定语音信号的响应方式之前,该意图已经执行。另一些示例中,这里的意图执行结果也可以终端根据意图判断自身或者其他设备(如服务器)是否支持执行该意图对应的技能,该判断结果即为意图执行结果。也就是说,在终端确定语音信号的响应方式之前,该意图没有被执行。
在现有技术中,若文本的意图识别结果包括意图,即语音系统识别出意图,语音系统从其支持的技能中查找该具体意图对应的技能。当查找到相应的技能时,语音系统请求执行该技能,或者语音系统向车载终端中其他系统请求执行该技能,语音系统还可以向车载终端之外的其他设备(例如服务器)请求执行该技能。而后,语音系统向用户反馈该技能的执行结果。该场景中,意图执行结果包括成功执行意图和未成功执行意图。可以理解,可能存在多种情况造成语音系统向用户反馈“未成功执行意图”。例如,语音系统的自然语言理解模块识别出用户的意图,但语音系统并不支持执行该意图对应的技能;或者,语音系统的自然语言理解模块识别出用户的意图,且向服务器请求执行该意图对应的技能,但服务器无响应,或者服务器执行技能出现错误(例如缺少槽位信息等)。若文本的意图识别结果包括无意图,即语音系统未识别出意图, 语音系统向用户反馈“未成功执行意图”。可以理解,可能存在多种情况造成语音系统向用户反馈“未成功执行意图”,例如:语音系统的自然语言理解模块未识别出用户的意图;又或者,语音系统采集的语音中本身不包含用户的意图等。综上可见,现有技术中,语音系统不区分具体的应用场景,向用户反馈统一的“未成功执行意图”(即“无结果”),会给用户造成语音回复不准确、不自然、不智能的感受,造成用户的语音交互体验不佳。
为此,本申请实施例还给出了语音系统的另一种响应方法,结合识别出的受话人识别结果(例如语音信号的来源、语音信号的对象以及语音信号的主题)和意图的执行结果,确定不同的响应方式,向用户反馈不同的响应结果。可以理解的,当语音系统识别出语音信号的多个特征后,有利于语音系统识别出更细分的应用场景,有利于语音系统根据更细分的应用场景给出不同的响应结果,从而提升人机交互的智能化、语音回复的自然流畅,提升语音交互体验。
示例性的,这里继续结合图3所示的语音系统的软件结构进行说明。自然语言理解模块输出意图识别结果,其中,意图识别结果包括识别出的意图或无意图。意图识别结果输入到对话管理模块,对话管理模块输出意图的执行结果。例如,当意图识别结果包括意图时,对话管理模块执行该意图,并向用户反馈意图执行结果,意图执行结果包括成功执行意图和未成功执行意图。当意图识别结果为无意图时,意图执行结果是未成功执行意图。进一步的,对话管理模块将确定的意图执行结果输入到自然语言生成模块,并且受话人识别模块将受话人识别结果输入到自然语言生成模块。自然语言生成模块根据意图执行结果以及受话人识别结果中多个语音信号的特征确定该语音确定最终的响应方式。
在一个具体的实现方式中,车载终端采用基于规则和语料模板的方法实现不同应用场景下的响应方式不同。也就是说,在规则中为不同的应用场景设置不同的响应方式,可选的,还可以为不同的应用场景设置不同的语料模板。其中,语料模板用于语音系统向用户播放语音的执行结果,也可称为播放模板,或者,语料模板用于语音系统采用图形界面(包括文字内容)的形式向用户呈现语音的执行结果。
如表一所示,为受话人识别结果(包括语音信号的来源、语音信号的对象、语音信号的主题)、意图执行结果、响应方式、以及播放模板的对应关系的一个示例。
表一

当自然语音生成模块获取到受话人识别结果(包括语音信号的来源、语音信号的对象、语音信号的主题)和意图执行结果后,可以以受话人识别结果和意图执行结果为关键字,在表一中查找相对应的响应方式以及播放模板。语音系统按照查找到的响应方式执行相关操作,且,语音系统解析该场景对应的播放模板,播放模板中包括占位符和文字,根据本次的语音、本次语音的上下文填充占位符的内容,填充后的内容和播放模板中原有的文字组合成最终的播放文本。语音系统播放该播放文本。
例如,当语音为用户1向语音系统请求的执行任务时,若意图执行结果为未成功执行意图,则采用兜底的响应方式,播放“用户1,语音系统还无法完成您的任务请求,请给我学习时间吧!”。或者,在确定意图执行结果为未成功执行意图后,语音系统还可以进一步确定任务是否对应语音系统能够处理的技能。当确定任务不对应语音系统能够处理的某项技能,那么语音系统采用兜底的响应方式。当确定任务对应语音系统能够处理的某项技能,那么语音系统采用失败提示的响应方式,比如提示“用户1,语音系统未成功执行您的任务的请求。”又例如,语音为用户1向语音系统发送无意义的语音,且意图执行结果为未成功执行意图,则采用请求澄清的响应方式,播放“用户1,语音系统收到您的请求,请换个方式再说一遍吧!”又例如,当语音为用户1向语音系统请求的执行任务时,语音执行相应的意图。若意图执行结果为成功执行意图,并播放“用户1,语音系统已完成任务!”。
当语音的来源为非用户,或者,语音的对象为非语音系统时,则不响应该语音。 例如,当语音系统接收到的语音来自用户1,语音信号的对象为用户2,语音信号的主题为任务,意图执行结果为未成功执行意图,则语音系统不响应。又例如,当语音系统接收到的语音来自用户1,但语音的对象为空气,语音信号的主题为任务,意图执行结果为未成功执行意图,则语音系统不响应。又例如,当语音系统接收到的语音来自电子设备,但语音的对象为空气,语音信号的主题为任务,意图执行结果为未成功执行意图,则语音系统不响应。
由此可见,受话人识别结果(语音的来源,对象和主题)有利于区分语音的不同场景,有利于提升拒绝识别场景(例如人人对话,电子设备播放声音、用户的自言自语等场景)的识别率。另外,基于不同场景,语音系统可以提供不同的响应方式,按照不同的播放模板播放不同的内容。例如,针对未成功执行意图的情况,语音系统可以通过受话人识别结果区分具体的情况,通过不同的播放内容向用户提供更多的信息,提升语音系统的交互的智能化,使得人机交互更加自然。
在其他一些示例中,还可以基于不同场景的需求,设置不同的语音信号的主题。例如,将任务进一步划分为执行类任务、闲聊任务、百科类任务。如表二所示,为受话人识别结果(包括语音信号的来源、语音信号的对象、语音信号的主题)、意图执行结果、响应方式、以及播放模板的对应关系的另一个示例。
表二
例如,当语音为用户1向语音系统请求的执行类任务时,若意图执行结果为成功执行意图,则播放“用户1,语音系统已完成您的任务请求”。当语音为用户1向语音系统请求的执行类任务时,若意图执行结果为未成功执行意图,则采用兜底的响应方式,播放“用户1,语音系统还无法完成您的任务请求,请给我学习时间吧!”。
又例如,当语音系统接收到的语音来自用户1,语音信号的对象为用户2,语音信号的主题为执行类任务,意图执行结果为未成功执行意图,则语音系统不响应。又例如,当语音系统接收到的语音来自用户1,语音信号的对象为用户2,语音信号的主题为闲聊任务,意图执行结果为未成功执行意图,则语音系统可以进行插话。比如,语音系统可以加入用户1和用户2的对话,实现人(用户1)-人(用户2)-机(语音系统)的智能交互,提升用户的语音交互。或者,语音系统还可以根据用户1和用户2的交谈内容,执行相关的技能。比如,用户1和用户2商量去某个旅游景点游玩,则语音系统可以询问是否需要查询该旅游景点的天气、车票、旅游攻略等信息。
又例如,当语音系统接收到的语音来自用户1,但语音的对象为空气,语音信号的主题为闲聊任务,意图执行结果为未成功执行意图,则语音系统也可以进行插话。或者,语音系统还可以根据用户1闲聊的内容,询问是否执行相关的技能。
由此可见,当基于受话人识别结果(语音的来源,对象和主题)区分出细分的场景后,语音系统可以提供更加丰富的功能,提升了语音系统的人机交互的智能化。
在另一个具体实现的方式中,车载终端还可以基于机器学习的方法实现不同应用场景下的响应方式不同。也即,采用机器学习的方法训练出自然语言生成模块的模型。具体的,使用一个预训练的语言模型作为编码器,例如,使用基于转换器的双向编码表征(Bidirectional Encoder Representation from Transformers,BERT)模型作为编码器。将大量训练样本输入到该编码器中进行训练,例如,采用自回归方式进行训练,得到自然语言生成模块。其中,训练样本可以为开发人员收到编写的或基于一定规则由机器生成的语料-响应的内容。其中,语料为语音,并且开发人员对每条语音进行标注,标注的内容包括该语音的来源、对象、主题、语音转换为文本、识别出的意图、槽位、意图执行结果、以及响应方式。开发人员可以手动进行标注,也可以将语音输入到图3所示的模型中,由受话人识别模块识别出语音的来源、对象、主题的特征;由自动语音识别模块将语音转换为文本;由自然语言理解模块识别出文本的意图和槽位;由对话管理模块输出意图执行结果;并确定希望的响应方式和播放内容,对语音进行标注。可以理解的是,通过对上述训练样本进行训练得到的自然语言生成模块可以实现不同应用场景下的响应方式(和播放内容)不同。如图4所示,为训练后得到的自然语言生成模块的示例。当向自然语言生成模块输入语音的来源、对象、主题、语音转换为文本、识别出的意图、槽位、意图执行结果等参数后,自然语言生成模块运行后输出响应的响应方式和播放内容。可见,利用预训练的语言模型的能力可实现更多样更灵活的响应方式以及播放内容,提升人机交互的智能化。
下面,对语音系统中的受话人识别模块的实现进行详细说明。
在一些示例中,受话人识别模块包括语音受话人识别(Sound-based Addressee Detection,SAD)模型。也就是说,语音受话人识别模型接收声音预处理模块处理后的语音信号,对语音信号进行识别,识别出语音信号的多个特征,如语音信号的来源、 语音信号的对象和语音信号的主题。在一个具体实现方式中,如图5中(1)所示,语音受话人识别模型包括语音识别模型,该语音识别模型例如为Transformer语音识别模型,或者,更具体的,为基于卷积增强的Transformer语音识别模型(Convolution-augmented Transformer for Speech Recognition,Conformer)。在训练语音识别模型时,可以将训练样本输入到预训练的模型中进行训练。其中,训练样本包括语音,以及对语音标注的来源、对象以及主题等。需要说明的是,标注人员可以根据语音本身的含义标注语音的来源、对象以及主题。后续,训练好的语音识别模型可以对输入的语音进行推理,推理出该语音的来源、对象以及主题等特征。可以理解的,这里提取语音信号的主题的特征,是根据语音本身的含义进行识别和提取的,不依赖语音系统的自动语音识别模块识别的文本,也不依赖自然语言理解模块对文本的意图识别,因此,这里提取语音信号的主题的能力,不依赖语音系统的自动语音识别模块和自然语言理解模块的识别能力。
在另一个具体实现方式中,如图5中(2)所示,语音受话人识别模型可以包括语音识别模型(例如Transformer语音识别模型)和集成学习模型。其中,在训练语音识别模型时,可以将训练样本输入到预训练的模型中。其中,训练样本包括语音,以及对语音标注的对话分类,该对话分类例如包括人人对话、人机对话、电子音(即电子设备播放声音)、噪声以及未知声音中多项。需要说明的是,标注人员可以根据语音本身的含义标注的对话分类。后续,训练好的语音识别模型可以对输入的语音进行推理,推理出该语音对应的各个对话分类的概率分布。在训练集成学习模型时,可以将训练样本输入到预训练的集成学习模型中。其中,训练样本包括语音对应的各个对话分类的概率分布,以及标注的语音的来源、对象和主题等。训练好的集成学习模型可以对输入的语音的对话分类的概率分布进行推理,推理出该语音的来源、对象和主题等。也就是说,当语音输入到语音受话人识别模型中后,通过语音识别模型和集成学习模型的推理,可以得到该语音的来源、对象以及主题等特征。
在又一些示例中,受话人识别模块包括文本受话人识别(Text-to-Speech Addressee Detection,TAD)模型。也就是说,如图6中(1)所示,文本受话人识别模型接收自动语音识别模块转换后的文本,通过对文本进行识别,识别出语音信号的多个特征,如语音信号的来源、语音信号的对象和语音信号的主题。在一个具体实现方式中,如图6中(2)所示,文本受话人识别模型包括文本识别模型,该文本识别模型包括拼接模块、BERT编码器和解码器。其中,拼接模块,用于将语音转换后文本和预设模板进行拼接。该预设模板包括多个提示符,一个提示符对应一个语音的特征。例如,语音信号的来源、对象和主题。例如,拼接后的内容为“【语音信号的来源】对【语音信号的对象】说【语音信号的主题】:语音转换后的文本”。在训练文本识别模型时,可以将训练样本输入到预训练的模型中进行训练。其中,训练样本包括语音转换后的文本,以及对文本标注的语音的来源、对象以及主题等。训练好的文本识别模型可以对语音转换后的文本进行推理,推理出该语音的来源、对象以及主题等特征。
在又一些示例中,如图6中(2)所示,受话人识别模型包括语音受话人识别模型、文本受话人识别模型以及集成学习模型。其中,语音受话人识别模型可以参考上述图5中(1)所示的语音识别模型,文本受话人识别模型可以参考上述图6中(1)和(2) 所示的文本识别模型,这里不再赘述。集成学习模型可以融合语音受话人识别模型和文本受话人识别模型的识别结果,最终输出语音的来源、对象以及主题等特征。
需要说明的是,由于语音和文本的数据特征不同,语音为流式数据,是一组顺序、大量、快速、连续到达的数据序列。文本为非流式数据。因此,在融合语音受话人识别模型的识别结果和文本受话人识别模型的识别结果时,可以采用语音激活检测(Voice Activity Detection,VAD)方法,将语音流切断为多个语音片段。针对每一个语音片段输入到语音受话人识别模型进行识别,并将该语音片段对应的文本输入到文本受话人识别模型的识别结果,以达到对齐语音和文本的效果,将相对应的两个识别结果进行融合。
在一个具体的实现方式中,如图7中(1)所示,语音系统包括一个语音激活检测模块。在接收到语音后,语音激活检测模块将音频流切断为多个语音片段。然后,将各个语音片段分别输入到语音受话人识别模型中进行识别,输出各个语音片段对应的对话分类的概率。同时,将各个语音片段转换后的文本输入到文本受话人识别模型中进行识别,输出各个语音片段对应的语音信号的来源、对象和主题等。可以注意到,此时语音受话人识别模型处理语音片段,与文本受话人识别模型处理的文本已对齐。而后,将两个模型输出的识别结果输入到集成学习模型进行推理,得到融合后的语音的来源、对象和主题等。
在另一个具体的实现方式中,如图7中(2)所示,语音受话人识别模型中包括语音激活检测模块。具体的,当语音受话人识别模型接收到语音流后,其中的语音激活检测模块将音频流切断为多个语音片段,并将断句点发送给文本受话人识别模型。断句点用于触发文本受话人识别模型对相应的文本内容进行识别。此时语音受话人识别模型处理语音片段,与文本受话人识别模型处理的文本已对齐。而后,将两个模型输出的识别结果输入到集成学习模型进行推理,得到融合后的语音的来源、对象和主题等。
在又一些示例中,如图8中(1)或图8中(2)所示,受话人识别模型还包括意图受话人识别模型。意图受话人识别模型包括意图映射模块,用于将自然语言理解模块输出的语音的意图概率分布映射为有意图概率和无意图概率。例如:语音1输入到自然语言理解模块后,得到语音1的意图概率分布为:意图1的概率为概率1,意图2的概率为概率2,意图3的概率为,无意图的概率为概率4。那么,意图映射模块映射后的有意图的概率为概率1+概率2+概率3;无意图的概率为概率4。可见,意图受话人识别模型有利于提升识别出的语音信号的主题。而后,集成学习模型可以融合语音受话人识别模型的识别结果、文本受话人识别模型的识别结果以及意图受话人识别模型的识别结果,最终输出语音的来源、对象以及主题等特征。可选的,在一些其他示例中,对话管理模块还可以将该语音的上下文输入到意图受话人识别模型中,用于辅助输出语音的来源、对象以及主题等特征。
可以理解的是,仍然可以采用语音激活检测模块,对齐意图受话人识别模型处理的文本,语音受话人识别模型处理的语音以及文本受话人识别模型处理的文本。具体对齐方式可参考前文中语音受话人识别模型处理的语音以及文本受话人识别模型处理的文本对齐方法,这里不再赘述。
需要说明的是,本申请实施例对受话人识别模块的具体实现不做具体限定。例如,受话人识别模块还可以包括图像受话人识别模块,用于通过用户的图像识别该图像对应的语音的特征。受话人识别模块还可以包括上述各个子模型(语音识别模型、文本识别模型、意图受话人识别模型、图像是被模型等)的任意组合。例如,受话人识别模块可以包括语音受话人识别模型和意图受话人识别模型,或者,包括文本受话人识别模型和意图受话人识别模型等。
综上,本申请实施例通过受话人识别模块对输入的多模态的数据(例如:语音、文本、意图识别结果、对话上下文、图像数据、传感器数据等)进行识别,且识别结果包括多个语音信号的特征。可以理解的,多个语音信号的特征有利于提升受话人识别准确率,还有利于区分更多的应用场景,便于语音系统根据更细分的应用场景给出不同的响应结果,从而提升人机交互的智能化、语音回复的自然流畅,提升语音交互体验。
本申请实施例还提供一种芯片系统,如图9所示,该芯片系统包括至少一个处理器1101和至少一个接口电路1102。处理器1101和接口电路1102可通过线路互联。例如,接口电路1102可用于从其它装置(例如终端100的存储器)接收信号。又例如,接口电路1102可用于向其它装置(例如处理器1101)发送信号。示例性的,接口电路1102可读取存储器中存储的指令,并将该指令发送给处理器1101。当所述指令被处理器1101执行时,可使得终端执行上述实施例中的终端100(比如,手机)执行的各个步骤。当然,该芯片系统还可以包含其他分立器件,本申请实施例对此不作具体限定。
本申请实施例还提供一种装置,该装置包含在终端中,该装置具有实现上述实施例中任一方法中终端行为的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。硬件或软件包括至少一个与上述功能相对应的模块或单元。例如,检测模块或单元、显示模块或单元、确定模块或单元、以及计算模块或单元等。
本申请实施例还提供一种计算机存储介质,包括计算机指令,当计算机指令在终端上运行时,使得终端执行如上述实施例中任一方法。
本申请实施例还提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行如上述实施例中任一方法。
可以理解的是,上述终端等为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请实施例能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明实施例的范围。
本申请实施例可以根据上述方法示例对上述终端等进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本发明实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (12)

  1. 一种语音交互方法,其特征在于,所述方法包括:
    检测到语音信号;
    将所述语音信号转换为文本,并对所述文本进行意图识别,得到意图识别结果;
    根据所述语音信号、所述文本、所述意图识别结果中的一项或多项,确定受话人识别结果,所述受话人识别结果包括所述语音信号的来源、对象和主题;
    根据所述受话人识别结果,以及意图执行结果,确定所述语音信号的响应方式。
  2. 根据权利要求1所述的方法,其特征在于,所述语音信号的来源包括用户、扬声器或电子设备、环境中一项;所述语音信号的对象包括语音系统、用户、环境中一项;所述语音信号的主题包括任务或无意义。
  3. 根据权利要求2所述的方法,其特征在于,所述任务还包括:执行类任务、闲聊任务、百科类任务、方言类任务中一项或多项。
  4. 根据权利要求2或3所述的方法,其特征在于,所述根据所述受话人识别结果,以及意图执行结果,确定所述语音信号的响应方式,包括:
    当所述语音信号的来源为用户,所述语音信号的对象为语音系统,所述语音信号的主题为任务,所述意图识别执行结果为未成功执行意图时,发出第一提示,所述第一提示用于提示所述语音系统不支持执行所述语音信号的主题;所述第一提示包括语音信号的来源、所述语音信号的对象、以及所述语音信号的主题;
    或者,当所述语音信号的来源为用户,所述语音信号的对象为语音系统,所述语音信号的主题为无意义,所述意图识别执行结果为未成功执行意图时,发出第二提示;所述第二提示用于请求用户澄清,所述第一提示包括语音信号的来源、所述语音信号的对象、以及所述语音信号的主题;
    或者,当所述语音信号的来源为非用户,或者,所述语音信号的对象为非语音系统,确定不响应所述语音信号。
  5. 根据权利要求3所述的方法,其特征在于,所述根据所述受话人识别结果,以及意图执行结果,确定所述语音信号的响应方式,包括:
    当语音信号的来源为用户,且语音信号的对象为另一个用户,语音信号的主题为闲聊任务时,发出第三提示,所述第三提示用于询问是否执行与所述语音信号关联的第一技能;
    或者,当语音信号的来源为用户,且语音信号的对象为空气,语音信号的主题为闲聊任务时,发出第四提示,所述第四提示用于询问是否执行与所述语音信号关联的第二技能;所述第二技能与所述第一技能相同或不同。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述根据所述受话人识别结果,以及意图执行结果,确定所述语音信号的响应方式,包括:
    基于预设规则,查询所述受话人识别结果以及意图执行结果对应的响应方式;所述规则中所述受话人识别结果或意图执行结果不同时,对应的响应方式不同;
    或者,将所述受话人识别结果以及意图执行结果输入到预先训练好的响应模型中进行推理,得到所述语音信号的响应方式。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述根据所述语音信号、 所述文本、所述意图识别结果中的一项或多项,确定受话人识别结果,包括:
    将所述语音信号输入到语音识别模型中进行推理,得到所述语音信号对应的对话分类,所述对话分类包括人人对话、人机对话、电子音、噪声和未知声音中的一项;
    将所述文本输入到文本识别模型中进行推理,得到所述语音信号的来源初值、语音信号的对象初值、以及语音信号的主题初值;
    将所述语音信号对应的对话分类、所述语音信号的来源初值、语音信号的对象初值、以及语音信号的主题初值输入到第一集成学习模型中进行推理,得到所述语音信号的来源、所述语音信号的对象、以及所述语音信号的主题。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述语音信号、所述文本、所述意图识别结果中的一项或多项,确定受话人识别结果,还包括:
    将所述语音信号输入到语音识别模型中进行推理,得到所述语音信号对应的对话分类,所述对话分类包括人人对话、人机对话、电子音、噪声和未知声音中的多项;
    将所述文本输入到文本识别模型中进行推理,得到所述语音信号的来源初值、所述语音信号的对象初值、以及所述语音信号的主题初值;
    根据所述意图执行结果中所述文本对应的各个意图的概率分布,映射为所述文本的有意图的概率和无意图的概率;
    将所述文本的有意图的概率和无意图的概率,所述语音信号对应的对话分类、所述语音信号的来源初值、所述语音信号的对象初值、以及所述语音信号的主题初值输入到第二集成学习模型中进行推理,得到所述语音信号的来源、所述语音信号的对象、以及所述语音信号的主题。
  9. 一种终端,其特征在于,包括:处理器、存储器和触摸屏,所述存储器、所述触摸屏与所述处理器耦合,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述处理器从所述存储器中读取所述计算机指令,以使得所述终端执行如权利要求1-8中任一项所述的语音交互方法。
  10. 一种计算机可读存储介质,其特征在于,包括计算机指令,当所述计算机指令在终端上运行时,使得所述终端执行如权利要求1-8中任一项所述的语音交互方法。
  11. 一种芯片系统,其特征在于,包括一个或多个处理器,当所述一个或多个处理器执行指令时,所述一个或多个处理器执行如权利要求1-8中任一项所述的语音交互方法。
  12. 一种语音交互系统,其特征在于,包括一个或多个处理单元,当所述一个或多个处理单元执行指令时,所述一个或多个处理单元执行如权利要求1-8中任一项所述的语音交互方法。
PCT/CN2023/096683 2022-06-01 2023-05-26 一种语音交互方法及终端 WO2023231936A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210629293.4A CN117198286A (zh) 2022-06-01 2022-06-01 一种语音交互方法及终端
CN202210629293.4 2022-06-01

Publications (1)

Publication Number Publication Date
WO2023231936A1 true WO2023231936A1 (zh) 2023-12-07

Family

ID=88994861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096683 WO2023231936A1 (zh) 2022-06-01 2023-05-26 一种语音交互方法及终端

Country Status (2)

Country Link
CN (1) CN117198286A (zh)
WO (1) WO2023231936A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150066882A (ko) * 2013-12-09 2015-06-17 포항공과대학교 산학협력단 다중 사용자 기반의 대화 처리 방법 및 이를 수행하는 장치
US20180293221A1 (en) * 2017-02-14 2018-10-11 Microsoft Technology Licensing, Llc Speech parsing with intelligent assistant
CN109360557A (zh) * 2018-10-10 2019-02-19 腾讯科技(北京)有限公司 语音控制应用程序的方法、装置和计算机设备
CN109920436A (zh) * 2019-01-28 2019-06-21 武汉恩特拉信息技术有限公司 一种提供辅助服务的装置及方法
US20200050426A1 (en) * 2018-08-08 2020-02-13 Samsung Electronics Co., Ltd. Feedback method and apparatus of electronic device for confirming user's intention
KR20210074649A (ko) * 2019-12-12 2021-06-22 서울대학교산학협력단 음향정보와 텍스트정보를 이용하여 자연어 문장에서 응대 여부를 판단하는 음성인식 방법

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150066882A (ko) * 2013-12-09 2015-06-17 포항공과대학교 산학협력단 다중 사용자 기반의 대화 처리 방법 및 이를 수행하는 장치
US20180293221A1 (en) * 2017-02-14 2018-10-11 Microsoft Technology Licensing, Llc Speech parsing with intelligent assistant
US20200050426A1 (en) * 2018-08-08 2020-02-13 Samsung Electronics Co., Ltd. Feedback method and apparatus of electronic device for confirming user's intention
CN109360557A (zh) * 2018-10-10 2019-02-19 腾讯科技(北京)有限公司 语音控制应用程序的方法、装置和计算机设备
CN109920436A (zh) * 2019-01-28 2019-06-21 武汉恩特拉信息技术有限公司 一种提供辅助服务的装置及方法
KR20210074649A (ko) * 2019-12-12 2021-06-22 서울대학교산학협력단 음향정보와 텍스트정보를 이용하여 자연어 문장에서 응대 여부를 판단하는 음성인식 방법

Also Published As

Publication number Publication date
CN117198286A (zh) 2023-12-08

Similar Documents

Publication Publication Date Title
WO2020221072A1 (zh) 一种语义解析方法及服务器
US20220310095A1 (en) Speech Detection Method, Prediction Model Training Method, Apparatus, Device, and Medium
CN110784830B (zh) 数据处理方法、蓝牙模块、电子设备与可读存储介质
CN111724775A (zh) 一种语音交互方法及电子设备
US20230089566A1 (en) Video generation method and related apparatus
CN109286725B (zh) 翻译方法及终端
US11636852B2 (en) Human-computer interaction method and electronic device
US20220116758A1 (en) Service invoking method and apparatus
WO2020057624A1 (zh) 语音识别的方法和装置
CN113488042B (zh) 一种语音控制方法及电子设备
WO2022143258A1 (zh) 一种语音交互处理方法及相关装置
WO2022042274A1 (zh) 一种语音交互方法及电子设备
WO2021190225A1 (zh) 一种语音交互方法及电子设备
CN114520002A (zh) 一种处理语音的方法及电子设备
CN114691839A (zh) 一种意图槽位识别方法
WO2023040658A1 (zh) 语音交互方法及电子设备
WO2023231936A1 (zh) 一种语音交互方法及终端
WO2022161077A1 (zh) 语音控制方法和电子设备
CN114090986A (zh) 一种公用设备上识别用户的方法及电子设备
WO2022135254A1 (zh) 一种编辑文本的方法、电子设备和系统
CN114999496A (zh) 音频传输方法、控制设备及终端设备
WO2024051611A1 (zh) 人机交互方法及相关装置
CN115910052A (zh) 智能语音交互处理方法以及移动终端
WO2023065854A1 (zh) 分布式语音控制方法及电子设备
WO2023005844A1 (zh) 设备唤醒方法、相关装置及通信系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23815119

Country of ref document: EP

Kind code of ref document: A1