WO2021217527A1 - In-vehicle voice interaction method and device - Google Patents

In-vehicle voice interaction method and device

Info

Publication number
WO2021217527A1
WO2021217527A1 (PCT/CN2020/087913)
Authority
WO
WIPO (PCT)
Prior art keywords
user
response content
privacy
output
voice
Prior art date
Application number
PCT/CN2020/087913
Other languages
English (en)
French (fr)
Inventor
黄佑佳
聂为然
高益
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20933148.7A priority Critical patent/EP4138355A4/en
Priority to PCT/CN2020/087913 priority patent/WO2021217527A1/zh
Priority to CN202080004874.8A priority patent/CN112673423A/zh
Publication of WO2021217527A1 publication Critical patent/WO2021217527A1/zh
Priority to US17/976,339 priority patent/US20230048330A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60RVEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R16/00Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R16/02Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements
    • B60R16/037Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R16/0373Voice control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/04Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0894Escrow, recovery or storing of secret information, e.g. secret key escrow or cryptographic key storage
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40Network security protocols
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/42Anonymization, e.g. involving pseudonyms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/84Vehicles

Definitions

  • the embodiments of the present application relate to the field of intelligent voice interaction, and in particular, to an in-vehicle voice interaction method and device.
  • Human-computer intelligent voice interaction is a major research direction in the field of human-computer interaction science and artificial intelligence, which is used to realize effective information transfer between humans and computers in natural language.
  • the user sends out a voice signal, and the device recognizes the voice and converts the voice into text.
  • the text is sent to a natural language understanding (NLU) module for semantic analysis to obtain user intent.
  • the feedback text can also be generated according to the user's intention parsed by the NLU module.
  • the natural language generation (NLG) module converts the feedback text into voice and plays the voice to the user, completing the human-machine intelligent voice interaction.
  • the embodiments of the present application provide an in-vehicle voice interaction method and device.
  • the device can provide differentiated feedback for privacy-related response content to protect privacy and security.
  • an in-vehicle voice interaction method includes: obtaining user voice information; wherein the user voice information may be an analog signal collected by an audio collection device (for example, a microphone array), or text information obtained by processing the collected analog signal.
  • the user instruction can also be determined according to the user's voice information; further, according to the user instruction, it is determined whether the response content to the user instruction involves privacy; according to whether the response content involves privacy, it is determined whether to output the response content through the privacy protection mode.
  • the embodiment of the present application provides an in-vehicle voice interaction method, which can provide differentiated feedback on user instructions of the user in different scenarios. In particular, it can identify response content involving privacy, give differentiated feedback on response content involving privacy, and output response content in a privacy protection mode to protect privacy as much as possible.
  • the method further includes: acquiring a user image.
  • the determination of the user instruction according to the user's voice information specifically includes: judging the user's gaze direction according to the user image; when the user's gaze direction is judged to be the target direction, determining that the user's intention is human-computer interaction; and determining the user instruction according to the user voice information uttered while the user's gaze direction is the target direction.
  • the user image may be an image captured by an image acquisition component (for example, a camera module) integrated into the smart device with which the user performs human-computer interaction, or may be an image captured by a camera in the car and then transmitted to the smart device.
  • the target direction can be a preset direction.
  • the direction may be a direction pointing to a certain device in the vehicle, for example, the target direction may be a direction toward a smart device; or the target direction may be a direction toward a collection device, for example, a target direction may be a direction toward a camera.
  • the user's gaze direction can be used to determine whether the user is performing human-computer interaction. If it is determined that the user's intention is to perform human-computer interaction, that is, the user voice information acquired by the smart device indeed needs to be processed and responded to by the smart device, then the subsequent steps are performed to determine the user instruction and to judge whether the response content involves privacy. In wake-up-free or long-term wake-up scenarios, this prevents chat voice between the user and other people from frequently triggering the smart device's response by mistake.
  • the determining whether to output the response content through the privacy protection mode according to whether the response content involves privacy is specifically: if it is determined that the response content involves privacy and the user is in a single-person scene, the response content is output in a non-privacy mode.
  • although the response content of the user instruction is judged to involve privacy, since the user is in a single-person scene there is no risk of privacy leakage, and the response content can be output in a non-privacy mode, for example, by outputting it through a public device in the vehicle.
  • in a third possible implementation manner of the first aspect, determining whether to output the response content through the privacy protection mode according to whether the response content involves privacy is specifically: if it is judged that the response content involves privacy and the user is in a multi-person scene, the response content is output through the privacy protection mode.
  • the response content of the user instruction involves privacy, and because the user is in a multi-person scene there is a risk of privacy leakage; the response content can therefore be output through the privacy protection mode, for example, by having a non-public device output it. A non-public device faces only the user, which effectively ensures that privacy is not leaked.
  • in a fourth possible implementation manner of the first aspect, determining whether to output the response content through the privacy protection mode according to whether the response content involves privacy is specifically: if it is judged that the response content involves privacy, the response content is output through the privacy protection mode.
  • the response content of the user instruction can be output through the privacy protection mode, for example, a non-public device outputs the response content of the user instruction.
  • Non-public devices are only for users, which can effectively ensure that privacy is not leaked.
  • the output of the response content through the privacy protection mode is specifically: when the response content is output through a public device, hiding the private content included in the response content; or, outputting the response content through a non-public device.
  • user instructions can be responded to in the above two ways, and while responding to user instructions, privacy leakage can also be effectively prevented.
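  • As an illustration of the decision logic above, the following is a minimal Python sketch. The function name and device labels are hypothetical, invented for illustration; this is not the patented implementation itself.

```python
def route_response(involves_privacy: bool, occupant_count: int) -> str:
    """Choose an output route for response content (illustrative only)."""
    if not involves_privacy:
        return "public_device"        # normal (non-privacy) output
    if occupant_count <= 1:
        return "public_device"        # private content, but single-person scene
    # Private content in a multi-person scene: privacy protection mode,
    # e.g. the driver's headset or head-up display, or a masked public output
    return "non_public_device"
```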
  • a device including: an acquiring unit, configured to acquire user voice information;
  • the processing unit is used to determine the user instruction according to the user's voice information; the processing unit is also used to determine whether the response content to the user instruction involves privacy according to the user instruction; determine whether to output the response content through the privacy protection mode according to whether the response content involves privacy.
  • the acquisition unit is further configured to acquire a user image; the processing unit is specifically configured to determine the gaze direction of the user based on the user image; when the gaze direction of the user is determined to be the target direction, determine that the user's intention is to perform human-computer interaction; and determine the user instruction according to the user voice information uttered when the user's gaze direction is the target direction.
  • the processing unit is specifically configured to determine that the response content involves privacy and that the user is in a single-person scenario, and to then output the response content through the non-privacy mode.
  • the processing unit is specifically configured to determine that the response content involves privacy and that the user is in a multi-person scenario, and to then output the response content through the privacy protection mode.
  • the processing unit is specifically configured to determine that the response content involves privacy, and to then output the response content through the privacy protection mode.
  • the processing unit is specifically configured to hide the private content included in the response content when the response content is output through a public device; or to output the response content through a non-public device.
  • a device in a third aspect includes at least one processor and a memory, where the at least one processor is coupled to the memory; the memory is used to store a computer program; and the at least one processor is used to execute the computer program stored in the memory, so that the apparatus executes the method described in the first aspect or any one of the possible implementation manners of the first aspect.
  • the device can be a terminal device or a server.
  • the terminal devices here include but are not limited to smart phones, vehicle-mounted devices (such as autonomous driving equipment), personal computers, artificial intelligence devices, tablets, personal digital assistants, smart wearable devices (such as smart watches or bracelets, smart glasses), intelligent voice devices (such as smart speakers), virtual reality/mixed reality/augmented reality devices, or network access devices (such as gateways), etc.
  • the server may include a storage server or a computing server.
  • this application discloses a computer-readable storage medium, including: instructions stored in the computer-readable storage medium; when the instructions run on the device described in the second aspect, any implementation manner of the second aspect, or the third aspect, the device is caused to execute the method described in the foregoing first aspect or any one of the implementation manners of the first aspect.
  • the present application provides a chip, including an interface and a processor, where the processor is configured to obtain a computer program through the interface and implement the method described in the aforementioned first aspect or any one of the possible implementations of the first aspect.
  • the present application provides a chip including a plurality of circuit modules, and the plurality of circuit modules are configured to implement the method described in the foregoing first aspect or any one of the possible implementation manners of the first aspect.
  • in some implementation manners, the multiple circuit modules, together with a software program, implement the method described in the foregoing first aspect or any one of the possible implementation manners of the first aspect.
  • Figure 1 is a human-machine voice interaction scenario provided by an embodiment of the application
  • Figure 2 is a structural block diagram of a smart device provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of a human-machine voice interaction scenario provided by an embodiment of the application.
  • FIG. 4 is a schematic flowchart of an in-vehicle voice interaction method provided by an embodiment of the application.
  • FIGS. 5-9 are schematic diagrams of in-vehicle voice interaction methods provided by embodiments of this application.
  • FIG. 10 is a schematic flowchart of a voice interaction method provided by an embodiment of this application.
  • FIG. 11 is another structural block diagram of a smart device provided by an embodiment of this application.
  • Fig. 12 is another structural block diagram of a smart device provided by an embodiment of the application.
  • the user's intention is used to describe the user's needs, purpose, and so on.
  • the user's intention is to perform human-computer interaction with the smart device, and the user can wake up the smart device through a wake-up word.
  • the user's intention is to perform human-computer interaction, which can be understood as the user sending instructions to the smart device in the form of voice, and the smart device is expected to respond to the user's instruction.
  • the user voice information may be an analog signal received by the device, or text information obtained by the device according to the analog signal.
  • a user instruction refers to an instruction initiated by the user and requiring a response from the smart device. For example, “open text messages”, “answer calls”, etc.
  • the method provided by the embodiments of the present application applies to a human-machine voice interaction scenario in a vehicle.
  • in this scenario, a user (for example, a driver) sends out a voice signal, and the smart device can receive the user's voice signal.
  • the smart device can also extract the user's voice information according to the user's voice signal, and determine the user's instruction based on the user's voice information, thereby responding to the user's instruction.
  • the user sends out a voice signal "play a song"
  • the smart device receives the voice signal and converts the voice signal into text information. It is also possible to perform semantic analysis on the text information, determine user instructions, and finally respond to user instructions, for example, run music playing software and play songs.
  • the working mode of the smart device includes a wake-up mode and a wake-up-free mode.
  • in the wake-up mode, the user needs to issue a wake-up word to wake up the smart device before the smart device receives the user's voice signal; in the wake-up-free mode, the smart device can receive the user's voice signal without being woken up by a wake-up word.
  • the smart device 10 includes an output module 101, an input module 102, a processor 103 and a memory 104.
  • the output module 101 may communicate with the processor 103 and output the processing result of the processor.
  • the output module 101 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector or Audio and so on.
  • the input module 102 can communicate with the processor 103, and can receive user input in a variety of ways.
  • the input module 102 may be a mouse, a keyboard, a touch screen device, a sensor device, a microphone array, or the like.
  • the processor 103 may be a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more programs for controlling the execution of the program of this application. integrated circuit.
  • the memory 104 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently, or it can be connected to the processor.
  • the memory can also be integrated with the processor.
  • the memory 104 can also access various cloud services and cloud service management modules through the network interface of the smart device.
  • the processor 103 may run a software module stored in the memory 104 to process the voice signal received by the input module 102, determine a user instruction, and respond to the user instruction through the output module 101.
  • the software modules stored in the memory 104 include an addressee detection (AD) module, a natural language generation (NLG) module, a text-to-speech (TTS) module, an automatic speech recognition (ASR) module, a dialogue management (DM) module, and so on.
  • the AD module is used to classify the voice received by the input module 102 to identify whether the voice is the voice that the user makes during human-computer interaction, that is, the voice that the user makes to the smart device.
  • the AD module can also filter out the voice made by the user during human-computer interaction, and input the voice made by the user during human-computer interaction into the ASR module.
  • the ASR module can convert the voice signal received from the AD module into text information, and can also input the text information into the DM module;
  • the DM module can determine the user instruction based on the text information received from the ASR module.
  • the DM module is also used for dialogue management, for example, to determine answers or feedback based on questions. Therefore, the DM module can also generate response content to user instructions. Among them, the response content of the user instruction may be text information.
  • the DM module can also input the response content of the user's instruction into the NLG module.
  • the NLG module is used to generate text information conforming to natural language habits according to the response content of the user instruction, and the text information can also be displayed through the output module 101.
  • the TTS module is used to convert the text information generated by the NLG module into voice, and the voice can also be played through the output module 101.
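  • To make the data flow among these modules concrete, here is a minimal Python sketch of the AD, ASR, DM, NLG, and TTS chain. The module interfaces (`is_addressed_to_device`, `transcribe`, and so on) are assumptions made for illustration, not the actual software interfaces described in this application.

```python
class VoicePipeline:
    """Illustrative wiring of the AD -> ASR -> DM -> NLG -> TTS modules."""

    def __init__(self, ad, asr, dm, nlg, tts):
        self.ad, self.asr, self.dm, self.nlg, self.tts = ad, asr, dm, nlg, tts

    def handle(self, audio):
        # AD module: keep only speech addressed to the device
        if not self.ad.is_addressed_to_device(audio):
            return None                          # chat between people: ignore
        text = self.asr.transcribe(audio)        # ASR: speech to text
        instruction = self.dm.parse(text)        # DM: semantic analysis
        response = self.dm.respond(instruction)  # DM: response content (text)
        sentence = self.nlg.realize(response)    # NLG: natural-language text
        return self.tts.synthesize(sentence)     # TTS: text to speech
```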
  • the vehicle may also include other devices.
  • the vehicle also includes a head-up display screen 20 in a driving position, an earphone 30 worn by the driver, a central control display screen 40, an in-car audio 50, a camera 60, and a micro speaker 70 in a driving position.
  • the smart device 10 can be integrated with the central control display 40, and the head-up display 20, the headset 30 worn by the driver, the in-car audio 50, and the camera 60 can exist independently.
  • Various devices in the vehicle can interact with each other.
  • the camera 60 can transmit the captured image to the smart device 10 for processing.
  • the equipment in the vehicle can be divided into public equipment and non-public equipment.
  • the content output by the public device is for most people, and most people can receive the content output by the public device. For example, most people can receive the voice played by the public device or the text and image displayed by the public device.
  • Non-public equipment is for designated persons (for example, drivers), and the designated persons can receive the content output by the non-public equipment.
  • the designated persons can receive the voice or displayed text and images played by the non-public equipment.
  • the public equipment can be the in-car audio 50 or the central control display 40 in the car; the non-public equipment can be the headset 30 worn by the driver, the micro-speaker 70 at the driving position, or the head-up display 20 at the driving position.
  • the feedback mode of the smart device has a great impact on the user experience.
  • Merely understanding the user's intention or responding to the user's instructions cannot make a differentiated response to the different scenes the user is in, and it will also cause a bad experience for the user.
  • the voice interaction solution between the device and the user has not paid much attention to this aspect, and most of the focus is still on the semantic understanding.
  • the feedback of the device to the user's voice usually only corresponds to the literal meaning of the user's instruction, and the difference in different scenarios is not considered.
  • the embodiment of the present application provides an in-vehicle voice interaction method, which can provide differentiated feedback on user instructions of the user in different scenarios. In particular, it can identify response content involving privacy, give differentiated feedback on response content involving privacy, and output response content in a privacy protection mode to protect privacy as much as possible.
  • the terminal device and/or the network device can perform some or all of the steps in the embodiments of the present application. These steps or operations are merely examples, and the embodiments of the present application may also perform other operations or variations of the various operations. In addition, the steps may be executed in an order different from that presented in the embodiments of the present application, and it may not be necessary to perform all the operations in the embodiments of the present application.
  • the embodiment of the present application provides an in-vehicle voice interaction method, which is suitable for the in-vehicle scene shown in FIG. 3, and the execution subject of the method may be the smart device 10 in the vehicle.
  • the method includes the following steps:
  • the input module 102 of the smart device can receive voice (that is, an analog signal).
  • the analog signal received by the input module 102 may be the user voice information described in the embodiments of the present application.
  • the input module 102 may input the received voice into the processor 103 of the smart device, and the processor 103 (for example, the ASR module) may obtain text information according to the analog signal; the text information may also be the user voice information described in the embodiments of the present application.
  • the input module 102 may be a microphone array; the microphone array may pick up the voice uttered by the user, and the user voice information may be the voice picked up by the microphone array.
  • the ASR module converts the analog signal into text information, and the text information can also be input into the DM module.
  • the DM module can perform semantic analysis on text information to determine user instructions.
  • the DM module can also generate response content to user instructions in accordance with natural dialog habits.
  • the response content of the user instruction generated by the DM module may be text information.
  • the DM module performs semantic analysis on the text information input by the ASR module to determine the slot of the user instruction.
  • the slot of the user instruction can be considered as the parameter of the user instruction.
  • the user instruction is: adjust the temperature of the air conditioner to 26 degrees, and "26 degrees" is the slot (or parameter) of the user instruction.
  • it can be judged whether the response content generated by the DM module includes private content; if the response content of the user instruction includes private content, it is determined that the response content to the user instruction involves privacy.
  • the memory 104 of the smart device may store a private content list, including at least one private content.
  • the processor 103 queries the private content list stored in the memory 104, and if the response content of the user instruction includes one or more private content in the private content list, it determines that the response content to the user instruction involves privacy.
  • for example, the private content related to WeChat is recorded as private content 1, and the private content related to the memo is recorded as private content 2.
  • the private content list may include private content 1 and private content 2.
  • if the response content of the user instruction includes private content 1 or private content 2, it is determined that the response content to the user instruction involves privacy.
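  • A minimal sketch of this lookup, assuming a simple substring match and illustrative list entries (the application does not specify the matching rule):

```python
PRIVATE_CONTENT_LIST = ["WeChat", "SMS", "memo", "schedule"]  # illustrative

def response_involves_privacy(response_text: str) -> bool:
    """True if the response content includes any listed private content."""
    lowered = response_text.lower()
    return any(item.lower() in lowered for item in PRIVATE_CONTENT_LIST)
```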
  • when the response content of the user instruction involves privacy, it is judged whether to output the response content of the user instruction through the privacy protection mode to protect user privacy.
  • if the response content of the user instruction does not involve privacy, the response content of the user instruction is output in a normal manner, for example, in a non-privacy mode.
  • if the processor 103 of the smart device determines that the response content of the user instruction involves privacy but the user is in a single-person scene, the response content is output in a non-privacy mode.
  • if the processor 103 of the smart device determines that the response content of the user instruction involves privacy and the user is in a multi-person scene, the response content is output through the privacy protection mode.
  • alternatively, once the processor 103 of the smart device determines that the response content of the user instruction involves privacy, the response content is output through the privacy protection mode.
  • the camera 60 in the vehicle can take a user image and send the user image to the smart device 10.
  • the processor 103 of the smart device 10 may also analyze and process the user image. If multiple person images are parsed in the user image, it is determined that the user's current scene includes multiple people, that is, the user is in a multi-person scene. If one person image is parsed in the user image, it is determined that the user is currently in a single person scene.
  • the processor 103 may use the YOLO algorithm to perform face target detection on the user image, and then determine the number of people in the scene, for example, the number of people in the car, according to the number of recognized face targets. According to the number of people in the scene, it is judged whether the user is in a single-person or multi-person scene.
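  • A sketch of this scene classification, where `detect_faces` stands in for a YOLO-style face detector returning one bounding box per detected face (the concrete model and its API are not specified by the application):

```python
def classify_scene(user_image, detect_faces) -> str:
    """Classify the cabin as a single-person or multi-person scene."""
    faces = detect_faces(user_image)   # list of face bounding boxes
    return "multi_person" if len(faces) > 1 else "single_person"
```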
  • the smart device can output the response content of the user instruction through the following two privacy protection modes.
  • here, "output" refers to the smart device presenting the response content of the user instruction to the user.
  • if the response content is text information, the response content can be displayed on the display screen; if the response content is voice, the response content can be played through the audio system.
  • the two privacy protection modes are as follows:
  • the smart device when the smart device outputs the response content through a public device, it hides the private content included in the response content.
  • the response content of the user's instruction can be output on the public device. Since the public device faces most people, it may lead to the leakage of user privacy. Therefore, when the public device outputs the response content of the user instruction, the private content included in the response content can be hidden.
  • outputting the response content of the user instruction through a public device may be displaying the response content of the user instruction through a public display screen (for example, a vehicle-mounted central control display), but private content needs to be hidden, for example, hiding key names, locations and other information.
  • hiding the private content may mean covering it with a special image (for example, a mosaic). It is also possible not to display the private content at all, to replace it with special characters, and to display only the content that does not involve privacy.
  • alternatively, when the response content of the user instruction is output through public equipment, it may be played through the public audio system (for example, the car audio), with the private content in the response content left unplayed, for example, hiding key names and locations.
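  • A sketch of hiding private content before public output. Masking with "**" follows the example later in this application; treating names and locations as the private terms, and the helper name itself, are illustrative assumptions.

```python
import re

def redact(text: str, private_terms: list[str]) -> str:
    """Replace each private term with '**' before public display or playback."""
    for term in private_terms:
        text = re.sub(re.escape(term), "**", text)
    return text

# redact("Bidding meeting of Company A at the High-tech Hotel",
#        ["Company A", "High-tech Hotel"])
# -> "Bidding meeting of ** at the **"
```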
  • the smart device outputs the response content through a non-public device.
  • the response content of the user instruction can be output through a non-public module. Because the non-public module faces only the user of the smart device (for example, the driver), outputting the response content of the user instruction through it protects the user's private content.
  • the output of the response content of the user instruction through the non-public module may be the display of the response content of the user instruction through the non-public display screen (for example, a head-up display in the driving position).
  • the response content of the user's instruction is played through a non-public audio system (for example, a headset worn by the driver).
  • the voice received by the input module 102 of the smart device has two possibilities: one is a voice signal the user really inputs to the device (that is, what the user says to the device); the other is chat voice between users, which is noise when the smart device determines the real user instruction.
  • in the wake-up mode, the voice signal received after the user wakes up the smart device through the wake-up word is considered valid.
  • the smart device receives the wake-up word sent by the user and receives the user's voice after waking up. Determine the user instruction according to the received user voice, and respond to the user instruction.
  • the received voice can be judged, and the voice uttered when the user performs human-computer interaction can be extracted from it. Specifically, the received voice can be judged in the following two ways:
  • the first way is to use the AD module to determine whether the voice received by the input module 102 is the voice uttered when the user performs human-computer interaction.
  • the voice a user utters toward a device generally differs from chat between people in intonation, speech rate, and emotional color; the AD module can use these differences to distinguish whether the user's voice is uttered during human-computer interaction or is chat voice between the user and other people.
  • the AD module is a binary classification model that operates on the input voice signal.
  • the voice received by the input module 102 is input to the AD module, and the AD module can output a result value.
  • this result value represents whether or not the voice received by the input module 102 is the voice uttered by the user during human-computer interaction.
  • the result value may also represent the probability that the voice received by the input module 102 is the voice uttered by the user during human-computer interaction; when the probability is greater than a corresponding threshold, the voice received by the input module 102 can be considered to be the voice uttered during human-computer interaction.
  • the AD module can be obtained by training on training samples.
  • the training samples of the AD module can be AD discriminant samples, intent recognition (NLU) samples, part-of-speech tagging (POS) samples, text pair adversarial samples, etc.
  • the AD discrimination sample may include a voice signal, and the AD discrimination result of the voice information indicates that the receiving object of the voice signal is a smart device or the receiving object of the voice signal is not a smart device.
  • Intent recognition (NLU) samples may include text information and user intentions (or user instructions) corresponding to the text information.
  • the part-of-speech tagging (POS) samples can include words and parts of speech.
  • Text pair adversarial examples include text pairs and the amount of interference between text pairs.
  • the loss function for the AD discrimination samples, intent recognition (NLU) samples, and part-of-speech tagging (POS) samples is the cross-entropy loss, and the loss function for the text-pair adversarial samples is the Euclidean distance between the vectors corresponding to the two texts. It should be noted that the loss function is used to calculate the error on the training samples, and the error of the AD module can be determined according to the loss function of each kind of training sample.
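  • The combined objective could look like the following PyTorch sketch. The equal task weights, the tensor shapes, and the use of a mean-squared penalty tying the embedding distance to the annotated interference amount are assumptions; the application only states the per-task loss types.

```python
import torch.nn.functional as F

def ad_training_loss(ad_logits, ad_labels,
                     nlu_logits, nlu_labels,
                     pos_logits, pos_labels,
                     emb_a, emb_b, interference):
    loss_ad = F.cross_entropy(ad_logits, ad_labels)     # AD discrimination
    loss_nlu = F.cross_entropy(nlu_logits, nlu_labels)  # intent recognition
    loss_pos = F.cross_entropy(pos_logits, pos_labels)  # POS tagging
    # Text-pair adversarial samples: Euclidean distance between the two
    # text vectors, compared with the annotated interference amount
    dist = F.pairwise_distance(emb_a, emb_b)
    loss_pair = F.mse_loss(dist, interference)
    return loss_ad + loss_nlu + loss_pos + loss_pair
```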
  • the second method is to determine whether the receiving object of the user's voice is the smart device according to the user's gaze object.
  • the smart device can also obtain user images.
  • the camera 60 in the car can take a user image and send the user image to the processor 103 of the smart device 10.
  • the processor 103 determines the gaze direction of the user according to the user image; when determining that the gaze direction of the user is the target direction, it determines that the user's intention is to perform human-computer interaction. Further, the user instruction may also be determined according to user voice information issued when the gaze direction of the user is the target direction.
  • the target direction may be a preset direction.
  • the direction may be a direction pointing to a certain device in the vehicle, for example, the target direction may be a direction toward a smart device; or the target direction may be a direction toward a collection device, for example, a target direction may be a direction toward a camera.
  • the line of sight is tracked by using the posture of the human head. Specifically, first, the yolo algorithm is used for face target detection, and after the face target is detected, 2D face key point detection is performed. Then the 3D face model matching is performed according to the detected 2D face key points. After matching the 3D face model, the posture angle of the face can be calculated according to the rotation relationship between the 3D face key points and the 2D face key points, and this angle is regarded as the user's line of sight angle. According to the user's gaze angle, it is determined whether the user is gazing at the smart device. If the user's gaze object is the smart device, it can be determined that the user's intention is to perform human-computer interaction.
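  • A sketch of this head-pose-based line-of-sight estimation using OpenCV. The generic 3D face model points, camera matrix, yaw extraction convention, and angle threshold are illustrative assumptions; the application does not prescribe a particular library.

```python
import cv2
import numpy as np

def gazing_at_target(landmarks_2d: np.ndarray, model_3d: np.ndarray,
                     camera_matrix: np.ndarray,
                     yaw_threshold_deg: float = 15.0) -> bool:
    """Estimate head pose from 2D/3D key-point correspondences and check
    whether the face (taken as the gaze direction) points at the device."""
    ok, rvec, _ = cv2.solvePnP(model_3d, landmarks_2d, camera_matrix, None)
    if not ok:
        return False
    rot, _ = cv2.Rodrigues(rvec)               # rotation vector -> matrix
    yaw = np.degrees(np.arctan2(-rot[2, 0],    # approximate yaw angle
                                np.hypot(rot[0, 0], rot[1, 0])))
    return abs(yaw) < yaw_threshold_deg
```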
  • the method described in the embodiment of the present application further includes: when the smart device determines that the received voice signal is chat voice between the user and other people, displaying a dynamic waveform on the display screen to indicate that the smart device is receiving external voice, without displaying the recognition result of the voice signal in real time.
  • the voice signal is converted into text information through the ASR module, and the text information can also be displayed on the display screen so that the user can judge whether the recognition result is accurate.
  • the microphone array of the smart device collects voice signals 1 to 4 and analyzes them; according to the intonation, speech rate, or emotional color of the voice signals, it judges that voice signals 1 to 4 are chat voice between the passenger and the driver, so no subsequent processing is performed, that is, the voice signals are not converted into text information to determine a user instruction.
  • the smart device determines the gaze object of the user (driver) according to the camera 60, and if the gaze object of the user is not the smart device, no subsequent processing is performed.
  • the central control display screen 40 may display a waveform to indicate that the user's voice is being received.
  • the driver issued a voice signal 5 "Turn on the air conditioner and turn it to 24 degrees.”
  • the microphone array of the smart device collects voice signal 5 and analyzes it; according to the intonation, speech rate, or emotional color of the voice signal, it judges that voice signal 5 is uttered by the driver to the device, so the voice signal is converted into text information, and the user instruction is determined to be "turn on the air conditioner and adjust to 24 degrees."
  • the smart device judges that the response content of the user instruction "turn on the air conditioner and adjust to 24 degrees" does not involve privacy, and then responds to the intention: it turns on the air conditioner in the car and adjusts the temperature to 24 degrees Celsius.
  • the driver issues a voice signal 6 "Check today's schedule”.
  • the microphone array of the smart device collects voice signal 6 and analyzes it, and judges that voice signal 6 is uttered toward the smart device 10 when the driver performs human-computer interaction; subsequent processing is then performed: the voice signal is converted into text information, and the user instruction is determined as "view today's schedule" according to the text information.
  • the smart device determines that the response content of the user's command "view today's schedule” is "schedule", which involves privacy, and determines that the user's current scene includes multiple people based on the user image, that is, the user is currently in a multi-person scene. Then, the response content of the user's instruction, that is, the user's schedule, is output through the non-public module. Or, when outputting the response content of the user instruction through the public module, hide the key person's name and location.
  • the user's schedule is "to participate in the bidding meeting of Company A at the High-tech Hotel at 14:40 today.”
  • the central control display screen 40 displays "You are participating in the company's bidding meeting at the ** Hotel at 14:40 today.”
  • the head-up display screen 20 displays "You are participating in the bidding meeting of Company A at the High-tech Hotel at 14:40 today.”
  • the AD module is added to the smart device to filter out invalid voice from the many received voice signals, reducing the feedback caused by false triggering on invalid voice and improving the user experience.
  • in addition, feedback mode decisions can be made, and the feedback mode can be dynamically adjusted based on the user's intention and the scenario the user is in. Both the feedback device and the feedback content can be adjusted, which better protects user privacy.
  • the embodiment of the present application also provides a voice interaction method. As shown in FIG. 10, the method includes the following steps:
  • the user's multi-modal information can be user voice information and user images.
  • the user's voice information can be an analog signal received by the smart device; the user's image can be an image taken by a camera in the car.
  • in the wake-up mode, the user voice input after the smart device is awakened by the wake-up word is valid; that is, after the wake-up word wakes up the system, the received voice is considered to be the voice uttered by the user during human-computer interaction.
  • when the smart device stays in the wake-up state for a long time, the voice received by the device may include chat voice between the user and other people. Therefore, the AD module can be used to determine whether the received voice is the voice uttered by the user during human-computer interaction.
  • the camera can also be used to determine the user's gaze object.
  • the user's gaze object is in the target direction, for example, the user's gaze direction points to the smart device, it may be determined that the received voice is the voice emitted by the user during human-computer interaction.
  • if it is determined that the received voice is the voice uttered during human-computer interaction, step 1003 is executed; otherwise, only a waveform is displayed on the display screen of the smart device to indicate that the device is receiving the user's voice.
  • for specific implementation, refer to the related description of step 402, which is not repeated here.
  • a private content list can be defined.
  • Common private content includes: SMS, WeChat, memo, etc.
  • the response content involving privacy can be text message content, WeChat content, or memo content.
  • whether the user is in a multi-person scene can be determined based on the user image obtained by the camera; for example, it is possible to determine whether there are multiple people in the car based on the user image. Privacy issues arise only when there are multiple people.
  • if the feedback content is broadcast through the in-car audio or presented on the central control display screen, there is some risk of privacy leakage.
  • in that case, step 1006 is executed to protect privacy; otherwise, step 1007 is executed to output the response content of the user instruction in a conventional manner.
  • the response content of the user instruction can be output through the non-public device of the smart device.
  • the response content of the user instruction is played through the headset worn by the driver user, or the response content of the user instruction is displayed through the driving position display screen.
  • before outputting through the privacy mode, it can first be detected whether the hardware conditions required by the privacy mode are available, such as a driving position display screen, or whether the driver is wearing a headset, and so on.
  • if the hardware conditions required by the privacy mode are met, for example, the driver is wearing a headset, the response content of the user instruction can be played through the headset.
  • if the hardware conditions required by the privacy mode are not met, the feedback content is adjusted to hide the user's private information.
  • the response content is displayed on the central control display screen, but private information such as key locations and names are hidden.
  • in the normal mode, the response content of the user instruction is output through a public device of the smart device.
  • the response content of the user's instruction is played through the in-car audio, or the response content of the user's instruction is displayed on the central control display screen.
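  • A sketch of the feedback-mode decision in steps 1005 to 1007, including the hardware check described above. The device probes and the `redact` helper (for example, the masking sketch shown earlier) are assumptions for illustration.

```python
def output_response(response_text, involves_privacy, multi_person,
                    headset_connected, driver_display_available,
                    redact=lambda t: t):
    """Return (device, content) for the response (illustrative only)."""
    if involves_privacy and multi_person:                 # step 1006
        if headset_connected:
            return ("driver_headset", response_text)      # non-public audio
        if driver_display_available:
            return ("driver_display", response_text)      # non-public display
        # No non-public hardware available: mask private content and
        # fall back to the public device
        return ("central_display", redact(response_text))
    return ("car_audio", response_text)                   # step 1007: normal
```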
  • FIG. 11 shows a possible schematic structural diagram of the device involved in the above embodiment (for example, the smart device described in the embodiment of the present application).
  • the device shown in FIG. 11 may be the smart device described in the embodiment of the present application, or may be a component in the smart device that implements the foregoing method.
  • the device includes an obtaining unit 1101, a processing unit 1102, and a transceiver unit 1103.
  • the processing unit may be one or more processors, and the transceiving unit may be a transceiver.
  • the obtaining unit 1101 is used to support the smart device to perform step 401 and/or other processes used in the technology described herein.
  • the processing unit 1102 is used to support the smart device to execute steps 401 to 404, and/or other processes used in the technology described herein.
  • the transceiver unit 1103 is used to support communication between the smart device and other devices or devices, and/or other processes used in the technology described herein. It can be the interface circuit or network interface of a smart device.
  • the structure shown in FIG. 11 may also be a structure of a chip applied to a smart device.
  • the chip may be a System-On-a-Chip (SOC) or a baseband chip with communication function.
  • the device includes: a processing module 1201 and a communication module 1202.
  • the processing module 1201 is used to control and manage the actions of the device, for example, to perform the steps performed by the above-mentioned obtaining unit 1101 and the processing unit 1102, and/or to perform other processes of the technology described herein.
  • the communication module 1202 is configured to perform the steps performed by the above-mentioned transceiver unit 1103, and support interaction between the device and other devices, such as interaction with other terminal devices.
  • the device may further include a storage module 1203, and the storage module 1203 is used to store program codes and data of the device.
  • when the processing module 1201 is a processor, the communication module 1202 is a transceiver, and the storage module 1203 is a memory, the device is the device shown in FIG. 2.
  • the embodiment of the present application provides a computer-readable storage medium, and the computer-readable storage medium stores instructions; the instructions are used to execute the method shown in FIG. 4 or FIG. 10.
  • the embodiment of the present application provides a computer program product including instructions, which when running on a device, enables the device to implement the method shown in FIG. 4 or FIG. 10.
  • an embodiment of the present application further provides a device in which instructions are stored; when the instructions run on the device shown in FIG. 2, FIG. 11, or FIG. 12, the device is caused to execute the foregoing method.
  • the device can be a chip or the like.
  • the division of modules in the embodiments of this application is illustrative, and it is only a logical function division. In actual implementation, there may be other division methods.
  • the functional modules in the various embodiments of this application can be integrated into one processing unit, or each module can exist alone physically, or two or more modules can be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • the methods provided in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • software When implemented by software, it can be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, network equipment, user equipment, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid-state drive (SSD)).
  • in the embodiments of this application, the embodiments can refer to each other. For example, methods and/or terms in the method embodiments can refer to each other, functions and/or terms in the device embodiments can refer to each other, and functions and/or terms between the device embodiments and the method embodiments can refer to each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Acoustics & Sound (AREA)
  • Mechanical Engineering (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An in-vehicle voice interaction method and device. The method includes: obtaining user voice information; determining a user instruction according to the user voice information; judging, according to the user instruction, whether the response content to the user instruction involves privacy; and determining, according to whether the response content involves privacy, whether to output the response content through a privacy protection mode. In this way, privacy can be protected from being leaked.

Description

In-vehicle voice interaction method and device

Technical Field

The embodiments of this application relate to the field of intelligent voice interaction, and in particular to an in-vehicle voice interaction method and device.

Background

Human-machine intelligent voice interaction is a major research direction in the fields of human-computer interaction science and artificial intelligence, used to realize effective information transfer between humans and computers in natural language. In the existing human-machine intelligent voice interaction technology, a user sends out a voice signal, and the device recognizes the voice and converts it into text. The text is sent to a natural language understanding (NLU) module for semantic parsing to obtain the user intent, and feedback text can also be generated according to the user intent parsed by the NLU module. A natural language generation (NLG) module then converts the feedback text into voice and plays the voice to the user, completing the human-machine intelligent voice interaction.

At present, user application scenarios are relatively complex. In the prior art, the feedback to the user's voice often corresponds only to the literal meaning of the user instruction, without considering privacy and security, which is very likely to cause privacy leakage.
Summary
The embodiments of this application provide an in-vehicle voice interaction method and device. In human-machine voice interaction, the device can give differentiated feedback for response content that involves privacy, thereby protecting privacy.
According to a first aspect, an in-vehicle voice interaction method is provided. The method includes: acquiring user speech information, where the user speech information may be an analog signal collected by an audio collection device (for example, a microphone array), or text information obtained by processing the collected analog signal; determining a user instruction according to the user speech information; further, judging, according to the user instruction, whether response content for the user instruction involves privacy; and determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode.
The embodiments of this application provide an in-vehicle voice interaction method that can give differentiated feedback to user instructions in different scenarios. In particular, response content involving privacy can be identified and fed back differently, and output in a privacy protection mode, to protect privacy as far as possible.
With reference to the first aspect, in a first possible implementation of the first aspect, the method further includes: acquiring a user image. Determining the user instruction according to the user speech information is specifically: judging the user's gaze direction according to the user image; when the gaze direction is judged to be a target direction, determining that the user's intent is to perform human-machine interaction; and determining the user instruction according to the user speech information uttered while the user's gaze direction is the target direction. The user image may be captured by an image collection component (for example, a camera module) integrated in the smart device with which the user interacts, or may be captured by an in-vehicle camera and then transmitted to the smart device. The target direction may be a preset direction, and may point to a device in the vehicle; for example, the target direction may point to the smart device, or may point to the collection device, for example, the camera.
In the method provided in the embodiments of this application, the user's gaze direction can be used to judge whether the user is performing human-machine interaction. If it is determined that the user's intent is to perform human-machine interaction, that is, the user speech information acquired by the smart device needs to be processed and responded to by the smart device, the subsequent steps are performed: determining the user instruction, judging whether the response content involves privacy, and so on. In a wake-word-free scenario or a long-wake scenario, this prevents chat between the user and other people from frequently triggering responses of the smart device by mistake.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode is specifically: when it is judged that the response content involves privacy and the user is in a single-person scenario, outputting the response content in a non-privacy mode.
In this embodiment of this application, although it is judged that the response content to the user instruction involves privacy, the user is in a single-person scenario and there is no risk of privacy disclosure, so the response content for the user instruction may be output in the non-privacy mode, for example, through a public device in the vehicle.
With reference to the first aspect or the first possible implementation of the first aspect, in a third possible implementation of the first aspect, determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode is specifically: when it is judged that the response content involves privacy and the user is in a multi-person scenario, outputting the response content in the privacy protection mode.
In this embodiment of this application, it is judged that the response content to the user instruction involves privacy, and because the user is in a multi-person scenario there is a risk of privacy disclosure, so the response content for the user instruction may be output in the privacy protection mode, for example, through a non-public device. The non-public device faces only the user and can effectively keep privacy from being disclosed.
With reference to the first aspect or the first possible implementation of the first aspect, in a fourth possible implementation of the first aspect, determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode is specifically: when it is judged that the response content involves privacy, outputting the response content in the privacy protection mode.
In this embodiment of this application, once it is judged that the response content to the user instruction involves privacy and there is a risk of privacy disclosure, the response content for the user instruction may be output in the privacy protection mode, for example, through a non-public device. The non-public device faces only the user and can effectively keep privacy from being disclosed.
With reference to the third or fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, outputting the response content in the privacy protection mode is specifically: when the response content is output through a public device, hiding privacy content included in the response content; or outputting the response content through a non-public device.
In this embodiment of this application, the user instruction can be responded to in either of the above two ways, and privacy disclosure can be effectively prevented while the user instruction is being responded to.
According to a second aspect, a device is provided, including: an acquisition unit, configured to acquire user speech information;
and a processing unit, configured to determine a user instruction according to the user speech information, where the processing unit is further configured to: judge, according to the user instruction, whether response content for the user instruction involves privacy; and determine, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode.
With reference to the second aspect, in a first possible implementation of the second aspect, the acquisition unit is further configured to acquire a user image; and the processing unit is specifically configured to: judge the user's gaze direction according to the user image; when the gaze direction is judged to be a target direction, determine that the user's intent is to perform human-machine interaction; and determine the user instruction according to the user speech information uttered while the user's gaze direction is the target direction.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the processing unit is specifically configured to: when it is judged that the response content involves privacy and the user is in a single-person scenario, output the response content in a non-privacy mode.
With reference to the second aspect or the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the processing unit is specifically configured to: when it is judged that the response content involves privacy and the user is in a multi-person scenario, output the response content in the privacy protection mode.
With reference to the second aspect or the first possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the processing unit is specifically configured to: when it is judged that the response content involves privacy, output the response content in the privacy protection mode.
With reference to the third or fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the processing unit is specifically configured to: when outputting the response content through a public device, hide privacy content included in the response content; or output the response content through a non-public device.
According to a third aspect, an apparatus is provided. The apparatus includes at least one processor and a memory, and the at least one processor is coupled to the memory; the memory is configured to store a computer program; and the at least one processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to the first aspect or any possible implementation of the first aspect.
The apparatus may be a terminal device, a server, or the like. The terminal device here includes but is not limited to a smartphone, an in-vehicle apparatus (for example, an autonomous driving device), a personal computer, an artificial intelligence device, a tablet, a personal digital assistant, a smart wearable device (for example, a smart watch, a smart band, or smart glasses), a smart voice device (for example, a smart speaker), a virtual reality/mixed reality/augmented reality display device, a network access device (for example, a gateway), and so on. The server may include a storage server, a computing server, or the like.
According to a fourth aspect, this application discloses a computer-readable storage medium, including: instructions stored in the computer-readable storage medium; when the instructions are run on the device according to the second aspect or any implementation of the second aspect, or on the apparatus according to the third aspect, the device or apparatus is caused to perform the method according to the first aspect or any implementation of the first aspect.
According to a fifth aspect, this application provides a chip, including an interface and a processor, where the processor is configured to obtain a computer program through the interface and implement the method according to the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, this application provides a chip, including multiple circuit modules, where the multiple circuit modules are configured to implement the method according to the first aspect or any possible implementation of the first aspect. In some implementations, the multiple circuit modules, together with a software program, implement the method according to the first aspect or any possible implementation of the first aspect.
Brief Description of Drawings
FIG. 1 shows a human-machine voice interaction scenario according to an embodiment of this application;
FIG. 2 is a structural block diagram of a smart device according to an embodiment of this application;
FIG. 3 is a schematic diagram of a human-machine voice interaction scenario according to an embodiment of this application;
FIG. 4 is a schematic flowchart of an in-vehicle voice interaction method according to an embodiment of this application;
FIG. 5 to FIG. 9 are schematic diagrams of an in-vehicle voice interaction method according to an embodiment of this application;
FIG. 10 is a schematic flowchart of a voice interaction method according to an embodiment of this application;
FIG. 11 is another structural block diagram of a smart device according to an embodiment of this application;
FIG. 12 is another structural block diagram of a smart device according to an embodiment of this application.
Description of Embodiments
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
First, terms used in the embodiments of the present invention are explained:
(1) Intent
In the embodiments of this application, the user's intent describes the user's needs, purposes, and so on. For example, when the user's intent is to perform human-machine interaction with the smart device, the user may wake the smart device with a wake word.
It should be noted that, in an intelligent voice interaction scenario, the user's intent being to perform human-machine interaction can be understood as the user issuing an instruction to the smart device by voice and expecting the smart device to respond to the user instruction.
(2) User speech information
In the embodiments of this application, the user speech information may be an analog signal received by the device, or text information obtained by the device from the analog signal.
(3) User instruction
In the embodiments of this application, a user instruction is an instruction initiated by the user that requires a response from the smart device, for example, "open messages" or "answer the call".
The method provided in the embodiments of this application applies to an in-vehicle human-machine voice interaction scenario. Referring to FIG. 1, in this scenario a user (for example, the driver) utters a speech signal, and the smart device can receive the user's speech signal. The smart device can further extract user speech information from the user's speech signal and determine a user instruction according to the user speech information, so as to respond to the user instruction.
For example, the user utters the speech signal "play a song". The smart device receives the speech signal and converts it into text information. It may further perform semantic parsing on the text information to determine the user instruction, and finally respond to the user instruction, for example, by running music player software and playing a song.
It should be noted that the working modes of the smart device include a wake mode and a wake-free mode. In the wake mode, the user needs to utter a wake word to wake the smart device before the smart device receives the user's speech signal; in the wake-free mode, the smart device can receive the user's speech signal without being woken by a wake word.
Referring to FIG. 2, the smart device 10 includes an output module 101, an input module 102, a processor 103, and a memory 104.
In a specific implementation, the output module 101 can communicate with the processor 103 and output the processing results of the processor. For example, the output module 101 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, a speaker, or the like.
The input module 102 can communicate with the processor 103 and can receive user input in multiple ways. For example, the input module 102 may be a mouse, a keyboard, a touchscreen device, a sensing device, a microphone array, or the like.
The processor 103 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the solutions of this application.
The memory 104 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and so on), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently, may be connected to the processor, or may be integrated with the processor. The memory 104 can also access various cloud services and cloud service management modules through the network interface of the smart device.
In the embodiments of this application, the processor 103 can run the software modules stored in the memory 104 to process the speech signal received by the input module 102, determine the user instruction, and respond to the user instruction through the output module 101. The software modules stored in the memory 104 include an addressee detection (AD) module, a natural language generation (NLG) module, a text-to-speech (TTS) module, an automatic speech recognition (ASR) module, a dialogue management (DM) module, and so on.
The AD module is used to binary-classify the speech received by the input module 102 and identify whether the speech was uttered by the user for human-machine interaction, that is, speech addressed by the user to the smart device. The AD module can also filter out the speech uttered by the user for human-machine interaction and feed that speech into the ASR module.
The ASR module can convert the speech signal received from the AD module into text information, and can feed the text information into the DM module.
The DM module can determine the user instruction according to the text information received from the ASR module. The DM module is also used for dialogue management, for example, determining an answer or feedback according to a question. The DM module can therefore also generate the response content for the user instruction, which may be text information, and can feed the response content for the user instruction into the NLG module.
The NLG module is used to generate, according to the response content for the user instruction, text information that conforms to natural language habits, and the text information can be displayed through the output module 101.
The TTS module is used to convert the text information generated by the NLG module into speech, and the speech can be played through the output module 101.
In a specific implementation, the vehicle may include other devices in addition to the smart device 10. For example, referring to FIG. 3, the vehicle also includes a driver's-seat head-up display 20, a headset 30 worn by the driver, a central control display 40, an in-vehicle speaker 50, a camera 60, and a driver's-seat micro speaker 70. The smart device 10 may be integrated with the central control display 40, while the head-up display 20, the driver's headset 30, the in-vehicle speaker 50, and the camera 60 may exist independently. The devices in the vehicle can interact with one another; for example, the camera 60 can transmit captured images to the smart device 10 for processing.
In the embodiments of this application, the devices in the vehicle can be divided into public devices and non-public devices. The content output by a public device faces many people, and many people can receive it; for example, many people can receive the speech played, or the text and images displayed, by the public device.
A non-public device faces a designated person (for example, the driver), and the designated person can receive the content output by the non-public device, for example, the speech played, or the text and images displayed, by the non-public device.
Taking the in-vehicle scenario shown in FIG. 3 as an example, the public device may be the in-vehicle speaker 50 or the central control display 40; the non-public device may be the headset 30 worn by the driver, the driver's-seat micro speaker 70, or the driver's-seat head-up display 20.
It should be noted that, in voice interaction between the smart device and the user, the feedback manner of the smart device has a very large impact on user experience. Merely understanding the user's intent or responding to the user instruction cannot produce differentiated responses for the different scenarios the user is in, and gives the user a poor experience. At present, voice interaction solutions between devices and users pay little attention to this aspect, with most of the focus on semantic understanding. In the existing technology, a device's feedback to user speech often corresponds only to the literal meaning of the user instruction, without considering the differences between scenarios.
The embodiments of this application provide an in-vehicle voice interaction method that can give differentiated feedback to user instructions in different scenarios. In particular, response content involving privacy can be identified and fed back differently, and output in a privacy protection mode, to protect privacy as far as possible.
It can be understood that, in the embodiments of this application, the terminal device and/or the network device may perform some or all of the steps in the embodiments of this application. These steps or operations are merely examples; the embodiments of this application may also perform other operations or variations of the operations. In addition, the steps may be performed in an order different from that presented in the embodiments of this application, and it is possible that not all of the operations in the embodiments of this application need to be performed.
An embodiment of this application provides an in-vehicle voice interaction method applicable to the in-vehicle scenario shown in FIG. 3. The method may be performed by the in-vehicle smart device 10. As shown in FIG. 4, the method includes the following steps:
401. Acquire user speech information.
In a specific implementation, the input module 102 of the smart device can receive speech (that is, an analog signal). The analog signal received by the input module 102 may be the user speech information described in the embodiments of this application. Alternatively, the input module 102 may feed the received speech into the processor 103 of the smart device, and the processor 103 (for example, the ASR module) may obtain text information from the analog signal; that text information may also be the user speech information described in the embodiments of this application.
For example, the input module 102 may be a microphone array. The microphone array can pick up the speech uttered by the user, and the user speech information may be the speech picked up by the microphone array.
402. Determine a user instruction according to the user speech information.
In this embodiment of this application, after the input module 102 of the smart device acquires the analog signal, the ASR module converts the analog signal into text information and can feed the text information into the DM module. The DM module can perform semantic parsing on the text information to determine the user instruction.
The DM module can also generate the response content for the user instruction according to natural dialogue habits. The response content for the user instruction generated by the DM module may be text information.
In one possible implementation, the DM module can also determine the slots of the user instruction by semantically parsing the text information fed in by the ASR module. A slot of a user instruction can be regarded as a parameter of the user instruction. For example, for the user instruction "set the air conditioner temperature to 26 degrees", "26 degrees" is the slot (or parameter) of the user instruction.
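As an illustration of slot parsing, the following minimal Python sketch shows how a rule-based DM module might extract the temperature slot from an air-conditioner instruction. The intent name and the pattern are illustrative assumptions, not the implementation of this application; a production DM module would normally rely on a trained semantic parser.

import re

def parse_ac_instruction(text: str) -> dict:
    # Rule-based sketch: extract the temperature slot from an instruction
    # such as "set the air conditioner temperature to 26 degrees".
    match = re.search(r"(\d+)\s*degrees?", text)
    if "air conditioner" in text and match:
        return {"intent": "set_ac_temperature",            # assumed intent name
                "slots": {"temperature": int(match.group(1))}}
    return {"intent": "unknown", "slots": {}}

print(parse_ac_instruction("set the air conditioner temperature to 26 degrees"))
# -> {'intent': 'set_ac_temperature', 'slots': {'temperature': 26}}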
403. Judge, according to the user instruction, whether the response content for the user instruction involves privacy.
Specifically, it can be judged whether the response content generated by the DM module includes privacy content. If the response content for the user instruction includes privacy content, it is judged that the response content for the user instruction involves privacy.
In one possible implementation, the memory 104 of the smart device can store a privacy content list including at least one item of privacy content. The processor 103 queries the privacy content list stored in the memory 104; if the response content for the user instruction includes one or more items of privacy content in the privacy content list, it is determined that the response content for the user instruction involves privacy.
For example, privacy content related to WeChat is recorded as privacy content 1, and privacy content related to memos is recorded as privacy content 2. The privacy content list may include privacy content 1 and privacy content 2. When the response content for the user instruction includes privacy content 1 or privacy content 2, it is determined that the response content for the user instruction involves privacy.
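The list lookup in step 403 can be illustrated with a short Python sketch. The keywords below are illustrative stand-ins for the privacy content entries that would be stored in the memory 104; they are not a list defined by this application.

PRIVACY_KEYWORDS = ["wechat", "memo", "sms", "schedule"]   # illustrative entries

def involves_privacy(response_content: str) -> bool:
    # Step 403: the response involves privacy if it contains
    # any entry from the privacy content list.
    text = response_content.lower()
    return any(keyword in text for keyword in PRIVACY_KEYWORDS)

print(involves_privacy("Your schedule: bid meeting at 14:40"))   # True
print(involves_privacy("Air conditioner set to 24 degrees"))     # False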
404. Determine, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode.
In a specific implementation, only when the response content for the user instruction involves privacy is it judged whether to output the response content in a privacy protection mode, so as to protect the user's privacy. When the response content for the user instruction does not involve privacy, the response content is output in the normal way, for example, in a non-privacy mode.
In one possible implementation, when the processor 103 of the smart device judges that the response content for the user instruction involves privacy but the user is in a single-person scenario, the response content is output in the non-privacy mode.
In another possible implementation, when the processor 103 of the smart device judges that the response content for the user instruction involves privacy and the user is in a multi-person scenario, the response content is output in the privacy protection mode.
In another possible implementation, when the processor 103 of the smart device judges that the response content for the user instruction involves privacy, the response content is output in the privacy protection mode.
It should be noted that the in-vehicle camera 60 can capture a user image and send it to the smart device 10. The processor 103 of the smart device 10 can also parse and process the user image. If multiple person images are parsed from the user image, it is determined that the user's current scenario includes multiple people, that is, the user is in a multi-person scenario. If one person image is parsed from the user image, it is determined that the user is currently in a single-person scenario.
In a specific implementation, the processor 103 can use the YOLO algorithm to perform face detection on the user image, and then determine the number of people in the scenario, for example, the number of people in the vehicle, according to the number of detected face targets. Whether the user is in a single-person or multi-person scenario is judged according to the number of people in the scenario.
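Putting the two judgments together, step 404 can be sketched as a small Python decision function. The face count is assumed to come from a face detector (for example, YOLO) run on the cabin image; the detector itself is not shown here.

from enum import Enum

class OutputMode(Enum):
    NON_PRIVACY = "non-privacy mode"                 # public device, full content
    PRIVACY_PROTECTION = "privacy protection mode"

def choose_output_mode(response_involves_privacy: bool,
                       num_faces_detected: int) -> OutputMode:
    # Step 404 (third possible implementation): use the privacy protection
    # mode only when the response involves privacy and more than one
    # person is detected in the cabin (multi-person scenario).
    if response_involves_privacy and num_faces_detected > 1:
        return OutputMode.PRIVACY_PROTECTION
    return OutputMode.NON_PRIVACY

print(choose_output_mode(True, 2))   # OutputMode.PRIVACY_PROTECTION
print(choose_output_mode(True, 1))   # OutputMode.NON_PRIVACY (single-person scenario)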
In a specific implementation, the smart device can output the response content for the user instruction in the following two privacy protection modes. Here, "output" means that the smart device presents the response content for the user instruction: when the response content is text information, it can be displayed on a display; when the response content is speech, it can be played through a speaker. The two privacy protection modes are as follows:
First, when the smart device outputs the response content through a public device, the privacy content included in the response content is hidden.
To complete the intelligent human-machine voice interaction and respond to the user instruction issued by voice, the response content for the user instruction can be output on a public device. Because the public device faces many people, the user's privacy may be disclosed; therefore, when the response content for the user instruction is output on the public device, the privacy content included in the response content can be hidden.
Outputting the response content for the user instruction through a public device may be displaying the response content on a public display (for example, the in-vehicle central control display) while hiding the privacy content, for example, hiding key information such as names of people and places.
It can be understood that hiding the privacy content may be covering it with a special image (for example, a mosaic), or not displaying it at all and replacing it with special characters, displaying only the content that does not involve privacy.
In the embodiments of this application, outputting the response content for the user instruction through a public device may also be playing the response content through a public audio system (for example, the in-vehicle speakers), but the privacy content in the response content must not be played; for example, key information such as names of people and places is hidden, and only the content that does not involve privacy is played.
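Hiding privacy content by character replacement can be sketched as follows; the entity list is an illustrative assumption, and a real system might obtain the sensitive names and places from named-entity recognition instead.

import re

SENSITIVE_ENTITIES = ["Gaoxin Hotel", "Company A"]   # illustrative entities

def mask_private_content(response: str) -> str:
    # Replace each sensitive entity with asterisks so that only the
    # non-private part of the response is displayed or played.
    for entity in SENSITIVE_ENTITIES:
        response = re.sub(re.escape(entity), "*" * len(entity), response)
    return response

print(mask_private_content(
    "You will attend Company A's bid meeting at Gaoxin Hotel at 14:40"))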
Second, the smart device outputs the response content through a non-public device.
To complete the intelligent human-machine voice interaction and respond to the user instruction issued by voice, the response content for the user instruction can be output on a non-public device. Because the non-public device faces only the user of the smart device (for example, the driver), outputting the response content for the user instruction on the non-public device can protect the user's privacy content.
Outputting the response content for the user instruction through a non-public device may be displaying the response content on a non-public display (for example, the driver's-seat head-up display), or playing the response content through a non-public audio system (for example, the headset worn by the driver).
It should be noted that if the user's scenario includes multiple people, the speech received by the input module 102 of the smart device has two possibilities: one is a real speech signal input by the user to the device (that is, words the user speaks to the device); the other is chat between users, which is noise for the smart device when it determines the real user instruction.
It is usually considered that the speech signal after the user wakes the smart device with a wake word is valid: the smart device receives the wake word uttered by the user and, after waking, receives the user's speech, determines the user instruction according to the received user speech, and responds to the user instruction.
When the smart device stays awake for a long time, much of the speech received by the input module 102 is user chat. To prevent the device from giving unnecessary feedback to such speech, the received speech can be discriminated to extract the speech uttered by the user for human-machine interaction. Specifically, the received speech can be discriminated in the following two ways:
First, the AD module judges whether the speech received by the input module 102 was uttered by the user for human-machine interaction.
It should be noted that the speech rate, tone, prosody, or emotional color of chat between users often differs from that of human-machine interaction speech, and these differences can be used to judge whether the addressee of a segment of speech is the smart device. In the embodiments of this application, the AD module can use these differences to distinguish whether user speech was uttered for human-machine interaction or is chat between the user and other people.
Specifically, the AD module is a model that performs binary classification on the input speech signal. The speech received by the input module 102 is fed into the AD module, and the AD module can output a result value. The result value indicates that the speech received by the input module 102 was, or was not, uttered by the user for human-machine interaction. Alternatively, the result value may represent the probability that the speech received by the input module 102 was uttered by the user for human-machine interaction; when the probability is greater than a corresponding threshold, the speech received by the input module 102 can be considered to have been uttered by the user for human-machine interaction.
The AD module can be obtained by training on training samples, which may be AD discrimination samples, intent recognition (NLU) samples, part-of-speech (POS) tagging samples, text-pair adversarial samples, and so on. An AD discrimination sample may include a speech signal, with the AD discrimination result indicating that the addressee of the speech signal is, or is not, the smart device. An intent recognition (NLU) sample may include text information and the user intent (or user instruction) corresponding to the text information. A POS tagging sample may include a word and its part of speech. A text-pair adversarial sample includes a text pair and the amount of perturbation between the two texts.
The loss function for the AD discrimination samples, intent recognition (NLU) samples, and POS tagging samples is the cross-entropy loss; the loss function for the text-pair adversarial samples is the Euclidean distance between the vectors corresponding to the two texts. It should be noted that the loss function is used to compute the error on the training samples; the error of the AD module can be determined from the loss functions of the individual training samples.
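As a minimal sketch of such a binary classifier, the following PyTorch snippet scores an utterance-level feature vector and applies the threshold described above. The architecture, the 40-dimensional feature input, and the 0.5 threshold are assumptions for illustration; this application does not specify the AD module's internals, and the multi-task training on NLU, POS, and adversarial text-pair samples is not reproduced here.

import torch
import torch.nn as nn

class ADClassifier(nn.Module):
    # Minimal addressee-detection sketch: one logit indicating whether
    # an utterance is addressed to the device. Architecture is assumed.
    def __init__(self, feature_dim: int = 40, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

model = ADClassifier()
loss_fn = nn.BCEWithLogitsLoss()            # cross-entropy loss for the two classes

features = torch.randn(8, 40)               # batch of utterance-level features
labels = torch.randint(0, 2, (8,)).float()  # 1 = addressed to the device
loss = loss_fn(model(features), labels)     # training error, as described above

# At inference time, accept the utterance as device-directed only when
# the predicted probability exceeds the corresponding threshold.
prob = torch.sigmoid(model(features[0:1]))
is_device_directed = bool(prob.item() > 0.5)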
Second, whether the addressee of the user speech is the smart device is judged according to the object of the user's gaze.
Usually, when a user speaks to the smart device, the user gazes at the smart device at the same time; therefore, when it is judged that the object of the user's gaze is the smart device, it can be determined that the addressee of the user speech is the smart device.
In a specific implementation, the smart device can also acquire a user image. For example, the in-vehicle camera 60 can capture a user image and send it to the processor 103 of the smart device 10.
The processor 103 judges the user's gaze direction according to the user image; when the gaze direction is judged to be the target direction, it is determined that the user's intent is to perform human-machine interaction. Further, the user instruction can be determined according to the user speech information uttered while the user's gaze direction is the target direction.
In the embodiments of this application, the target direction may be a preset direction, and may point to a device in the vehicle; for example, the target direction may point to the smart device, or may point to the collection device, for example, the camera.
In one possible implementation, gaze tracking is performed using the head pose. Specifically, the YOLO algorithm is first used for face detection; after a face target is detected, 2D facial landmark detection is performed. A 3D face model is then matched according to the detected 2D facial landmarks. After the 3D face model is matched, the pose angle of the face can be solved from the rotation relationship between the 3D facial landmarks and the 2D facial landmarks, and this angle is used as the user's gaze angle. Whether the user is gazing at the smart device is judged according to the user's gaze angle; if the object of the user's gaze is the smart device, it can be determined that the user's intent is to perform human-machine interaction.
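The head-pose computation can be sketched with OpenCV's standard solvePnP-based approach, assuming six 2D facial landmarks have already been detected. The generic 3D model coordinates, the pinhole camera approximation, and the 15-degree yaw threshold are illustrative assumptions rather than parameters of this application.

import cv2
import numpy as np

# Generic 3D face-model points (nose tip, chin, eye corners, mouth corners)
# in an arbitrary model coordinate frame; the values are illustrative.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def gaze_is_target(image_points: np.ndarray, frame_w: int, frame_h: int,
                   yaw_threshold_deg: float = 15.0) -> bool:
    # Solve the head pose from the six detected 2D landmarks and treat a
    # small yaw (face turned toward the camera) as the target direction.
    focal = frame_w   # rough pinhole approximation
    camera_matrix = np.array([[focal, 0, frame_w / 2],
                              [0, focal, frame_h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))   # assume no lens distortion
    ok, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points,
                               camera_matrix, dist_coeffs)
    if not ok:
        return False
    rmat, _ = cv2.Rodrigues(rvec)
    angles, *_ = cv2.RQDecomp3x3(rmat)           # Euler angles in degrees
    return abs(angles[1]) < yaw_threshold_deg    # angles[1] is the yaw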
Optionally, the method described in the embodiments of this application further includes: when the smart device judges that the received speech signal is chat between the user and other people, a dynamic waveform is shown on the display to indicate that the smart device is receiving external speech, and the recognition result of the speech signal is not displayed in real time.
Only when it is judged that the received speech signal is addressed by the user to the device is the speech signal converted into text information by the ASR module; the text information can also be shown on the display so that the user can judge whether the recognition result is accurate.
Taking the scenario shown in FIG. 3 as an example, the driver utters speech signal 1, "Have you had breakfast?"; the front passenger replies with speech signal 2, "Not yet, I didn't have time"; the driver utters speech signal 3, "What time did you get up?"; and the front passenger replies with speech signal 4, "Pretty late."
The microphone array of the smart device collects speech signals 1 to 4 and analyzes them. According to the tone, speech rate, or emotional color of the speech signals, it judges that speech signals 1 to 4 are chat between the passenger and the driver, and therefore performs no further processing; that is, it does not convert the speech signals into text information to determine a user instruction.
Alternatively, the smart device determines the object of the user's (driver's) gaze with the camera 60; if the object of the user's gaze is not the smart device, no further processing is performed.
Optionally, referring to FIG. 5, the central control display 40 can show a waveform to indicate that user speech is being received.
The driver utters speech signal 5, "Turn on the air conditioner and set it to 24 degrees."
The microphone array of the smart device collects speech signal 5 and analyzes it. According to the tone, speech rate, or emotional color of the speech signal, it judges that speech signal 5 is addressed by the driver to the device, performs further processing, converts the speech signal into text information, and determines that the user instruction is "turn on the air conditioner and set it to 24 degrees".
Further, the smart device judges that the response content for the user instruction "turn on the air conditioner and set it to 24 degrees" does not involve privacy, so it acts on the intent: it turns on the in-vehicle air conditioner and sets the temperature to 24 degrees Celsius.
The driver utters speech signal 6, "Check today's schedule."
The microphone array of the smart device collects speech signal 6 and analyzes it. According to the tone, speech rate, or emotional color of the speech signal, it judges that speech signal 6 is addressed by the driver to the smart device 10 for human-machine interaction, performs further processing, converts the speech signal into text information, and determines from the text information that the user instruction is "check today's schedule".
Further, the smart device judges that the response content for the user instruction "check today's schedule" is the schedule, which involves privacy, and judges from the user image that the user's current scenario includes multiple people, that is, the user is currently in a multi-person scenario. The response content for the user instruction, namely the user's schedule, is therefore output through a non-public module; or, when the response content for the user instruction is output through a public module, key names of people and places are hidden.
For example, the user's schedule is "attend Company A's bid meeting at Gaoxin Hotel at 14:40 today". Referring to FIG. 6, the central control display 40 shows "You will attend * Company's bid meeting at ** Hotel at 14:40 today".
Alternatively, referring to FIG. 7, the in-vehicle speaker 50 plays the speech "You have a bid meeting to attend at 14:40 today."
Alternatively, referring to FIG. 8, the head-up display 20 shows "You will attend Company A's bid meeting at Gaoxin Hotel at 14:40 today".
Alternatively, referring to FIG. 9, the headset 30 plays the speech "You will attend Company A's bid meeting at Gaoxin Hotel at 14:40 today."
In the method provided in the embodiments of this application, adding the AD module to the smart device filters out many invalid speech signals, reduces feedback triggered by mistake by invalid speech, and improves the user experience. In addition, feedback-mode decisions can be made, dynamically adjusting the feedback manner based on the user's intent and scenario. Not only can the feedback device be adjusted, but the feedback content can also be adjusted, which better protects the user's privacy.
An embodiment of this application further provides a voice interaction method. As shown in FIG. 10, the method includes the following steps:
1001. Acquire multimodal information of the user.
The multimodal information of the user may include user speech information and a user image. The user speech information may be an analog signal received by the smart device; the user image may be an image captured by the in-vehicle camera.
1002. Judge whether the user's intent is to perform human-machine interaction.
In one possible implementation, it is usually considered that user speech input after the smart device is woken by a wake word is valid; that is, after the wake word wakes the system, the received speech was uttered by the user for human-machine interaction.
In another possible implementation, the smart device stays awake for a long time. In this long-wake case, the speech received by the device may include chat between the user and other people. Therefore, the AD module can be used to discriminate whether the received speech was uttered by the user for human-machine interaction.
Alternatively, the camera can be used to determine the object of the user's gaze. When the user's gaze is in the target direction, for example, when the user's gaze direction points to the smart device, it can be determined that the received speech was uttered by the user for human-machine interaction.
If the received speech was uttered by the user for human-machine interaction, step 1003 is performed; otherwise, only a waveform is shown on the smart device's display to indicate that the device is receiving user speech.
1003. Determine the user instruction according to the speech information.
For the specific implementation, refer to the related description of step 402; details are not repeated here.
1004. Judge whether the response content for the user instruction involves privacy.
Specifically, a privacy content list can be defined; common privacy content includes SMS messages, WeChat, memos, and so on. Response content involving privacy may be SMS content, WeChat content, or memo content. When the response content for the user instruction does not include privacy content from the privacy content list, step 1007 is performed directly, displaying the response content for the user instruction in the normal way; when the response content for the user instruction includes privacy content from the privacy content list, further judgment and decision-making follow, and step 1005 is performed.
1005. Judge whether the user is in a multi-person scenario.
Specifically, whether the user is in a multi-person scenario can be judged based on the user image acquired by the camera; for example, the user image can be used to determine whether there are multiple people in the vehicle. A privacy problem arises only when there are multiple people: in a multi-person scenario, both broadcasting the feedback content through the in-vehicle speakers and presenting it on the central control display carry some risk of privacy disclosure.
Therefore, when it is judged that there are multiple people in the vehicle, it is determined that the user is in a multi-person scenario, and step 1006 is performed to protect privacy. Otherwise, step 1007 is performed and the response content for the user instruction is output in the normal way.
1006. Output the response content for the user instruction in the privacy protection mode.
In a specific implementation, the response content for the user instruction can be output through a non-public device of the smart device, for example, played through the headset worn by the driver, or displayed on the driver's-seat display.
For example, it can first be detected whether the hardware conditions required for the privacy mode exist, such as a driver's-seat display, or whether the driver is wearing a headset. When the hardware conditions required for the privacy mode are met, for example, the driver is wearing a headset, the response content for the user instruction can be played through the headset.
When the required hardware environment does not exist, the feedback content is adjusted to hide the user's privacy information; for example, the response content is displayed on the central control display, but key privacy information such as places and names of people is hidden.
1007. Output the response content for the user instruction in the normal mode.
Outputting the response content for the user instruction in the normal mode means outputting it through a public device of the smart device, for example, playing it through the in-vehicle speakers or displaying it on the central control display.
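The decision flow of steps 1004 to 1007 can be summarized in one Python routing function. The device names are illustrative, and mask_private_content is the masking helper sketched earlier.

def route_response(response: str, involves_privacy: bool,
                   multi_person: bool, headset_available: bool):
    # Steps 1004-1005: no privacy content, or a single-person scenario,
    # means the normal mode through a public device (step 1007).
    if not involves_privacy or not multi_person:
        return ("in-vehicle speaker / central control display", response)
    # Step 1006: prefer a non-public device when the hardware exists.
    if headset_available:
        return ("driver's headset", response)
    # Fallback: public device, but with the privacy content hidden
    # (mask_private_content is the helper sketched earlier).
    return ("central control display", mask_private_content(response))

print(route_response(
    "You will attend Company A's bid meeting at Gaoxin Hotel at 14:40",
    involves_privacy=True, multi_person=True, headset_available=False))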
When each functional module is divided corresponding to each function, FIG. 11 shows a possible schematic structural diagram of the device involved in the above embodiments (for example, the smart device described in the embodiments of this application). For example, the device shown in FIG. 11 may be the smart device described in the embodiments of this application, or a component in the smart device that implements the above method. As shown in FIG. 11, the device includes an acquisition unit 1101, a processing unit 1102, and a transceiver unit 1103. The processing unit may be one or more processors, and the transceiver unit may be a transceiver.
The acquisition unit 1101 is configured to support the smart device in performing step 401, and/or other processes of the technology described herein.
The processing unit 1102 is configured to support the smart device in performing steps 401 to 404, and/or other processes of the technology described herein.
The transceiver unit 1103 is configured to support communication between the smart device and other devices, and/or other processes of the technology described herein. It may be an interface circuit or a network interface of the smart device.
It should be noted that all relevant content of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules; details are not repeated here.
In one possible implementation, the structure shown in FIG. 11 may also be the structure of a chip applied in the smart device. The chip may be a system-on-a-chip (SOC), a baseband chip with communication functions, or the like.
For example, when an integrated unit is used, a schematic structural diagram of the device provided in the embodiments of this application is shown in FIG. 12. In FIG. 12, the device includes a processing module 1201 and a communication module 1202. The processing module 1201 is configured to control and manage the actions of the device, for example, performing the steps performed by the above acquisition unit 1101 and processing unit 1102, and/or other processes of the technology described herein. The communication module 1202 is configured to perform the steps performed by the above transceiver unit 1103 and to support interaction between the device and other devices, such as interaction with other terminal devices. As shown in FIG. 12, the device may further include a storage module 1203, configured to store program code and data of the device.
When the processing module 1201 is a processor, the communication module 1202 is a transceiver, and the storage module 1203 is a memory, the device is the device shown in FIG. 2.
An embodiment of this application provides a computer-readable storage medium storing instructions; the instructions are used to perform the method shown in FIG. 4 or FIG. 10.
An embodiment of this application provides a computer program product including instructions that, when run on a device, cause the device to implement the method shown in FIG. 4 or FIG. 10.
An embodiment of this application provides a wireless device, in which instructions are stored; when the wireless device runs on the device shown in FIG. 2, FIG. 11, or FIG. 12, the device is caused to implement the method shown in FIG. 4 or FIG. 10. The device may be a chip or the like.
It should be noted that all relevant content of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules; details are not repeated here.
The division of modules in the embodiments of this application is illustrative and is merely a logical function division; there may be other division manners in actual implementation. In addition, the functional modules in the embodiments of this application may be integrated in one processor, may exist physically separately, or two or more modules may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module.
The methods provided in the embodiments of this application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, an SSD), or the like.
In the embodiments of this application, provided there is no logical contradiction, the embodiments may be mutually referenced; for example, the methods and/or terms in different method embodiments may refer to each other, the functions and/or terms in different device embodiments may refer to each other, and the functions and/or terms between a device embodiment and a method embodiment may refer to each other.
Those skilled in the art can make various changes and variations to this application without departing from the scope of this application. Thus, if these modifications and variations of this application fall within the scope of the methods provided in the embodiments of this application and their equivalent technologies, this application is also intended to include these changes and variations.

Claims (14)

  1. An in-vehicle voice interaction method, comprising:
    acquiring user speech information;
    determining a user instruction according to the user speech information;
    judging, according to the user instruction, whether response content for the user instruction involves privacy; and
    determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode.
  2. The method according to claim 1, wherein the method further comprises: acquiring a user image; and
    the determining a user instruction according to the user speech information is specifically:
    judging a gaze direction of the user according to the user image;
    when the gaze direction is judged to be a target direction, determining that the user's intent is to perform human-machine interaction; and
    determining the user instruction according to user speech information uttered while the gaze direction is the target direction.
  3. The method according to claim 1 or 2, wherein the determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode is specifically:
    when it is judged that the response content involves privacy and the user is in a single-person scenario, outputting the response content in a non-privacy mode.
  4. The method according to claim 1 or 2, wherein the determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode is specifically:
    when it is judged that the response content involves privacy and the user is in a multi-person scenario, outputting the response content in the privacy protection mode.
  5. The method according to claim 1 or 2, wherein the determining, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode is specifically:
    when it is judged that the response content involves privacy, outputting the response content in the privacy protection mode.
  6. The method according to claim 4 or 5, wherein the outputting the response content in the privacy protection mode is specifically:
    when outputting the response content through a public device, hiding privacy content comprised in the response content; or
    outputting the response content through a non-public device.
  7. A device, comprising:
    an acquisition unit, configured to acquire user speech information; and
    a processing unit, configured to determine a user instruction according to the user speech information,
    wherein the processing unit is further configured to: judge, according to the user instruction, whether response content for the user instruction involves privacy; and determine, according to whether the response content involves privacy, whether to output the response content in a privacy protection mode.
  8. The device according to claim 7, wherein the acquisition unit is further configured to acquire a user image; and
    the processing unit is specifically configured to: judge a gaze direction of the user according to the user image;
    when the gaze direction is judged to be a target direction, determine that the user's intent is to perform human-machine interaction; and
    determine the user instruction according to user speech information uttered while the gaze direction is the target direction.
  9. The device according to claim 7 or 8, wherein the processing unit is specifically configured to: when it is judged that the response content involves privacy and the user is in a single-person scenario, output the response content in a non-privacy mode.
  10. The device according to claim 7 or 8, wherein the processing unit is specifically configured to: when it is judged that the response content involves privacy and the user is in a multi-person scenario, output the response content in the privacy protection mode.
  11. The device according to claim 7 or 8, wherein the processing unit is specifically configured to: when it is judged that the response content involves privacy, output the response content in the privacy protection mode.
  12. The device according to claim 10 or 11, wherein the processing unit is specifically configured to: when outputting the response content through a public device, hide privacy content comprised in the response content; or
    output the response content through a non-public device.
  13. An apparatus, comprising at least one processor and a memory, wherein the at least one processor is coupled to the memory;
    the memory is configured to store a computer program; and
    the at least one processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program or instructions that, when run, implement the method according to any one of claims 1 to 6.
PCT/CN2020/087913 2020-04-29 2020-04-29 In-vehicle voice interaction method and device WO2021217527A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP20933148.7A EP4138355A4 (en) 2020-04-29 2020-04-29 VOICE INTERACTION METHOD AND APPARATUS IN A VEHICLE
PCT/CN2020/087913 WO2021217527A1 (zh) In-vehicle voice interaction method and device
CN202080004874.8A CN112673423A (zh) In-vehicle voice interaction method and device
US17/976,339 US20230048330A1 (en) 2020-04-29 2022-10-28 In-Vehicle Speech Interaction Method and Device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/087913 WO2021217527A1 (zh) In-vehicle voice interaction method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/976,339 Continuation US20230048330A1 (en) 2020-04-29 2022-10-28 In-Vehicle Speech Interaction Method and Device

Publications (1)

Publication Number Publication Date
WO2021217527A1 true WO2021217527A1 (zh) 2021-11-04

Family

ID=75413920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087913 WO2021217527A1 (zh) In-vehicle voice interaction method and device

Country Status (4)

Country Link
US (1) US20230048330A1 (zh)
EP (1) EP4138355A4 (zh)
CN (1) CN112673423A (zh)
WO (1) WO2021217527A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115085988A (zh) * 2022-06-08 2022-09-20 Guangdong Zhongchuang Zhijia Scientific Research Co., Ltd. Privacy violation detection method, system, device, and storage medium for smart voice devices

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107465678A (zh) * 2017-08-04 2017-12-12 Shanghai Pateo Yuezhen Network Technology Service Co., Ltd. Privacy information control system and method
KR20190062982A (ko) * 2017-11-29 2019-06-07 Samsung Electronics Co., Ltd. Electronic device and operation method of electronic device
CN110033774A (zh) * 2017-12-07 2019-07-19 InterDigital CE Patent Holdings Device and method for privacy-preserving voice interaction
US20190347387A1 (en) * 2018-05-08 2019-11-14 Covidien Lp Automated voice-activated medical assistance
US20200082123A1 (en) * 2017-08-24 2020-03-12 International Business Machines Corporation Selective enforcement of privacy and confidentiality for optimization of voice applications

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856948B1 (en) * 2013-12-23 2014-10-07 Google Inc. Displaying private information on personal devices
WO2016157658A1 (ja) * 2015-03-31 2016-10-06 Sony Corporation Information processing device, control method, and program
JP6447578B2 (ja) * 2016-05-27 2019-01-09 Toyota Motor Corporation Voice dialogue device and voice dialogue method
CN108595011A (zh) * 2018-05-03 2018-09-28 Beijing JD Finance Technology Holding Co., Ltd. Information display method and apparatus, storage medium, and electronic device
CN110493449A (zh) * 2018-05-15 2019-11-22 Shanghai Pateo Yuezhen Network Technology Service Co., Ltd. Vehicle and method for setting its privacy policy in real time based on the number of occupants
CN109814448A (zh) * 2019-01-16 2019-05-28 Beijing 7invensun Technology Co., Ltd. In-vehicle multimodal control method and system
CN110908513B (zh) * 2019-11-18 2022-05-06 Vivo Mobile Communication Co., Ltd. Data processing method and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107465678A (zh) * 2017-08-04 2017-12-12 Shanghai Pateo Yuezhen Network Technology Service Co., Ltd. Privacy information control system and method
US20200082123A1 (en) * 2017-08-24 2020-03-12 International Business Machines Corporation Selective enforcement of privacy and confidentiality for optimization of voice applications
KR20190062982A (ko) * 2017-11-29 2019-06-07 Samsung Electronics Co., Ltd. Electronic device and operation method of electronic device
CN110033774A (zh) * 2017-12-07 2019-07-19 InterDigital CE Patent Holdings Device and method for privacy-preserving voice interaction
US20190347387A1 (en) * 2018-05-08 2019-11-14 Covidien Lp Automated voice-activated medical assistance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4138355A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115085988A (zh) * 2022-06-08 2022-09-20 Guangdong Zhongchuang Zhijia Scientific Research Co., Ltd. Privacy violation detection method, system, device, and storage medium for smart voice devices

Also Published As

Publication number Publication date
EP4138355A4 (en) 2023-03-01
EP4138355A1 (en) 2023-02-22
CN112673423A (zh) 2021-04-16
US20230048330A1 (en) 2023-02-16

Similar Documents

Publication Publication Date Title
US11380331B1 (en) Virtual assistant identification of nearby computing devices
US11670302B2 (en) Voice processing method and electronic device supporting the same
JP6697024B2 (ja) Reducing the need for manual start/end point specification and trigger phrases
US11714598B2 (en) Feedback method and apparatus of electronic device for confirming user's intention
US11435980B2 (en) System for processing user utterance and controlling method thereof
US10811008B2 (en) Electronic apparatus for processing user utterance and server
US11749285B2 (en) Speech transcription using multiple data sources
KR20190101630A (ko) System for processing user utterance and control method thereof
KR102339819B1 (ko) Method and apparatus for generating natural language expressions using a framework
KR102508677B1 (ko) System for processing user utterance and control method thereof
CN112292724A (zh) Dynamic and/or context-specific hotwords for invoking an automated assistant
US20230386461A1 (en) Voice user interface using non-linguistic input
KR20210137118A (ko) System and method for context-rich attentive memory networks with global and local encoding for dialogue breakdown detection
WO2020015473A1 (zh) Interaction method and apparatus
KR20190109916A (ko) Electronic device and server for processing data received from the electronic device
CN110945455A (zh) Electronic device for processing user utterances to control an external electronic device, and control method thereof
KR102369309B1 (ko) Electronic device for performing an operation according to user input after partial landing
US20230048330A1 (en) In-Vehicle Speech Interaction Method and Device
CN111554314A (zh) Noise detection method and apparatus, terminal, and storage medium
WO2023006033A1 (zh) Voice interaction method, electronic device, and medium
US11929081B2 (en) Electronic apparatus and controlling method thereof
JP2023120130A (ja) Conversational AI platform using extractive question answering
US20230341948A1 (en) Multimodal ui with semantic events
US20240185858A1 (en) Virtual assistant identification of nearby computing devices
KR102349681B1 (ko) Electronic device for acquiring and recording missing parameters

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933148

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020933148

Country of ref document: EP

Effective date: 20221115

NENP Non-entry into the national phase

Ref country code: DE