WO2021190225A1 - A voice interaction method and electronic device - Google Patents

A voice interaction method and electronic device

Info

Publication number
WO2021190225A1
Authority
WO
WIPO (PCT)
Prior art keywords
user, voice, electronic device, information, dialogue
Prior art date
Application number
PCT/CN2021/077514
Other languages
English (en)
French (fr)
Inventor
李伟国
钱莉
蒋欣
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP21774325.1A (published as EP4116839A4)
Publication of WO2021190225A1
Priority to US17/952,401 (published as US20230017274A1)

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
              • G06F16/33 Querying
                • G06F16/3331 Query processing
                  • G06F16/334 Query execution
                    • G06F16/3343 Query execution using phonetics
            • G06F16/90 Details of database functions independent of the retrieved data types
              • G06F16/903 Querying
                • G06F16/9032 Query formulation
                  • G06F16/90332 Natural language query formulation or dialogue systems
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
          • G10H1/00 Details of electrophonic musical instruments
            • G10H1/0091 Means for obtaining special acoustic effects
          • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
            • G10H2250/315 Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
              • G10H2250/455 Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L13/00 Speech synthesis; Text to speech systems
            • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
              • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
              • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
          • G10L15/00 Speech recognition
            • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
          • G10L17/00 Speaker identification or verification techniques
            • G10L17/04 Training, enrolment or model building
            • G10L17/18 Artificial neural networks; Connectionist approaches
          • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
              • G10L21/10 Transforming into visible information
                • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence technology and voice processing technology, and in particular to a voice interaction method and electronic equipment.
  • the existing smart devices can receive voice information (such as voice commands) sent by users, and perform operations corresponding to the voice information.
  • the aforementioned smart device may be a mobile phone, a smart robot, a smart watch, or a smart home device (such as a smart TV) and other devices.
  • the mobile phone can receive the voice command "turn down the volume" from the user, and then automatically lower the volume of the mobile phone.
  • Some smart devices can also provide voice interaction functions.
  • the intelligent robot can receive the user's voice information, and conduct a voice conversation with the user based on the voice information, thereby realizing the voice interaction function.
  • when an existing smart device conducts a voice conversation with a user, it can only give certain stock voice replies according to a preset voice mode. The interaction performance between smart devices and users is therefore poor, and such devices cannot provide users with a personalized voice interaction experience.
  • the present application provides a voice interaction method and electronic device to improve the performance of the interaction between the electronic device and the user, thereby providing the user with a personalized voice interaction experience.
  • the present application provides a voice interaction method.
  • the method may include: an electronic device receives first voice information sent by a second user, and in response to the first voice information, the electronic device recognizes the first voice information, where the first voice information is used to request a voice conversation with a first user. Based on the electronic device recognizing that the first voice information is the voice information of the second user, the electronic device can simulate the voice of the first user and conduct a voice dialogue with the second user in the manner in which the first user conducts a voice dialogue with the second user.
  • the electronic device may receive the first voice information and recognize that the first voice information was sent by the second user. Since the first voice information requests a voice conversation with the first user, the electronic device can recognize that the second user wants to have a voice conversation with the first user. The electronic device can then simulate the voice of the first user and intelligently conduct a voice dialogue with the second user according to the dialogue mode of the voice dialogue between the first user and the second user. In this way, the electronic device can simulate the first user and provide the second user with a realistic experience of a voice conversation with the first user. This voice interaction method improves the interaction performance of the electronic device and can provide users with a personalized voice interaction experience.
  • the above-mentioned dialogue manner is used to indicate the tone and wording of the voice dialogue between the first user and the second user.
  • the electronic device conducts a voice dialogue with the second user according to the dialogue mode of the first user and the second user. That is to say, the electronic device conducts the voice dialogue with the second user according to the tone and wording used when the first user talks with the second user. This provides the second user with a more realistic experience of a voice dialogue with the first user and improves the interaction performance of the electronic device.
  • the image information of the first user may be stored in the electronic device. Then, when the electronic device simulates the voice of the first user and conducts a voice dialogue with the second user according to the dialogue mode of the first user and the second user, the electronic device may also display the image information of the first user.
  • the electronic device can display images, and the electronic device saves the image information of the first user. Then the electronic device displays the image information of the first user when simulating the voice dialogue between the first user and the second user. In this way, when the electronic device simulates a voice conversation between the first user and the second user, the second user can not only hear the voice of the first user, but also see the image of the first user.
  • the face model of the first user may be stored in the electronic device. Then, when the electronic device simulates the voice of the first user and conducts a voice dialogue with the second user according to the dialogue mode of the first user and the second user, the electronic device can simulate the facial expressions of the first user during the voice dialogue with the second user in order to display the face model of the first user.
  • the facial expression of the first user in the face model displayed by the electronic device can be dynamically changed.
  • the electronic device displays the face model of the first user.
  • the face model displayed by the electronic device can be dynamically changed, so that the user thinks that he is in a voice conversation with the first user.
  • when the electronic device simulates the voice dialogue between the first user and the second user, the second user can not only hear the voice of the first user, but also see the facial expressions of the first user during the voice dialogue.
  • the above method may further include: the electronic device may also obtain second voice information, where the second voice information is voice information of a voice conversation between the first user and the second user.
  • the electronic device analyzes the second voice information, so that the voice features of the first user and the second user during a voice conversation can be obtained, and the voice features can be saved.
  • the voice features can include voiceprint features, tone features, and word features.
  • the tone feature is used to indicate the tone of voice when the first user and the second user are in a voice dialogue;
  • the word feature is used to indicate the usual vocabulary of the first user and the second user in the voice dialogue. It provides the second user with a more realistic communication experience of voice dialogue with the first user, which further improves the interactive performance of the electronic device.
  • the electronic device acquires second voice information, and the second voice information is voice information of a voice conversation between the first user and the second user.
  • the electronic device can analyze the voice features of the first user and the second user during a voice conversation according to the second voice information. In this way, when the electronic device simulates the dialogue mode of the voice dialogue between the first user and the second user, the electronic device can produce voice responses similar to those of the first user, thereby providing the user with a personalized voice interaction experience.
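  • As an illustration only (not part of the original application), the following Python sketch shows one way the analysis of the second voice information described above could be organized; the data structure, the crude feature computations, and the function name are assumptions, and a real system would use trained acoustic and speaker models.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VoiceFeatures:
    """Voice features of the first user when talking with the second user."""
    voiceprint: List[float] = field(default_factory=list)  # toy embedding
    tone: Dict[str, float] = field(default_factory=dict)   # e.g. rate, pitch proxy
    wording: Dict[str, int] = field(default_factory=dict)  # habitual vocabulary

def analyze_second_voice_information(samples: List[float],
                                     sample_rate: int,
                                     transcript: str) -> VoiceFeatures:
    """Very rough stand-ins for voiceprint, tone and word feature extraction."""
    # "Voiceprint": a fixed-length energy summary of the waveform; a real system
    # would use a speaker-embedding network instead.
    frame = max(1, len(samples) // 8)
    voiceprint = [sum(abs(s) for s in samples[i:i + frame]) / frame
                  for i in range(0, len(samples), frame)][:8]

    # "Tone": zero-crossing rate as a crude pitch proxy plus speaking rate.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    duration_s = len(samples) / sample_rate if sample_rate else 1.0
    tone = {"zero_crossing_rate": crossings / max(1, len(samples)),
            "words_per_second": len(transcript.split()) / max(duration_s, 1e-6)}

    # "Word features": habitual vocabulary counted from the dialogue transcript.
    wording = dict(Counter(w.lower().strip(",.?!") for w in transcript.split()))
    return VoiceFeatures(voiceprint, tone, wording)
```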
  • the electronic device may also save a voice dialogue record of the above-mentioned electronic device simulating the first user and the second user.
  • the electronic device can simulate the voice of the first user and conduct a voice conversation with the second user in the dialogue mode of the voice dialogue between the first user and the second user. Specifically, based on recognizing that the first voice information is the voice information of the second user, the electronic device simulates the voice of the first user and emits voice response information to the first voice information according to that dialogue mode. If, after emitting the voice response information to the first voice information, the electronic device receives third voice information and recognizes that the third voice information is voice information of the second user, the electronic device can again simulate the voice of the first user and emit voice response information to the third voice information according to the dialogue mode of the voice dialogue between the first user and the second user.
  • when the electronic device responds to the first voice information by simulating the dialogue between the first user and the second user, after receiving the third voice information it needs to recognize that the third voice information was uttered by the second user, and only then send response information in response to the third voice information. If other users in the environment around the electronic device and the second user are also sending voice information, recognizing that the third voice information was sent by the second user allows the electronic device to conduct the voice conversation with the second user more reliably, thereby improving the voice interaction function and enhancing the user experience.
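  • Purely as a hedged sketch of the multi-turn flow described above (not the application's own implementation), the loop below responds only to voice information recognized as coming from the second user; all callback names (listen, identify_speaker, transcribe, reply_as_first_user, speak_as_first_user) are hypothetical placeholders.

```python
from typing import Callable, Optional

def dialogue_loop(listen: Callable[[], Optional[bytes]],
                  identify_speaker: Callable[[bytes], str],
                  transcribe: Callable[[bytes], str],
                  reply_as_first_user: Callable[[str], str],
                  speak_as_first_user: Callable[[str], None],
                  second_user_id: str) -> None:
    """Simulate the first user while talking with the second user, turn by turn."""
    while True:
        audio = listen()                      # e.g. the third voice information
        if audio is None:                     # conversation ended
            break
        if identify_speaker(audio) != second_user_id:
            continue                          # another person spoke: do not respond
        text = transcribe(audio)
        response = reply_as_first_user(text)  # uses the stored dialogue mode
        speak_as_first_user(response)         # synthesized in the first user's voice
```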
  • the electronic device may obtain schedule information of the first user, and the schedule information is used to indicate the schedule of the first user.
  • when emitting the voice response information of the third voice information, the electronic device may refer to the schedule information.
  • the electronic device can directly respond to the third voice information according to the schedule information, so as to provide a personalized interactive experience for the first user.
  • the electronic device may save the record of the voice dialogue it conducts with the second user while simulating the voice of the first user, and the electronic device may also send the voice dialogue record to the electronic device of the first user.
  • the electronic device sends the voice conversation record to the electronic device of the first user, so that the first user can understand the content of the conversation.
  • the electronic device provides a more personalized voice interaction for the second user.
  • the electronic device saves a voice dialogue record of the electronic device simulating the first user and the second user, and the electronic device may also extract keywords in the voice dialogue from the voice dialogue record.
  • the electronic device may send the keyword to the electronic device of the first user.
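  • As a hedged illustration of keyword extraction from the saved dialogue record (the stop-word list and function name are assumptions, not taken from the application), one simple approach is frequency counting:

```python
from collections import Counter
from typing import List

STOP_WORDS = {"i", "you", "the", "a", "to", "and", "is", "it", "me", "my", "want"}

def extract_keywords(dialogue_record: List[str], top_k: int = 5) -> List[str]:
    """Pick the most frequent non-stop-words from the saved voice dialogue record."""
    counts = Counter(word.lower().strip(",.?!")
                     for line in dialogue_record
                     for word in line.split()
                     if word.lower().strip(",.?!") not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_k)]

# The extracted keywords (or the full record) could then be sent to the
# first user's electronic device, as described above.
```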
  • the electronic device simulates the voice of the first user, and interacts with the second user in a voice dialogue manner of the first user and the second user.
  • the electronic device may also obtain the image information and action information of the second user, and save the image information and action information of the second user.
  • the electronic device acquires image information and action information of the second user while simulating the voice dialogue between the first user and the second user, and can learn the expressions and actions of the second user in the voice dialogue with the first user, so that the electronic device can also simulate the dialogue mode of the second user with the first user.
  • the present application also provides an electronic device, which may include a memory, a voice module, and one or more processors.
  • the memory, the voice module, and one or more processors are coupled.
  • the microphone can be used to receive the first voice information.
  • the memory is used to store computer program codes.
  • the computer program codes include computer instructions.
  • when the processor executes the computer instructions, the processor is used to recognize the first voice information in response to the first voice information, where the first voice information is used to request a voice conversation with the first user.
  • based on recognizing that the first voice information is the voice information of the second user, the voice of the first user is simulated, and the voice dialogue with the second user is conducted in accordance with the dialogue mode of the voice dialogue between the first user and the second user.
  • the electronic device may further include a display screen, and the display screen is coupled with the processor.
  • the display screen is used to display image information of the first user.
  • the face model of the first user is stored in the electronic device.
  • the display screen in the electronic device is also used to display the face model while simulating the expressions of the first user in a voice dialogue with the second user, wherein the expression of the first user in the face model changes dynamically.
  • the microphone is also used to acquire second voice information, where the second voice information is voice information when the first user conducts a voice conversation with the second user.
  • the processor is also used to analyze the second voice information to obtain the voice characteristics of the voice conversation between the first user and the second user, and save the voice characteristics.
  • the voice features include voiceprint features, tone features, and word features.
  • the tone features are used to indicate the tone of a voice conversation between the first user and the second user, and the word features are used to indicate the habitual vocabulary of the first user and the second user in the voice conversation.
  • the processor is further configured to save, as second voice information, a voice dialogue record of the electronic device simulating the voice dialogue between the first user and the second user.
  • the microphone is also used to receive the third voice information.
  • the processor is further configured to, in response to the third voice information, recognize the third voice information. Based on the recognition that the third voice information is the voice information of the second user, the electronic device simulates the voice of the first user, and the speaker is used to emit the voice response information of the third voice information according to the dialogue mode of the voice dialogue between the first user and the second user.
  • the processor is further configured to obtain schedule information of the first user, and the schedule information is used to indicate the schedule of the first user.
  • emitting the voice response information of the third voice information includes: the electronic device referring to the schedule information and emitting the voice response information of the third voice information.
  • the processor is further configured to save the record of the voice dialogue conducted with the second user while the electronic device simulates the voice of the first user, and to send the voice dialogue record to the electronic device of the first user.
  • the processor is further configured to save a voice dialogue record of the electronic device simulating the voice dialogue between the first user and the second user, extract keywords from that voice dialogue record, and send the keywords to the electronic device of the first user.
  • the electronic device further includes a camera, which is coupled with the processor; the camera is used to obtain image information and action information of the second user, and the processor is also used to save the image information and action information of the second user .
  • the present application also provides a server, which may include a memory and one or more processors.
  • the memory is coupled to one or more processors.
  • the memory is used to store computer program code, and the computer program code includes computer instructions.
  • the server is caused to execute the method in the first aspect and any one of its possible implementation manners.
  • this application also provides a computer-readable storage medium, including computer instructions.
  • when the computer instructions run on an electronic device, the electronic device can execute the method in the first aspect and any one of its possible implementation manners.
  • this application also provides a computer program product, which when the computer program product runs on a computer, causes the computer to execute the method in the first aspect and any one of its possible implementation manners.
  • FIG. 1A is a system architecture diagram provided by an embodiment of this application.
  • FIG. 1B is a diagram of another system architecture provided by an embodiment of this application.
  • FIG. 2A is a schematic structural diagram of an electronic device provided by an embodiment of this application.
  • FIG. 2B is a schematic diagram of the software structure of an electronic device provided by an embodiment of this application.
  • FIG. 3A is a flowchart of a voice interaction manner provided by an embodiment of this application.
  • FIG. 3B is a schematic diagram of a display interface of a smart speaker provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a smart speaker provided by an embodiment of the application.
  • the terms "first" and "second" are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, the features defined with "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the present embodiment, unless otherwise specified, "plurality" means two or more.
  • a general electronic device with a voice interaction function can issue a corresponding voice response according to the recognized voice information.
  • however, such an electronic device cannot recognize which user sent the voice information; that is, when the electronic device performs the voice interaction function, once voice information is recognized, it issues a corresponding voice response.
  • the corresponding voice response sent by the electronic device is also fixed.
  • the voice interaction function of the electronic device enables the electronic device to have a voice dialogue with the user. If the electronic device can identify the user who issued the voice message, it can issue a targeted voice response according to that user. In this way, a personalized voice interaction experience can be provided to the user, thereby increasing the user's interest in using the electronic device for voice interaction.
  • electronic devices generally cannot "impersonate” other users.
  • “play” means that the electronic device simulates the voice of user 1 when interacting with user 2 by voice, and uses the dialogue mode of user 1 and user 2 to interact with user 2 by voice.
  • parents who need to go to work cannot communicate with their children at any time.
  • the electronic device can "play" the father or mother and the child's voice dialogue, to satisfy the child's idea of communicating with the parent. This enables electronic devices to provide children with more personalized and humanized voice interactions.
  • the embodiment of the present application provides a voice interaction method, which is applied to an electronic device.
  • the electronic device can "impersonate" user 1 and user 2 voice interaction.
  • the voice interaction performance of the electronic device is improved, and the user 2 can also be provided with a personalized interaction experience.
  • the electronic devices in the embodiments of the present application may be mobile phones, televisions, smart speakers, tablet computers, desktop computers, laptop computers, handheld computers, notebook computers, vehicle-mounted devices, ultra-mobile personal computers (UMPC), netbooks, as well as cellular phones, personal digital assistants (PDA), augmented reality (AR)/virtual reality (VR) devices, etc.
  • FIG. 1A is a system architecture diagram provided by an embodiment of this application.
  • the electronic device “plays" user 1 and interacts with user 2 by voice.
  • the electronic device can collect the voice information sent by the user 2.
  • the electronic device can interact with the remote server through the Internet and send the voice information of the user 2 to the server.
  • the server generates the response information corresponding to the voice information, and sends the generated information to the electronic device.
  • the electronic device is used to play the response information corresponding to the voice information, so as to achieve the purpose of "impersonating" the voice interaction between the user 1 and the user 2.
  • the electronic device can collect and recognize the voice information sent by the user 2 and can play the response information corresponding to the voice information.
  • the voice information of the user 2 is recognized through the server connected to the electronic device, and response information corresponding to the voice information is generated.
  • the electronic device plays the response information corresponding to the voice information, which can reduce the computing demand of the electronic device and reduce the production cost of the electronic device.
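  • As an illustrative sketch of this device-server division of work (the endpoint, header names and audio format are assumptions, not part of the application), the device could upload the collected voice information and play back whatever response audio the server returns:

```python
import urllib.request

SERVER_URL = "https://example.com/voice-interaction"  # hypothetical endpoint

def ask_server_for_response(audio_wav: bytes, requested_role: str) -> bytes:
    """Send user 2's voice information to the server and return the response audio.

    The server is assumed to identify the speaker, generate response text in
    user 1's dialogue mode, synthesize it in user 1's voice, and return audio
    that the electronic device then plays.
    """
    request = urllib.request.Request(
        SERVER_URL,
        data=audio_wav,
        headers={"Content-Type": "audio/wav",
                 "X-Requested-Role": requested_role},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return reply.read()
```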
  • FIG. 1B is another system architecture diagram provided by an embodiment of this application.
  • the electronic device “plays" user 1 and interacts with user 2 by voice.
  • the electronic device can collect the voice information sent by the user 2, and the electronic device recognizes that it is the voice information of the user 2 according to the voice information, and the voice information is a request to have a voice dialogue with the user 1.
  • the electronic device generates corresponding response information according to the voice information, and plays the response information.
  • the electronic device can implement voice interaction, which reduces the electronic device's dependence on the Internet.
  • FIG. 2A is a schematic structural diagram of an electronic device 200 according to an embodiment of this application.
  • the electronic device 200 may include a processor 210, an external memory interface 220, an internal memory 221, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, and a battery 242, antenna 1, antenna 2, mobile communication module 250, wireless communication module 260, audio module 270, sensor module 280, camera 293, display screen 294, and so on.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 200.
  • the electronic device 200 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 210 may include one or more processing units.
  • the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 200.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • a memory may also be provided in the processor 210 for storing instructions and data.
  • the memory in the processor 210 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 210. If the processor 210 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 210 is reduced, and the efficiency of the system is improved.
  • the processor 210 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is merely illustrative, and does not constitute a structural limitation of the electronic device 200.
  • the electronic device 200 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the external memory interface 220 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the electronic device 200.
  • the external memory card communicates with the processor 210 through the external memory interface 220 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 221 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 210 executes various functional applications and data processing of the electronic device 200 by running instructions stored in the internal memory 221.
  • the internal memory 221 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program (such as a sound playback function, an image playback function, etc.) required by at least one function, and the like.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 200.
  • the internal memory 221 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the charging management module 240 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 241 is used to connect the battery 242, the charging management module 240 and the processor 210.
  • the power management module 241 receives input from the battery 242 and/or the charging management module 240, and supplies power to the processor 210, the internal memory 221, the external memory, the display screen 294, the wireless communication module 260, and the audio module 270.
  • the wireless communication function of the electronic device 200 can be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, and the baseband processor.
  • the mobile communication module 250 can provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 200.
  • the mobile communication module 250 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • the mobile communication module 250 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 250 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic wave radiation via the antenna 1.
  • the wireless communication module 260 can provide wireless communication solutions applied to the electronic device 200, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 260 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 260 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 210.
  • the wireless communication module 260 may also receive the signal to be sent from the processor 210, perform frequency modulation, amplify, and convert it into electromagnetic waves and radiate it through the antenna 2.
  • the display screen 294 is used to display images, videos, and the like.
  • the display screen 294 includes a display panel.
  • the display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Miniled, a MicroLed, a Micro-oLed, a quantum dot light-emitting diode (QLED), and the like.
  • the electronic device 200 may include one or N display screens 294, and N is a positive integer greater than one.
  • the camera 293 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the electronic device 200 may include 1 or N cameras 293, and N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 200 selects the frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the electronic device 200 may support one or more video codecs. In this way, the electronic device 200 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • the electronic device 200 can implement audio functions through an audio module 270, a speaker 270A, a microphone 270B, and an application processor. For example, music playback, recording, etc.
  • the audio module 270 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 270 can also be used to encode and decode audio signals.
  • the audio module 270 may be provided in the processor 210, or part of the functional modules of the audio module 270 may be provided in the processor 210.
  • the speaker 270A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 200 can listen to music through the speaker 270A, or listen to a hands-free call.
  • the speaker 270A can play the response information of the voice information.
  • the microphone 270B is used to convert sound signals into electrical signals.
  • the user can make a sound with the mouth close to the microphone 270B, inputting the sound signal into the microphone 270B.
  • the microphone 270B can collect voice information uttered by the user.
  • the electronic device 200 may be provided with at least one microphone 270B.
  • the electronic device 200 may be provided with two microphones 270B, which can implement a noise reduction function in addition to collecting sound signals.
  • the electronic device 200 may also be provided with three, four or more microphones 270B to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the software system of the electronic device 200 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • an Android system with a layered architecture is taken as an example to illustrate the software structure of the electronic device 200.
  • FIG. 2B is a software structure block diagram of an electronic device 200 according to an embodiment of the present invention.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application program layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as camera, gallery, calendar, WLAN, voice conversation, Bluetooth, music, video, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the data may be the voiceprint characteristics of user 2, the relationship between user 2 and user 1, and so on.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • the phone manager is used to provide the communication function of the electronic device 200. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message notification, Bluetooth pairing success notification, etc.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window. For example, text messages are prompted in the status bar, prompt sounds, electronic devices vibrate, and indicator lights flash.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • FIG. 3A is a flowchart of a voice interaction method provided by an embodiment of this application.
  • the following takes the case in which the electronic device is a smart speaker and the smart speaker "plays" user 1 in a voice dialogue with user 2 as an example to illustrate the voice interaction method in detail.
  • the method includes step 301-step 304.
  • Step 301 User 2 sends the first voice message to the smart speaker.
  • the first voice information is used to request the smart speaker to "play" the voice conversation between user 1 and himself (user 2).
  • the parents in the family go to work, and the children need the company of their parents at home. If a child wants to have a voice conversation with a parent, the child can send a voice message to the smart speaker at home, requesting the smart speaker to "play" the father or mother to keep the child company.
  • the first voice message may be "I want to talk to Dad" or "Speaker speaker, I want to talk to Dad".
  • smart speakers need to be awakened to work, and the wake-up voice of smart speakers can be fixed.
  • user 2 may first send a wake-up word to the smart speaker, so that the smart speaker is in an awake state.
  • the wake-up word can be "speaker speaker”, “smart speaker” or “voice speaker”, etc.
  • the wake word can be pre-configured in the smart speaker, or can be set by the user in the smart speaker.
  • the above-mentioned first voice information may not include a wake-up word.
  • the first voice message may be "I want to talk to Dad".
  • the first voice message may include the above-mentioned wake-up word, and may also include a voice command sent by user 2 to the smart speaker.
  • the first voice message may be "Speaker box, I want to talk to Dad".
  • Step 302 The smart speaker receives the first voice message sent by the user 2.
  • the smart speaker is in a dormant state when it is not awakened.
  • the voice wake-up process may include: the smart speaker monitors voice data through a low-power digital signal processor (Digital Signal Processing, DSP).
  • when the DSP detects that the similarity between the voice data and the above wake-up words meets certain conditions, the DSP delivers the monitored voice data to an application processor (Application Processor, AP).
  • the AP performs text verification on the above voice data to determine whether the voice data can wake up the smart speaker.
  • when the smart speaker is in the dormant state, it can monitor the voice information sent by the user at any time. If the voice information is not a wake-up voice for waking up the smart speaker itself, the smart speaker will neither respond to the voice information nor record it.
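  • As a hedged illustration of this two-stage wake-up flow (the thresholds, wake words and function names are assumptions; a real DSP stage would work on acoustic features rather than text), the pre-filter and the stricter verification could look like this:

```python
from difflib import SequenceMatcher

WAKE_WORDS = ("speaker speaker", "smart speaker", "voice speaker")

def dsp_prefilter(heard_text: str, threshold: float = 0.6) -> bool:
    """Stage 1 (low-power DSP analogue): cheap similarity check against wake words."""
    return any(SequenceMatcher(None, heard_text.lower(), w).ratio() >= threshold
               for w in WAKE_WORDS)

def ap_verify(heard_text: str) -> bool:
    """Stage 2 (AP analogue): stricter text verification before waking the speaker."""
    return heard_text.lower().strip() in WAKE_WORDS

def should_wake(heard_text: str) -> bool:
    # Only input that passes the cheap pre-filter is handed to the stricter check,
    # mirroring the DSP-to-AP hand-over described above.
    return dsp_prefilter(heard_text) and ap_verify(heard_text)
```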
  • if the smart speaker is already in the awake state, the first voice information may not include the wake-up word of the smart speaker; the smart speaker receives the first voice information and responds to the first voice information.
  • if the first voice information includes the wake-up word of the smart speaker, the smart speaker is awakened upon receiving the first voice information and responds to the first voice information.
  • Step 303 The smart speaker recognizes the first voice information in response to the first voice information; and determines that the first voice information is used to request a voice dialogue with the user 1.
  • the smart speaker can perform text recognition on the first voice information, and according to the result of the text recognition, determine that the first voice information is used to request a voice dialogue with user 1. That is, the first voice information includes the name or title of the role to be "played” by the smart speaker, so that the smart device can identify the role to be “played” based on the name or title.
  • the smart speaker recognizes the name in the first voice message, it can determine the role to be "played”. For example, if the first voice message is "I want to talk to Li Ming", the smart speaker can determine that the role to be “played” is Li Ming.
  • the smart speaker recognizes the title in the first voice information, it can be determined that it is user 2 who issued the first voice information, and the smart speaker determines the role to be "played” according to the relationship between the user 2 and the title in the first voice information.
  • the relationship of family members can be pre-stored in the smart speaker. Then, after the smart speaker receives the first voice information, it can determine the role to be "played" according to the relationship of the family members.
  • Example 1: The first voice message received by the smart speaker from user 2 is "I want to talk to Dad". After recognizing the first voice message, the smart speaker recognizes the title "Dad" and can determine that user 2 and the role to be "played" have a parent-child relationship. The smart speaker can recognize that the first voice message was sent by the child "Li Xiaoming", and based on the father-son relationship between Li Xiaoming and Li Ming in the pre-stored family member relationships, it determines that the role to be "played" is "Li Ming" (Dad).
  • Example 2: The first voice message received by the smart speaker from user 2 is "I want to speak with Li Ming", and the smart speaker recognizes the name "Li Ming" included in the first voice message. The smart speaker can also recognize that the first voice message was sent by "Li Xiaoming". Based on the pre-stored family member relationships, the smart speaker determines that Li Ming and Li Xiaoming have a father-son relationship, and determines that the role to be "played" is Li Xiaoming's "Dad" (Li Ming).
  • when the smart speaker is initially set up, the relationships between members of the family need to be entered into the smart speaker.
  • the smart speaker only needs to acquire the relationship between a family character and another known family character, and the smart speaker can infer the relationship between the family character and other family characters.
  • for example, the family members include: grandpa, grandma, dad, mom, and the child. If grandpa, grandma, and mom have already been entered, then when dad is entered it is only necessary to indicate that dad and mom are husband and wife.
  • Smart speakers can infer from the relationship between mom and dad that dad and grandpa are father-son relationships, and that dad and grandma are mother-child relationships.
  • the above-mentioned reasoning can be realized through technologies such as knowledge graphs.
  • the pre-stored family member information may include: name, age, gender, communication method, voice information, image information, preferences and personality, and so on.
  • in addition to recording the relationship information between each family member and the existing family members when recording the information of each member, it is also possible to record the titles of the member, such as "Dad", "Grandpa", "Old Man Li" and "Mr. Li". Among them, "Dad" and "Mr. Li" both refer to Li Ming; "Grandpa" and "Old Man Li" both refer to Li Ming's father. For example, if Li Xiaoming is user 2 and the first voice message is "I want to talk to Mr. Li" or "I want to talk to Dad", the smart speaker can determine that the person to "play" is Dad, Li Ming.
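  • A minimal Python sketch of this role resolution (assuming a tiny pre-stored family-relationship table; the names, titles and function are illustrative only, not the application's implementation) could be:

```python
from typing import Dict, Optional, Set, Tuple

# Pre-stored family member information: each member's known titles/aliases.
MEMBERS: Dict[str, Set[str]] = {
    "Li Ming":     {"dad", "mr. li"},
    "Li Xiaoming": {"son"},
}
# RELATIONS[(person, relative)] = the title the relative uses for that person.
RELATIONS: Dict[Tuple[str, str], str] = {("Li Ming", "Li Xiaoming"): "dad"}

def resolve_role(utterance: str, speaker: str) -> Optional[str]:
    """Determine the role to be "played" from a name or title in the request."""
    text = utterance.lower()
    # 1. Direct name match, e.g. "I want to speak with Li Ming".
    for name in MEMBERS:
        if name.lower() in text:
            return name
    # 2. Title match resolved through the speaker's relations, e.g. "Dad".
    for (person, relative), title in RELATIONS.items():
        if relative == speaker and title in text:
            return person
    return None

print(resolve_role("I want to talk to Dad", speaker="Li Xiaoming"))  # Li Ming
```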
  • Step 304 Based on the recognition that the first voice information is the voice information of the user 2, the smart speaker can simulate the voice of the user 1 and send out the response information of the first voice information according to the dialogue mode of the voice dialogue between the user 1 and the user 2.
  • the smart speaker recognizes that the first voice information is voice information uttered by user 2, and the smart speaker can generate the response information of the first voice information according to the dialogue mode of user 1 and user 2. In other words, the smart speaker can "play" user 1 and infer how user 1 might respond after hearing the first voice message sent by user 2.
  • the voice of user 1 and the dialogue mode of user 1 and user 2 are pre-stored in the smart speaker.
  • the smart speaker follows the dialogue mode of user 1 and user 2, and simulates the voice of user 1 to send out voice information, so that user 2 thinks it is really in a voice dialogue with user 1. So as to provide user 2 with a personalized voice interaction experience.
  • the smart speaker can analyze user 1’s voice, including analyzing user 1’s voiceprint characteristics. Among them, each person's voiceprint characteristics are unique, so the speaker can be identified according to the voiceprint characteristics in the voice.
  • the smart speaker analyzes the voice of user 1 and saves the voiceprint characteristics of user 1, so that the smart speaker can simulate the voice of user 1 when it wants to "play" user 1.
  • the smart speaker receives the voice information of user 1, and can analyze the voiceprint characteristics of user 1.
  • the smart speaker saves the voiceprint characteristics of user 1, so that when the smart speaker determines to "play" user 1, it can simulate the voice of user 1 according to the stored voiceprint characteristics of user 1. It is understandable that when the smart speaker has a voice conversation with user 1, the voiceprint feature of user 1 can be updated according to the change of user 1's voice. Or, as time changes, the smart speaker can update the voiceprint characteristics of user 1 when talking to user 1 after an interval of a preset time.
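  • The following is only a schematic sketch of voiceprint enrolment, gradual updating and speaker identification (the cosine-similarity matching, the update factor and the function names are assumptions; real voiceprints come from trained speaker models):

```python
import math
from typing import Dict, List, Optional

enrolled: Dict[str, List[float]] = {}  # user name -> stored voiceprint embedding

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def enroll(user: str, voiceprint: List[float], alpha: float = 0.1) -> None:
    """Save the user's voiceprint; later calls update it as the voice changes over time."""
    if user in enrolled:
        enrolled[user] = [(1 - alpha) * old + alpha * new
                          for old, new in zip(enrolled[user], voiceprint)]
    else:
        enrolled[user] = list(voiceprint)

def identify(voiceprint: List[float], threshold: float = 0.8) -> Optional[str]:
    """Return the enrolled user whose stored voiceprint is most similar, if close enough."""
    best = max(enrolled, key=lambda u: cosine(enrolled[u], voiceprint), default=None)
    if best is not None and cosine(enrolled[best], voiceprint) >= threshold:
        return best
    return None
```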
  • the dialogue mode of user 1 and user 2 can reflect the language expression characteristics of user 1, and the dialogue mode of user 1 and user 2 includes the tone and wording of the voice dialogue of user 1 and user 2.
  • different people may have different tones. For example, a person speaks softly when communicating with his lover, and respectful when communicating with elders in the family. Therefore, the smart speaker can infer the tone of the user 1 to be played based on the relationship between the role to be "played" and the user 2.
  • the words used in the voice conversation between user 1 and user 2 can also reflect the language expression characteristics of user 1, so that the response information of the first voice information generated by the smart speaker according to the words used in the voice dialogue between user 1 and user 2 is closer to user 1.
  • the smart speaker can simulate the voice of the user 1 and send out the response information of the first voice message according to the dialogue mode of the user 1 and the user 2, so that the user 2 thinks that the user 1 is in a voice dialogue.
  • the dialogue mode of the voice dialogue between user 1 and user 2 may include: the tone of user 1 and user 2 during the dialogue, wording habits (for example, pet phrases), speech expression habits, and so on.
  • the tone of the dialogue between user 1 and user 2 includes serious, gentle, stern, slow, and aggressive.
  • the habit of using words is the characteristic of a person's language expression when speaking. For example, when speaking, he is used to using words such as "then”, “is”, “yes”, “understand” and so on.
  • Voice expression habits can reflect the characteristics of a person's language expression. For example, some people like to say inverted sentences, such as "Have you eaten?", "Then I will go first" and so on.
  • For example, a voice dialogue between user 1 and user 2 may be pre-stored in the smart speaker, and the smart speaker can learn from this voice dialogue. It learns the tone, wording habits, and expression characteristics of user 1 in the dialogue with user 2, and saves what it has learned in that character's dialogue information. A character's dialogue information may also store dialogue information about that character's conversations with other characters. If the smart speaker then receives user 2's request for it to "play" user 1, it can carry on the voice dialogue according to the stored dialogue mode between user 1 and user 2.
  • It is understandable that the more voice dialogues between user 1 and user 2 the smart speaker obtains, the more accurately it can learn their dialogue mode, and the closer its responses will be to the replies user 1 would actually give. Similarly, the smart speaker can also learn, from the same voice dialogues, the dialogue mode that user 2 uses with user 1, and store that dialogue mode as user 2's information in user 2's dialogue information.
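  • As one way to picture this learning step, the sketch below counts how often a speaker uses candidate catchphrases across stored dialogue turns and records the result as part of that character's dialogue information. The data layout, the catchphrase list, and the naive substring counting are assumptions made purely for illustration and are not the learning method described in this application.

```python
from collections import Counter

# Assumed toy layout: each stored turn is (speaker, utterance text).
dialogue_log = [
    ("user1", "Then remember to finish your homework, understand?"),
    ("user1", "Yes, that is exactly what I meant."),
    ("user2", "OK, I understand."),
]

CANDIDATE_CATCHPHRASES = ["then", "that is", "yes", "understand"]  # assumed examples


def learn_word_habits(log, speaker):
    """Count a speaker's candidate catchphrases; frequent ones become part of
    that character's dialogue information."""
    counts = Counter()
    for who, text in log:
        if who != speaker:
            continue
        lowered = text.lower()
        for phrase in CANDIDATE_CATCHPHRASES:
            counts[phrase] += lowered.count(phrase)
    return {phrase: n for phrase, n in counts.items() if n > 0}


dialogue_info = {
    "user1": {
        "word_habits": learn_word_habits(dialogue_log, "user1"),
        "tone": "gentle",  # in practice this would also be learned from the dialogues
    }
}
print(dialogue_info["user1"]["word_habits"])
```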
  • As another example, if no voice dialogue between user 1 and user 2 has been stored, the smart speaker can infer the likely tone of user 1 from the relationship between user 1 and user 2. For example, if the smart speaker recognizes that user 1 and user 2 are father and son and it needs to "play" the father, it can default user 1's tone to stern.
  • In other words, the smart speaker infers the tone of user 1's voice responses from the relationship between user 1 and user 2.
  • The smart speaker may infer more than one tone for user 1.
  • For example, if the smart speaker determines that user 1 and user 2 are grandparent and grandchild, the tones it infers for user 1 may be doting, unhurried, and cheerful.
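  • A trivial way to visualize this fallback is a lookup from the played role and its relationship to the listener to one or more default tones, preferring tones learned from stored dialogues when they exist. The table entries and function below are illustrative assumptions only, not values or interfaces defined in this application.

```python
# Assumed default tones keyed by (role being played, listener's relation to that role).
DEFAULT_TONES = {
    ("father", "son"): ["stern", "serious"],
    ("grandparent", "grandchild"): ["doting", "unhurried", "cheerful"],
    ("mother", "son"): ["gentle"],
}


def infer_tones(played_role: str, relationship: str, learned_tones=None):
    """Prefer tones learned from stored dialogues; otherwise fall back to the
    defaults implied by the family relationship."""
    if learned_tones:
        return learned_tones
    return DEFAULT_TONES.get((played_role, relationship), ["neutral"])


print(infer_tones("grandparent", "grandchild"))  # ['doting', 'unhurried', 'cheerful']
```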
  • In some embodiments, the smart speaker has a display screen. When the smart speaker "plays" user 1 in a voice conversation with user 2, a photo of user 1 can be shown on the display screen. As shown in FIG. 3B, a photo of user 1 is displayed on the display screen of the smart speaker. Alternatively, if a face model of user 1 is stored in the smart speaker, the smart speaker can display dynamically changing facial expressions of user 1 on the display screen while "playing" user 1 in the voice conversation with user 2.
  • In addition, when the smart speaker "plays" user 1 in the voice interaction with user 2, it can also turn on its camera to obtain image information of user 2.
  • The smart speaker recognizes the acquired image information of user 2, that is, it obtains information such as user 2's appearance and actions. In this way, the smart speaker can build a character model of user 2 from user 2's voice information and image information.
  • With a character model of user 2 established, the smart speaker can be more lifelike and vivid when "playing" user 2 in the future.
  • For example, when the smart speaker "plays" user 1 in the voice interaction with user 2, it can also turn on the camera to capture user 2's expressions, actions, and so on. The smart speaker can then build a character model of user 2 from user 2's voice information and image information and determine user 2's actions and expressions when talking with user 1.
  • Suppose that, during the voice interaction between user 2 and the smart speaker, the smart speaker receives voice information in which user 2 asks about user 1's schedule.
  • The smart speaker can obtain user 1's schedule information, which indicates user 1's schedule.
  • In this way, the smart speaker can respond to the schedule inquiry according to user 1's schedule information.
  • For example, the smart speaker "plays" Li Ming in a voice dialogue with his son Li Xiaoming, and Li Xiaoming sends a voice message asking about his father's schedule, such as "Are you coming to my graduation ceremony on Friday?"
  • The smart speaker checks the schedule information of user 1 (that is, Dad) and finds that Dad's schedule shows a business trip on Friday.
  • The smart speaker can then reply, "Son, Dad has just received a notice from the company that he has to travel to Beijing for an important meeting, and may not be able to attend your graduation ceremony on Friday."
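  • The schedule lookup in this example can be pictured with the short sketch below. The schedule format, the date check, and the reply templates are assumptions introduced for illustration; they are not the mechanism claimed in this application.

```python
import datetime

# Assumed schedule format for user 1 (Dad): date -> event description.
dad_schedule = {
    datetime.date(2020, 3, 27): "business trip to Beijing for an important meeting",
}


def answer_schedule_question(asked_date: datetime.date) -> str:
    """Check user 1's schedule for the asked date and phrase the reply as the
    simulated user 1 would."""
    event = dad_schedule.get(asked_date)
    if event is None:
        return "Son, Dad is free that day and will be there."
    return f"Son, Dad has a conflict that day: a {event}. He may not be able to make it."


print(answer_schedule_question(datetime.date(2020, 3, 27)))
```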
  • It is worth mentioning that the smart speaker can also save the dialogue information from each "playing" session. In the next "playing" session, if related schedule information is involved, the smart speaker can feed the updated schedule back to user 2.
  • For another example, consider the situation after the above voice dialogue, in which the smart speaker "played" Dad with Li Xiaoming, has ended.
  • The smart speaker then "plays" Xiaoming in a voice conversation with Xiaoming's mother (user 2).
  • The voice message sent by Xiaoming's mother is "Son, your dad and I will go with you to the graduation ceremony on Friday."
  • Based on the previous voice dialogue in which it "played" Dad, the smart speaker can reply, "Dad said he needs to travel to Beijing for a meeting, so he cannot attend my graduation ceremony."
  • It should be noted that the above steps 301 to 304 constitute one round of conversation between user 2 and the smart speaker.
  • After step 304, the smart speaker can continue the voice conversation with user 2. For example, user 2 sends voice information to the smart speaker again, and after receiving it, the smart speaker acts on the basis that this voice information was sent by user 2.
  • The smart speaker then continues to simulate user 1's voice and talk with user 2 according to the dialogue mode of user 1 and user 2. In other words, only after the smart speaker again receives voice information from user 2 will it simulate user 1's voice and respond in that dialogue mode; if a voice message was not sent by user 2, the smart speaker may not simulate user 1's voice.
  • In some embodiments, each time the smart speaker responds to user 2's voice information, it can wait for a preset time.
  • The preset waiting time allows for user 2's reaction time so that the smart speaker can keep the voice dialogue with user 2 going. If no voice message from user 2 is received within the preset time, the smart speaker can end the voice conversation.
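  • This waiting behavior can be sketched as a simple loop that ends the session once no utterance from user 2 arrives within the preset time. The timeout value and the listening interface below are placeholders assumed for illustration.

```python
import time

PRESET_WAIT_SECONDS = 20  # assumed reaction-time budget for user 2


def listen_for_user2(timeout: float):
    """Hypothetical stand-in for the microphone front end: return the next
    utterance confirmed to come from user 2 within `timeout` seconds, else None."""
    time.sleep(timeout)
    return None


def dialogue_session(respond_as_user1):
    """Keep the conversation alive while user 2 keeps talking; end it after one
    silent preset interval."""
    while True:
        utterance = listen_for_user2(PRESET_WAIT_SECONDS)
        if utterance is None:
            break  # no reply from user 2 within the preset time: end the dialogue
        respond_as_user1(utterance)
```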
  • For example, if the smart speaker determines that the voice dialogue with user 2 is over, the content of the voice dialogue can be sent to the electronic device of user 1, so that user 1 can learn the details of the dialogue in which the smart speaker "played" him (user 1) with user 2.
  • Alternatively, when the smart speaker determines that the voice dialogue with user 2 is over, it may generate a summary of the voice dialogue and send the summary to the electronic device of user 1. This lets user 1 get a quick overview of the conversation in which the smart speaker "played" him (user 1) with user 2.
  • In one embodiment, the smart speaker may send the summary of the voice dialogue to the electronic device of user 1 after a preset time has elapsed since the voice dialogue ended.
  • For example, user 2 is Xiaoming's mother, and the user 1 played by the smart speaker is Xiaoming. Before going out to buy groceries, Xiaoming's mother says to the smart speaker, "Mom is going out to buy groceries. You must finish your homework before you can watch TV." Later, user 2 is Xiaoming's grandmother, and the user 1 played by the smart speaker is again Xiaoming.
  • Before going out for a walk, Xiaoming's grandmother says to the smart speaker, "Grandma has gone for a walk. I left you a cake in the refrigerator, remember to eat it."
  • After the preset time has elapsed, the smart speaker extracts and consolidates the conversations between the different characters and Xiaoming, and then generates a dialogue summary.
  • For example, the summary can be "Mom reminds you to finish your homework on time, and Grandma left you a cake in the refrigerator."
  • The smart speaker can send this conversation summary to Xiaoming's mobile phone through a communication channel (such as a text message).
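  • A minimal sketch of producing and delivering such a summary is given below. The one-line-per-character selection is a naive heuristic and the send_sms function is a hypothetical placeholder; neither is the summarization or messaging mechanism of this application.

```python
# Assumed record of what each character said to Xiaoming while being "played".
session_notes = {
    "Mom": "finish your homework before you watch TV",
    "Grandma": "there is a cake for you in the refrigerator",
}


def build_summary(notes: dict) -> str:
    """Join one reminder per character into a single short summary."""
    return " ".join(f"{who} reminds you: {msg}." for who, msg in notes.items())


def send_sms(phone_number: str, text: str) -> None:
    """Hypothetical delivery hook; a real device would call its SMS or messaging service."""
    print(f"SMS to {phone_number}: {text}")


send_sms("<Xiaoming's phone>", build_summary(session_notes))
```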
  • Through the above approach, the smart speaker can recognize that the first voice information was sent by user 2 and that the first voice information instructs the smart speaker to "play" user 1.
  • In response to the first voice information, the smart speaker can simulate the voice of user 1 and send out the response information for the first voice information in the dialogue mode of user 1 and user 2, thereby achieving the purpose of the smart speaker "playing" user 1 in a voice dialogue with user 2.
  • This voice interaction method improves the interaction performance of the smart speaker, and can provide the user 2 with a personalized voice interaction experience.
  • It can be understood that, to implement the above functions, the smart speaker includes a corresponding hardware structure and/or software module for each function.
  • The embodiments of the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a given function is performed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be regarded as going beyond the scope of the embodiments of the present application.
  • In the embodiments of the present application, the smart speaker may be divided into functional modules according to the foregoing method examples.
  • For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software function modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • As shown in FIG. 4, the smart speaker may include: a speech recognition module 401, a relational reasoning module 402, a role-playing module 403, a knowledge pre-storage module 404, a role information knowledge base 405, and an audio module 406.
  • Optionally, the smart speaker may also include a camera module, a communication module, and a sensor module.
  • The speech recognition module 401 is used to recognize the first voice information received by the smart speaker.
  • The relational reasoning module 402 is used to infer the relationship between a newly entered character and the existing family characters based on the existing family relationships.
  • The role-playing module 403 is used by the smart speaker to simulate the voice of user 1 and send out the response information corresponding to the first voice information.
  • The knowledge pre-storage module 404 is used to store each user's information, so that the role-playing module 403 can obtain that user information and generate response information corresponding to the voice information accordingly.
  • The role information knowledge base 405 is used to store the users' dialogue information, from which the response information for the first voice information can be generated.
  • In some embodiments, the smart speaker may also include a summary module.
  • The summary module is used to extract keywords from the dialogue information and use them as a summary of the dialogue information, or to condense the dialogue information into a summary.
  • The summary module can send the summary of the dialogue information to the smart device of the user 1 being "played" by the smart speaker.
  • Alternatively, the communication module in the smart speaker sends the keywords extracted from the dialogue information by the summary module to the smart device of the user 1 being "played" by the smart speaker.
  • Of course, the unit modules in the above smart speaker include, but are not limited to, the above speech recognition module 401, relational reasoning module 402, role-playing module 403, knowledge pre-storage module 404, role information knowledge base 405, audio module 406, and so on.
  • For example, the smart speaker may also include a storage module.
  • The storage module is used to store the program code and data of the electronic device.
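  • To make the division of labor among these modules easier to follow, the skeleton below wires illustrative stand-ins for the modules of FIG. 4 into a single request path. The class and method names are assumptions chosen for readability; they do not correspond to interfaces actually defined in this application.

```python
class SpeechRecognitionModule:        # cf. module 401
    def recognize(self, audio):
        """Return (speaker, text, requested_role) for the received voice information."""
        ...


class RelationReasoningModule:        # cf. module 402
    def infer_relationship(self, speaker, requested_role):
        """Infer how the requested role is related to the speaker."""
        ...


class KnowledgePreStorageModule:      # cf. module 404
    def voiceprint_of(self, role):
        """Return the stored voiceprint characteristics of the role to be played."""
        ...


class RoleInfoKnowledgeBase:          # cf. module 405
    def dialogue_mode(self, role, listener):
        """Return the stored tone and wording habits for this role-listener pair."""
        ...


class RolePlayingModule:              # cf. module 403
    def respond(self, text, role, mode):
        """Generate the response text in the role's dialogue mode."""
        ...


class AudioModule:                    # cf. module 406
    def speak(self, text, voiceprint):
        """Synthesize the response in the simulated voice of the played role."""
        ...


def handle_request(audio, asr, reasoner, store, kb, player, tts):
    speaker, text, role = asr.recognize(audio)
    reasoner.infer_relationship(speaker, role)
    reply = player.respond(text, role, kb.dialogue_mode(role, speaker))
    tts.speak(reply, store.voiceprint_of(role))
```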
  • An embodiment of the present application also provides a computer-readable storage medium that stores computer program code. When the above processor executes the computer program code, the smart speaker can perform the relevant method steps in FIG. 3A to implement the method in the above embodiments.
  • An embodiment of the present application also provides a computer program product.
  • When the computer program product runs on a computer, the computer is caused to execute the relevant method steps in FIG. 3A to implement the method in the foregoing embodiments.
  • The smart speaker, computer storage medium, and computer program product provided in the embodiments of the present application are all used to execute the corresponding methods provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above; details are not repeated here.
  • In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • For example, the division into modules or units is only a logical function division; in actual implementation there may be other ways to divide them. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

本申请实施例提供一种语音交互方法及电子设备,涉及人工智能AI技术领域和语音处理技术领域,可以提高电子设备与用户交互的性能,从而为用户提供个性化的语音交互体验。具体方案包括:电子设备可以接收第二用户发出的第一语音信息;并响应于该第一语音信息,电子设备识别该第一语音信息。其中,第一语音信息用于请求与第一用户进行语音对话。基于电子设备识别第一语音信息是第二用户的语音信息,电子设备可以模拟第一用户的声音,并且按照第一用户与第二用户进行语音对话的方式,与第二用户进行语音对话。该方法可应用于通过智能机器人代替父母实现陪伴和教育儿童的场景中。

Description

一种语音交互方法及电子设备
本申请要求于2020年3月27日提交国家知识产权局、申请号为202010232268.3、发明名称为“一种语音交互方法及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能技术领域和语音处理技术领域,尤其涉及一种语音交互方法及电子设备。
背景技术
现有的智能设备大多可以接收用户发出的语音信息(如语音命令),并执行该语音信息对应的操作。示例性的,上述智能设备可以是手机、智能机器人、智能手表或者智能家居设备(如智能电视机)等设备。例如,手机可以接收用户发出的语音命令“调低音量”,然后自动调低手机的音量。
一些智能设备还可以提供语音交互功能。例如,智能机器人可以接收用户的语音信息,并根据该语音信息与用户进行语音会话,从而实现语音交互功能。但是,现有的智能设备与用户进行语音会话时,只能按照设定的语音模式给出一些模式化的语音回复。智能设备与用户的交互性能较差,无法为用户提供个性化的语音交互体验。
发明内容
本申请提供一种语音交互方法及电子设备,提高电子设备与用户交互的性能,从而为用户提供个性化的语音交互体验。
为实现上述技术目的,本申请采用如下技术方案:
第一方面,本申请提供了一种语音交互方法,该方法可以包括:电子设备可以接收第二用户发出的第一语音信息;并响应于第一语音信息,电子设备识别该第一语音信息。其中,第一语音信息用于请求与第一用户进行语音对话。基于电子设备识别第一语音信息是第二用户的语音信息,电子设备可以模拟第一用户的声音,并且按照第一用户与第二用户进行语音对话的方式,与第二用户进行语音对话。
上述方案中,电子设备可以接收到第一语音信息,并识别出该第一语音信息是第二用户发出的。由于第一语音信息是请求与第一用户进行语音对话,则电子设备可以识别出第一语音信息是第二用户想和第一用户语音对话。这样,电子设备可以模拟第一用户的声音,按照第一用户与第二用户进行语音对话的对话方式,智能的与第二用户进行语音对话。如此,电子设备便可以模拟第一用户,为第二用户提供与第一用户进行真实语音对话的交流体验。这种语音交互方式提高了电子设备的交互性能,并且可以为用户提供个性化的语音交互体验。
在一种可能的实施方式中,上述对话方式用于指示第一用户与第二用户进行语音对话的语气和用词。
电子设备按照第一用户与第二用户进行语音对话的对话方式,与第二用户进行语音对话。也就是说,电子设备按照第一用户与第二用户对话时的语气和用词,与第一 用户进行语音对话。为第二用户提供更真实与第一用户语音对话的交流体验,提高了电子设备的交互性能。
另一种可能的实施方式中,电子设备中可以保存有第一用户的图像信息。那么,在电子设备模拟第一用户的声音,并按照第一用户与第二用户的对话方式,与第二用户进行语音对话时,电子设备还可以显示第一用户的图像信息。
如果电子设备可以显示图像,并且,电子设备中保存有第一用户的图像信息。那么电子设备在模拟第一用户和第二用户进行语音对话时,显示出第一用户的图像信息。如此,电子设备模拟第一用户与第二用户进行语音对话时,第二用户不仅可以听到第一用户的声音,还可以看到第一用户的图像。通过本方案,可以为用户提供类似于与第一用户面对面语音对话的交流体验。
另一种可能的实施方式中,电子设备中可以保存有第一用户的人脸模型。那么,在电子设备模拟第一用户的声音,并按照第一用户与第二用户的对话方式,与第二用户进行语音对话时,电子设备可以模拟第一用户与第二用户进行语音对话的表情,显示第一用户的人脸模型。其中,电子设备显示的人脸模型中第一用户的表情可以动态变化的。
如果电子设备中保存有第一用户的人脸模型,电子设备模拟第一用户与第二用户进行语音交互时,电子设备显示第一用户的人脸模型。而且,电子设备显示的人脸模型可以动态变化,使得用户以为在和第一用户语音对话。如此,电子设备模拟第一用户与第二用户进行语音对话时,第二用户可以在和第一用户语音对话时不仅能听见第一用户的声音,还能看见第一用户的面部表情。通过本方案,可以为用户提供更真实的与第一用户面对面语音对话的体验。
另一种可能的实施方式中,电子设备在接收到第一语音信息之前,上述方法还可以包括:该电子设备还可以获取第二语音信息,第二语音信息是第一用户与第二用户进行语音对话时的语音信息。电子设备分析获取到第二语音信息,从而可以得到第一用户和第二用户进行语音对话时的语音特征,并保存该语音特征。
可以理解的,语音特诊可以包括声纹特征、语气特征和用词特征。语气特征用于指示第一用户与第二用户进行语音对话时语气;用词特征用于指示第一用户与第二用户进行语音对话时的惯用词汇。为第二用户提供更真实与第一用户语音对话的交流体验,进一步提高了电子设备的交互性能。
其中,电子设备模拟第一用户和第二用户语音交互之前,电子设备获取第二语音信息,第二语音信息是第一用户和第二用户语音对话的语音信息。电子设备可以根据该第二语音信息分析第一用户和第二用户语音对话时的语音特征。这样一来,电子设备模拟第一用户与第二用户语音对话的对话方式时,电子设备可以发出和第一用户相似的语音对话,从而为用户提供个性化的语音交互体验。
另一种可能的实施方式中,电子设备在第二语音信息中,还可以保存上述电子设备模拟第一用户与第二用户的语音对话记录。
另一种可能的实施方式中,上述基于电子设备识别出第一语音信息是第二用户的语音信息,电子设备可以模拟第一用户的声音,按照第一用户与第二用户进行语音对话的对话方式,与第二用户进行语音对话。可以为,电子设备识别出第一语音是第二 用户的语音信息,电子设备模拟第一用户的声音,按照第一用户与第二用户进行语音对话的对话方式,发出第一语音的语音响应信息。如果电子设备在发出第一语音的语音响应信息之后,接收到第三语音信息,并且,电子设备识别出该第三语音是第二用户的语音信息。则电子设备识别出第三语音是第二用户的语音信息,电子设备可以模拟第一用户的声音,并按照第一用户与第二用户进行语音对话的对话方式,发出第三语音信息的语音响应信息。
可以理解的,当电子设备模拟第一用户与第二用户的对话方式响应第一语音信息,则电子设备在接收到第三语音信息后,需要识别出第三语音是第二用户发出的。则电子设备需要识别第三语音信息是第二用户的语音信息之后,发出响应第三语音信息的响应信息。假如电子设备与第二用户语音对话的环境中还有其他用户在发出语音信息,电子设备在接收到第三语音信息后,识别该第三语音信息是第二用户发出的,可以更好的与第二用户进行语音对话。从而提高语音交互功能,并提升用户体验。
另一种可能的实施方式中,电子设备可以获取第一用户的日程信息,该日程信息用于是指第一用户的日程安排。上述电子设备发出第三语音的语音响应信息可以为,电子设备参考该日程信息,发出第三语音信息的语音响应信息。
如果第三语音是第二用户发出用于询问第一用户日程安排的信息,由于电子设备已经获取到第一用户的日程信息,则电子设备可以直接根据日程信息响应第三语音信息。从而为第一用户提供个性化的交互体验。
另一种可能的实施方式中,电子设备可以保存上述电子设备模拟第一用户的声音,与第二用户的语音对话记录,电子设备还可以向第一用户的电子设备发送该语音对话记录。
电子设备向第一用户的电子设备发送语音对话记录,使得第一用户可以了解对话内容。电子设备为第二用户提供更个性化的语音交互。
另一种可能的实施方式中,电子设备保存上述电子设备模拟第一用户与第二用户的语音对话记录,电子设备还可以从上述的语音对话记录中提取语音对话中的关键字。电子设备可以向第一用户的电子设备发送该关键字。
另一种可能的实施方式中,电子设备模拟第一用户的声音,按照第一用户与第二用户语音对话的对话方式,与第二用户语音交互。电子设备还可以获取第二用户的图像信息和动作信息,并保存第二用户的图像信息和动作信息。
其中,电子设备在模拟第一用户与第二用户语音对话时,获取第二用户的图像信息和动作信息,可以学习第二用户与第一用户语音对话时的表情和动作。以便电子设备模拟第二用户与第一用户语音的对话方式。
第二方面,本申请还提供一种电子设备,该电子设备可以包括存储器、语音模块和一个或多个处理器。存储器、语音模块和一个或多个处理器耦合。
麦克风可以用于接收第一语音信息。存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当处理器执行计算机指令时,处理器用于,响应于第一语音信息,识别第一语音信息,第一语音信息用于请求与第一用户进行语音对话。基于第一语音被识别出是第二用户的语音信息,模拟第一用户的声音,按照第一用户和第二用户语音对话的对话方式,与第二用户语音对话。
在一种可能的实施方式中,电子设备还可以包括显示屏,显示屏与处理器耦合。显示屏用于显示第一用户的图像信息。
另一种可能的实施方式中,电子设备中保存有第一用户的人脸模型。电子设备中的显示屏还用于,模拟第一用户与第二用户进行语音对话的表情,显示人脸模型;其中,人脸模型中第一用户的表情动态变化。
另一种可能的实施方式中,麦克风还用于,获取第二语音信息,第二语音信息是第一用户与第二用户进行语音对话时的语音信息。
处理器,还用于分析第二语音信息,得到第一用户与第二用户进行语音对话时的语音特征,并保存语音特征。
其中,语音特征包括声纹特征、语气特征和用词特征,语气特征用于指示第一用户与第二用户进行语音对话时的语气,用词特征用于指示第一用户与第二用户进行语音对话时的惯用词汇。
另一种可能的实施方式中,处理器还用于在第二语音信息中,保存电子设备模拟第一用户与第二用户的语音对话记录。
另一种可能的实施方式中,麦克风还用于,接收第三语音信息。处理器还用于,响应于第三语音信息,识别第三语音信息。基于第三语音信息被识别出是第二用户的语音信息,电子设备模拟第一用户的声音,按照第一用户与第二用户进行语音对话的对话方式,扬声器还用于发出第三语音信息的语音响应信息。
另一种可能的实施方式中,处理器还用于获取第一用户的日程信息,日程信息用于指示第一用户的日程安排。其中,发出第三语音信息的语音响应信息,包括:电子设备参考日程信息,发出第三语音信息的语音响应信息。
另一种可能的实施方式中,处理器还用于,保存电子设备模拟第一用户的声音,与第二用户的语音对话记录;向第一用户的电子设备发送语音对话记录。
另一种可能的实施方式中,处理器还用于,保存电子设备模拟第一用户与第二用户的语音对话记录。从语音对话记录提取电子设备模拟第一用户与第二用户的语音对话的关键字;向第一用户的电子设备发送关键字。
另一种可能的实施方式中,电子设备还包括摄像头,摄像头与处理器耦合;摄像头用于获取第二用户的图像信息和动作信息,处理器还用于保存第二用户的图像信息和动作信息。
第三方面,本申请还提供一种服务器,该服务器可以包括:存储器和一个或多个处理器。存储器和一个或多个处理器耦合。其中,存储器用于存储计算机程序代码,计算机程序代码包括计算机指令。当处理器执行计算机指令时,使服务器执行上述第一方面及其任一种可能的实施方式中的方法。
第四方面,本申请还提供一种计算机可读存储介质,包括计算机指令,当计算机指令在电子设备上运行时,使得该电子设备可以执行上述第一方面及其任一种可能的实施方式中的方法。
第五方面,本申请还提供一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面及其任一种可能的实施方式中的方法。
可以理解的是,上述本申请提供的第二方面的电子设备,第三方面的服务器、第 四方面的计算机可读存储介质以及计算机程序产品所能达到的有益效果,可参考如第一方面及其任一种可能的设计方式中的有益效果,此处不再赘述。
附图说明
图1A为本申请实施例提供的一种系统架构图;
图1B为本申请实施例提供的另一种系统架构图;
图2A为本申请实施例提供的一种电子设备的结构示意图;
图2B为本申请实施例提供的一种电子设备的软件结构示意图;
图3A为本申请实施例提供的语音交互方式的流程图;
图3B为本申请实施例提供的一种智能音箱的显示界面示意图;
图4为本申请实施例提供的一种智能音箱的结构示意图。
具体实施方式
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
一般的具备语音交互功能的电子设备,可以根据识别到的语音信息发出对应的语音响应。但是,电子设备不能识别出语音信息是哪个用户发出的,也就是说,电子设备处于语音交互功能时,一旦识别出语音信息就会发出对应的语音响应。另外,电子设备发出的对应的语音响应也是固定的。电子设备的语音交互功能使得电子设备可以和用户语音对话,如果电子设备可以识别出发出语音信息的用户,可以根据发出语音信息的用户以及针对性的发出对应的语音响应。则可以为用户提供个性化的语音交互体验,从而提高用户使用电子设备进行语音交互的兴趣。
另外,电子设备一般不能“扮演”其他用户。其中,“扮演”的意思是电子设备在和用户2语音交互时,模拟用户1的声音,以及使用用户1与用户2的对话方式,与用户2语音交互。在一些实际情况中,如父母需要出门上班,不能随时和孩子沟通。如果电子设备可以“扮演”父亲或母亲与孩子语音对话,以满足孩子想要和家长沟通的想法。使得电子设备可以为孩子提供更个性化、人性化的语音交互。
本申请实施例提供一种语音交互方法,应用于电子设备。使得电子设备可以“扮演”用户1和用户2语音交互。提高了电子设备的语音交互性能,而且还可以为用户2提供个性化的交互体验。
示例性的,本申请实施例中的电子设备可以是手机、电视机、智能音箱、平板电脑、桌面型、膝上型、手持计算机、笔记本电脑、车载设备、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本,以及蜂窝电话、个人数字助理(personal digital assistant,PDA)、增强现实(augmented reality,AR)\虚拟现实(virtual reality,VR)设备等,本申请实施例对该电子设备的具体形态不作特殊限制。
以下将结合附图,对本申请实施例中的技术方案进行说明。
请参考图1A,为本申请实施例提供的一种系统架构图。假设电子设备“扮演”用户1,与用户2语音交互。如图1A电子设备可以采集用户2发出的语音信息,电子设备可以通过互联网与远程服务器交互,向服务器发送用户2的语音信息,由服务器生 成该语音信息对应的响应信息,并向电子设备发送生成的语音信息对应的响应信息。电子设备用于播放该语音信息对应的响应信息,以实现“扮演”用户1与用户2语音交互的目的。也就是说,电子设备可以采集并识别出用户2发出的语音信息,并且可以播放该语音信息对应的响应信息。这种实现方式,通过与电子设备连接的服务器识别用户2的语音信息,并生成语音信息对应的响应信息。电子设备播放语音信息对应的响应信息,可以降低电子设备的运算需求,降低电子设备的生产成本。
请参考图1B,为本申请实施例提供的另一种系统架构图。假设电子设备“扮演”用户1,与用户2语音交互。如图1B电子设备可以采集用户2发出的语音信息,电子设备根据语音信息识别出是用户2的语音信息,该语音信息是请求与用户1语音对话。电子设备根据该语音信息生成对应的响应信息,并播放该响应信息。这种实现方式中,电子设备可以实现语音交互,降低了电子设备对互联网的依赖。
请参考图2A,为本申请实施例提供的一种电子设备200的结构示意图。如图2A所示,该电子设备200可以包括处理器210,外部存储器接口220,内部存储器221,通用串行总线(universal serial bus,USB)接口230,充电管理模块240,电源管理模块241,电池242,天线1,天线2,移动通信模块250,无线通信模块260,音频模块270,传感器模块280,摄像头293和显示屏294等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备200的具体限定。在本申请另一些实施例中,电子设备200可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器210可以包括一个或多个处理单元,例如:处理器210可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备200的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器210中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器210中的存储器为高速缓冲存储器。该存储器可以保存处理器210刚用过或循环使用的指令或数据。如果处理器210需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器210的等待时间,因而提高了系统的效率。
在一些实施例中,处理器210可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
可以理解的是,本发明实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备200的结构限定。在本申请另一些实施例中,电子设备200也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
外部存储器接口220可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备200的存储能力。外部存储卡通过外部存储器接口220与处理器210通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器221可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器210通过运行存储在内部存储器221的指令,从而执行电子设备200的各种功能应用以及数据处理。内部存储器221可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备200使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器221可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。
充电管理模块240用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。电源管理模块241用于连接电池242,充电管理模块240与处理器210。电源管理模块241接收电池242和/或充电管理模块240的输入,为处理器210,内部存储器221,外部存储器,显示屏294,无线通信模块260和音频模块270等供电。
电子设备200的无线通信功能可以通过天线1,天线2,移动通信模块250,无线通信模块260,调制解调处理器以及基带处理器等实现。
移动通信模块250可以提供应用在电子设备200上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块250可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块250可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块250还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。
无线通信模块260可以提供应用在电子设备200上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。其中,无线通信模块260可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块260经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器210。无线通信模块260还可以从处理器210接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
显示屏294用于显示图像,视频等。显示屏294包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix  organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备200可以包括1个或N个显示屏294,N为大于1的正整数。
摄像头293用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备200可以包括1个或N个摄像头293,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备200在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备200可以支持一种或多种视频编解码器。这样,电子设备200可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
电子设备200可以通过音频模块270,扬声器270A,麦克风270B,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块270用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块270还可以用于对音频信号编码和解码。在一些实施例中,音频模块270可以设置于处理器210中,或将音频模块270的部分功能模块设置于处理器210中。
扬声器270A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备200可以通过扬声器270A收听音乐,或收听免提通话。在一些实施例中,扬声器270A可以播放语音信息的响应信息。
麦克风270B,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风270B发声,将声音信号输入到麦克风270B。例如,麦克风270B可以采集用户发出的语音信息。电子设备200可以设置至少一个麦克风270B。在一些实施例中,电子设备200可以设置两个麦克风270B,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备200还可以设置三个,四个或更多麦克风270B,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
电子设备200的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本发明实施例以分层架构的Android系统为例,示例性说明电子设备200的软件结构。
图2B是本发明实施例的电子设备200的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程 序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图2B所示,应用程序包可以包括相机,图库,日历,WLAN,语音对话,蓝牙,音乐,视频等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图2B所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。例如,该数据可以是用户2的声纹特征,用户2和用户1的关系等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。
电话管理器用于提供电子设备200的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒,蓝牙配对成功提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG, PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
以下实施例中的方法均可以在具备上述硬件结构的电子设备中实现。
请参考图3A,为本申请实施例提供的语音交互方法的流程图。其中,本申请实施例以电子设备是智能音箱,智能音箱“扮演”用户1与用户2进行语音对话为例,对语音交互方法进行具体说明。如图3A所示,该方法包括步骤301-步骤304。
步骤301:用户2向智能音箱发出第一语音信息。
其中,第一语音信息用于请求智能音箱“扮演”用户1与自己(用户2)语音对话。
在一种可能的场景中,家庭中父母外出上班,孩子在家需要父母的陪伴,想要跟父母语音对话是,便可以向家里的智能音箱发出语音信息,请求智能音箱“扮演”父亲或母亲陪伴自己。例如,第一语音信息可以是“我想和爸爸说话”或者“音箱音箱,我想和爸爸说话”。
可以理解的,智能音箱需要唤醒才能工作,智能音箱的唤醒语音可以是固定的。在一些实施例中,在步骤301之前,用户2可以先向智能音箱发出唤醒词,使得智能音箱处于唤醒状态。
实现方式1,该唤醒词可以是“音箱音箱”、“智能音箱”或者“语音音箱”等。该唤醒词可以预先配置在智能音箱中,也可以由用户在智能音箱中设置。在该实施例中,上述第一语音信息可以不包括唤醒词。例如,该第一语音信息可以是“我想和爸爸说话”。
实现方式2,第一语音信息中可以包括上述唤醒词,还可以包括用户2向智能音箱发出的语音命令。例如,该第一语音信息可以是“音箱音箱,我想和爸爸说话”。
步骤302:智能音箱接收用户2发出的第一语音信息。
其中,智能音箱未被唤醒时,处于休眠状态。用户2想要使用智能音箱时,可以对语音助手进行语音唤醒。语音唤醒过程可以包括:智能音箱通过低功耗的数字信号处理器(Digital Signal Processing,DSP)监测语音数据。当DSP监测到语音数据与上述唤醒词的相似度满足一定条件时,DSP将监测到的语音数据交给应用处理器(Application Processor,AP)。由AP对上述语音数据进行文本校验,以判断该语音数据是否可以唤醒智能音箱。
可以理解的,智能音箱处于休眠状态时,可以随时监听用户发出的语音信息,如果该语音信息不是唤醒自己(智能音箱)工作的唤醒语音,智能音箱不会响应该语音信息,也不会记录该语音信息。
在上述实现方式1中,智能音箱处于唤醒状态,则第一语音信息中可以不包括智能音箱的唤醒词,智能音箱接收到第一语音信息并响应于第一语音信息。
在上述实现方式2中,智能音箱处于休眠状态,则第一语音信息中包括智能音箱的唤醒词,智能音箱接收到第一语音信息被唤醒,并响应于第一语音信息。
步骤303:智能音箱响应于第一语音信息,识别第一语音信息;并确定第一语音信息用于请求与用户1语音对话。
其中,智能音箱可以对第一语音信息进行文本识别,根据文本识别的结果,确定第一语音信息用于请求与用户1语音对话。也就是说,第一语音信息中包括智能音箱要“扮演”的角色的名字或称谓,使得智能设备可以根据名字或称谓识别出要“扮演”的角色。当智能音箱识别出第一语音信息中的名字时,可以确定出要“扮演”的角色。例如,第一语音信息为“我要和李明说话”,智能音箱可以确定要“扮演”的角色为李明。当智能音箱识别出第一语音信息中的称谓时,可以确定发出第一语音信息的是用户2,智能音箱根据用户2和第一语音信息中称谓的关系确定出要“扮演”的角色。
以智能音箱的使用场景是家庭环境为例,智能音箱中可以预先存储家庭成员关系。则智能音箱接收到第一语音信息后,可以根据家庭成员关系确定出要“扮演”的角色。
示例一:智能音箱接收到的用户2发出的第一语音信息为“我想和爸爸说话”,智能音箱识别第一语音信息之后,识别出“爸爸”这个称谓,可以确定用户2和要扮演的角色是父子关系。智能音箱可以识别出第一语音信息是孩子“李小明”发出的,并且根据预存的家庭成员关系中李小明和李明的父子关系确定出要“扮演”的是“李明”(爸爸)的角色。
示例二:智能音箱接收到的用户2发出的第一语音信息为“我想和李明说话”,智能音箱识别出第一语音信息中包括的人名“李明”。智能音箱还可以识别出第一语音信息是“李小明”发出的,智能音箱根据预存的家庭成员关系确定李明和李小明是父子关系,智能音箱确定要“扮演”的角色是李小明的“爸爸”(李明)。
以智能音箱应用于家庭场景为例,智能音箱初始设置时,需要将家庭中的成员之间关系录入智能音箱中。在一种可能的实施方式中,智能音箱在获取到一个家庭人物和已知的另一家庭人物之间的关系即可,智能音箱可以推断出该家庭人物与其他的家庭人物之间的关系。例如,在家庭成员中包括:爷爷、奶奶、爸爸、妈妈和孩子。如果已经输入爷爷、奶奶和妈妈,当输入爸爸之后,可以只说明爸爸和妈妈是夫妻关系即可。智能音箱可以根据妈妈和爸爸的关系推断出爸爸和爷爷是父子关系,以及爸爸和奶奶是母子关系。其中,上述推理实现可以通过知识图谱等技术实现。
在一些实施例中,预先存储家庭成员信息可以包括:姓名、年龄、性别、通讯方式、声音信息、图像信息、喜好和性格等。同时,记录该家庭成员与已有家庭成员之间的关系信息。其中,在记录每个家庭成员的信息还可以记录该成员的称谓,如“爸爸”、“爷爷”、“李老爷子”和“李先生”等,其中,“爸爸”和“李先生”都是指李明;“爷爷”和“李老先生”都是指李明的父亲。例如,李小明是用户2,第一语音信息为“我想和李先生对话”或者,第一语音信息为“我想和爸爸对话”,智能音箱可以确定包“扮演”的是爸爸李明。
步骤304:基于第一语音信息被识别出是用户2的语音信息,智能音箱可以模拟用户1的声音,并按照用户1与用户2语音对话的对话方式,发出第一语音信息的响应信息。
可以理解的,智能音箱识别出第一语音是用户2发出的语音信息,并且,智能音箱可以按照用户1与用户2对话的对话方式生成第一语音信息的响应信息。也就是说, 智能音箱可以“扮演”用户1,推测用户1听到用户2发出的第一语音信息后可能回应的信息。
其中,智能音箱中预先存储有用户1的声音,以及用户1和用户2的对话方式。智能音箱按照用户1和用户2的对话方式,并且模拟用户1的声音发出语音信息,使得用户2以为真的在和用户1语音对话。从而为用户2提供个性化的语音交互体验。
一方面,智能音箱可以分析用户1的声音,包括分析用户1的声纹特征。其中,每个人的声纹特征都是独特的,因此可以根据声音中的声纹特征辨别发声人。智能音箱分析用户1的声音并保存用户1的声纹特征,以便智能音箱要“扮演”用户1时可以模拟用户1的声音。
具体地说,智能音箱接收到用户1的语音信息,就可以分析出用户1的声纹特征。智能音箱保存用户1的声纹特征,使得智能音箱确定要“扮演”用户1时,可以根据存储用户1的声纹特征模拟用户1的声音。可以理解的,智能音箱和用户1语音对话时,可以根据用户1的声音变化更新用户1的声纹特征。或者,随着时间的改变,智能音箱可以在间隔预设时间之后,和用户1语音对话时更新用户1的声纹特征。
另一方面,用户1和用户2的对话方式可以体现用户1的语言表达特点,用户1和用户2的对话方式包括用户1和用户2语音对话的语气和用词。其中,一个人和不同的人语音对话时,因为对话的人不同可能有不同的语气。例如,一个人和爱人沟通时语气温柔,和家里长辈沟通时语气尊敬。因此,智能音箱可以根据要“扮演”的角色与用于2的关系,推测要扮演的用户1的语气。用户1和用户2语音对话时的用词也可以体现用户1的语言表达特点,使得智能音箱根据用户1和用户2语音对话时的用词生成的第一语音信息的响应信息,更接近用户1的语言表达。智能音箱可以模拟用户1的声音,并按照用户1和用户2的对话方式发出第一语音信息的响应信息,使得用户2以为在和用户1语音对话。
具体地说,用户1与用户2语音对话的对话可以方式包括:用户1和用户2对话时的语气、用词习惯(例如,口头禅)以及语音表达习惯等。用户1和用户2对话时的语气包括,严肃、温柔、严厉、慢悠悠以及咄咄逼人等。用词习惯是一个人说话时的语言表达特点,例如,讲话时习惯使用“然后”,“就是”,“是的”,“明白了么”等词语。语音表达习惯能够体现一个人的语言表达特点。例如,有人说话喜欢说倒装句,如“你饭吃了没?”、“那我走先了”等。
示例性的,智能音箱中可以预先存储用户1和用户2的语音对话,智能音箱可以学习这段语音对话。了解用户1和用户2语音对话时的语气、用词习惯和语言表达特点等信息,并将了解到的信息保存到该人物的对话信息中。其中,对话信息中可以保存该人物与其他任务对话的对话信息。如果智能音箱接收到用户2请求智能音箱“扮演”用户1对话,智能音箱可以根据存储的用户1与用户2的对话方式发出语音对话。
可以理解的,智能音箱得到的用户1和用户2之间的语音对话越多,智能音箱学习总结到的用户1和用户2的对话方式信息就越准确。智能音箱“扮演”用户1时,智能音箱发出的第一语音信息的响应信息就越接近用户1会给出的语音回复。同理,智能音箱也能够在用户1和用户2的语音对话中学习到用户2和用户1对话时的对话方式,并将用户2的对话方式作为用户2的信息存储到用户2的对话信息中。
又示例性的,如果智能音箱中并没有存储过用户1和用户2的语音对话,智能音箱可以根据用户1和用户2的关系推断出用户1可能使用的语气。例如,智能音箱识别出用户1和用户2的关系是父子,并且智能音箱要“扮演”父亲的角色,智能音箱可以默认用户1的语气是严厉。
其中,智能音箱根据用户1和用户2的关系推断出用户1发出语音响应时的语气。智能音箱推断出的用户1的语气可以是至少一种,例如,智能音箱确定用户1和用户2的关系是祖孙,智能音箱推断的用户1的语气是宠爱、慢慢的和开心的。
在一些实施例中,智能音箱具有显示屏,则智能音箱“扮演”用户1和用于2语音对话时,可以在显示屏上显示用户1的照片。如图3B所示,图3B中智能音箱的显示屏上显示有用户1的照片。或者,智能音箱中存储有用户1的人脸模型,则智能音箱在“扮演”用户1与用户2语音对话时,可以在显示屏上显示用户1的表情动态变化。
另外,智能音箱“扮演”用户1和用户2语音交互时,智能音箱也可以开启摄像头获取用户2的图像信息。智能音箱识别获取到的用户2的图像信息,也就是获取用户2的外貌动作等信息。这样,智能音箱可以通过用户2的语音信息和用户2的图像信息建立用户2的人物模型。智能音箱建立用户2的人物模型,可以方便以后“扮演”用户2时更形象、生动。
示例性的,智能音箱“扮演”用户1和用户2语音交互时,智能音箱也可以开启摄像头获取用户2表情、动作等。以便智能音箱可以通过用户2的语音信息和用户2的图像信息建立用户2的人物模型,确定出用户2和用户1对话时的动作信息和表情信息。
假如用户2与智能音箱语音交互的过程中,智能音箱接收到用户2的语音信息询问用户1的日程安排。智能音箱可以获取用户1的日程信息,日程信息用于指示用户的日程安排。这样一来,智能音箱就可以根据用户1的日程信息响应询问日程安排的语音信息。例如,智能音箱“扮演”李明与儿子李小明语音对话,儿子李小明发出语音信息询问父亲的日程安排。假设该语音信息为“我周五的毕业典礼你来么”,智能音箱通过查询用户1(即爸爸)的日程信息,确定爸爸的日程安排中周五要去出差,智能音箱可以回复“儿子,爸爸刚收到公司的通知,得出差北京参加一个重要的会,可能没法参加你周五的毕业典礼了”。
值得一提的是,智能音箱还可以保存每次“扮演”角色的对话信息。并且在下一次“扮演”角色时,如果涉及相关的日程信息,智能音箱可以将更新的日程安排反馈给用户2。又例如,上述智能音箱“扮演”爸爸和李小明的语音对话结束后。智能音箱“扮演”小明和小明妈妈(用户2)语音对话。小明妈妈发出的语音信息为“儿子,周五的毕业典礼我和你爸陪你去参加”,智能音箱可以根据上次“扮演”爸爸的语音对话可以回复“我爸说他需要出差北京参加会议,没办法参加我的毕业典礼了”。
需要说明的是,上述步骤301-步骤304是用户2和智能音箱的一次对话,步骤304之后,智能音箱可以继续和用户2语音对话。例如,用户2再次向智能音箱发出语音信息,智能音箱接收到该语音信息之后,基于该语音信息是用户2发出的语音信息。智能音箱继续模拟用户1的声音、并按照用户1和用户2的对话方式与用户2语音对 话。也就是说,智能音箱继续接收到用户2的语音信息,才会模拟用户1的声音,并按照用户1和用户2的对话方式发出语音信息。如果该语音信息不是用户2发出的,智能音箱可以不模拟用户1的声音。
在一些实施例中,智能音箱每次响应用户2的语音信息之后,可以等待预设时间。其中,等待的预设时间是用户2的反应时间,使得智能音箱可以保持与用户2的语音对话。如果预设时间内没有接收到用户2的语音信息,智能音箱可以结束此次语音对话。
示例性的,假如智能音箱确定与用户2的语音对话结束了,可以将此次语音对话的内容发送给用户1的电子设备,以供用户1了解智能音箱“扮演”他(用户1)与用户2的对话详情。或者,智能音箱确定与用户2的语音对话结束了,智能音箱可以总结此次语音对话的摘要,将语音对话的摘要发送给用户1的电子设备。使得用户1可以简单了解到智能音箱“扮演”他(用户1)与用户2的对话情况。
在一种实施例中,智能音箱可以在接收到语音对话结束之后,经过预设时间之后再将语音对话的摘要发送给用户1的电子设备。例如,用户2为小明妈妈,智能音箱扮演的用户1为小明。如果小明妈妈准备出门买菜,对智能音箱说道“妈妈出门去买菜了,你要先写完作业才可以看电视”。稍后,用户2为小明奶奶,智能音箱扮演的用户1为小明。如果小明奶奶准备出门散步,对智能音箱说道“奶奶去散步了,给你留了个蛋糕在冰箱里,记得拿去吃”。经过预设时间之后,智能音箱对发生在不同角色和小明之间的对话进行文本摘要和汇总,然后生成对话摘要,该摘要可以为“妈妈提醒要及时完成作业,奶奶给你留了蛋糕在冰箱”。智能音箱可以通过通信方式(如短信)向小明的手机发送该对话摘要。
通过上述方式,智能音箱可以识别出第一语音信息是用户2发出的,并且可以识别出第一语音信息指示智能音箱“扮演”用户1。响应于第一语音信息,智能音箱可以模拟用户1的声音,按照用户1和用户2的对话方式发出第一语音信息的响应信息。这样就实现可智能音箱“扮演”用户1与用户2语音对话的目的。这种语音交互方式提高了智能音箱的交互性能,并且可以为用户2提供个性化的语音交互体验。
可以理解的是,上述智能音箱为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请实施例能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
本申请实施例可以根据上述方法示例对智能音箱进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
如图4所示,为上述实施例中所涉及的智能音箱的一种可能的结构示意图。智能音箱可以包括:语音识别模块401、关系推理模块402、角色扮演模块403、知识预存 模块404、角色信息知识库405和音频模块406。可选的,该智能音箱中还可以包括摄像头模块、通信模块和传感器模块等。
其中,语音识别模块401用于识别智能音箱接收到的第一语音信息。关系推理模块402用于根据已有的家庭人物关系推理出新录入的人物与已有家庭人物的关系。角色扮演模块403用于智能音箱可以模拟用户1的声音,并发出第一语音信息对应的响应信息。知识预存模块404用于存储每个用户的信息,以便角色扮演模块403获取用户信息,使得角色扮演模块403可以根据用户信息生成语音信息对应的响应信息。角色信息知识库405用于存储用户的对话信息,并且可以根据第一语音信息生成该语音信息的响应信息。
在一些实施例中,智能音箱还可以包括总结摘要模块。总结摘要模块用于提取对话信息中的关键词,将关键词作为对话信息的摘要;或者,用于总结对话信息的信息。其中,总结摘要模块可以向智能音箱“扮演”的用户1的智能设备发送对话信息的摘要。或者,智能音箱中的通信模块将总结摘要模块提取对话信息中的关键字发送给智能音箱“扮演”的用户1的智能设备。
当然,上述智能音箱中的单元模块包括但不限于上述语音识别模块401、关系推理模块402、角色扮演模块403、知识预存模块404、角色信息知识库405和音频模块406等。例如,智能音箱中还可以包括存储模块。存储模块用于保存电子设备的程序代码和数据。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序代码,当上述处理器执行该计算机程序代码时,智能音箱可以执行图3A中相关方法步骤实现上述实施例中的方法。
本申请实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行图3A中相关方法步骤实现上述实施例中的方法。
其中,本申请实施例提供的智能音箱、计算机存储介质或者计算机程序产品均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以使用硬件的形式实现,也可以使用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、ROM、磁碟或者光盘等各种可以存储程序代码的介质。
以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (23)

  1. 一种语音交互方法,其特征在于,所述方法包括:
    电子设备接收第一语音信息;
    响应于所述第一语音信息,所述电子设备识别所述第一语音信息,所述第一语音信息用于请求与第一用户进行语音对话;
    基于所述第一语音信息被识别出是第二用户的语音信息,所述电子设备模拟所述第一用户的声音,按照所述第一用户与所述第二用户进行语音对话的对话方式,与所述第二用户进行语音对话。
  2. 根据权利要求1所述的方法,其特征在于,所述对话方式用于指示所述第一用户与所述第二用户进行语音对话的语气和用词。
  3. 根据权利要求1或2所述的方法,其特征在于,所述电子设备中保存有所述第一用户的图像信息;所述方法还包括:
    所述电子设备显示所述第一用户的图像信息。
  4. 根据权利要求1或2所述的方法,其特征在于,所述电子设备中保存有所述第一用户的人脸模型;所述方法还包括:
    所述电子设备模拟所述第一用户与所述第二用户进行语音对话的表情,显示所述人脸模型;其中,所述人脸模型中所述第一用户的表情动态变化。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,在所述电子设备接收第一语音信息之前,所述方法还包括:
    所述电子设备获取第二语音信息,所述第二语音信息是所述第一用户与所述第二用户进行语音对话时的语音信息;
    所述电子设备分析所述第二语音信息,得到所述第一用户与所述第二用户进行语音对话时的语音特征,并保存所述语音特征;
    其中,所述语音特征包括声纹特征、语气特征和用词特征,所述语气特征用于指示所述第一用户与所述第二用户进行语音对话时的语气,所述用词特征用于指示所述第一用户与所述第二用户进行语音对话时的惯用词汇。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    所述电子设备在所述第二语音信息中,保存所述电子设备模拟所述第一用户与所述第二用户的语音对话记录。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,所述基于所述第一语音信息被识别出是第二用户的语音信息,所述电子设备模拟所述第一用户的声音,按照所述第一用户与所述第二用户进行语音对话的对话方式,与所述第二用户进行语音对话,包括:
    基于所述第一语音信息被识别出是所述第二用户的语音信息,所述电子设备模拟所述第一用户的声音,按照所述第一用户与所述第二用户进行语音对话的对话方式,发出所述第一语音信息的语音响应信息;
    所述电子设备接收第三语音信息;
    响应于所述第三语音信息,所述电子设备识别所述第三语音信息;
    基于所述第三语音信息被识别出是所述第二用户的语音信息,所述电子设备模拟 所述第一用户的声音,按照所述第一用户与所述第二用户进行语音对话的对话方式,发出所述第三语音信息的语音响应信息。
  8. 根据权利要求7所述的方法,其特征在于,所述方法还包括:
    所述电子设备获取所述第一用户的日程信息,所述日程信息用于指示所述第一用户的日程安排;
    其中,所述发出所述第三语音信息的语音响应信息,包括:
    所述电子设备参考所述日程信息,发出所述第三语音信息的语音响应信息。
  9. 根据权利要求1-8中任一项所述的方法,其特征在于,所述方法还包括:
    所述电子设备保存所述电子设备模拟所述第一用户的声音,与所述第二用户的语音对话记录;
    所述电子设备向所述第一用户的电子设备发送所述语音对话记录。
  10. 根据权利要求1-9中任一项所述的方法,其特征在于,所述方法还包括:
    所述电子设备保存所述电子设备模拟所述第一用户与所述第二用户的语音对话记录;
    所述电子设备从所述语音对话记录提取所述电子设备模拟所述第一用户与所述第二用户的语音对话的关键字;
    所述电子设备向所述第一用户的电子设备发送所述关键字。
  11. 根据权利要求1-10中任一项所述的方法,其特征在于,所述方法还包括:
    所述电子设备获取第二用户的图像信息和动作信息,并保存所述第二用户的图像信息和动作信息。
  12. 一种电子设备,其特征在于,所述电子设备包括:存储器、麦克风、扬声器和处理器;所述存储器、所述麦克风和所述扬声器与所述处理器耦合;所述麦克风用于接收第一语音信息;所述存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当所述处理器执行上述计算机指令时,
    所述处理器,用于响应于所述第一语音信息,识别所述第一语音信息,所述第一语音信息用于请求与第一用户进行语音对话;
    基于所述第一语音信息被识别出是第二用户的语音信息,模拟所述第一用户的声音,按照所述第一用户与所述第二用户进行语音对话的对话方式,与所述第二用户进行语音对话;
    所述扬声器用于发出所述第一语音信息对应的响应信息。
  13. 根据权利要求12所述的电子设备,其特征在于,所述电子设备还包括显示屏,所述显示屏与所述处理器耦合;所述显示屏用于显示所述第一用户的图像信息。
  14. 根据权利要求13所述的电子设备,其特征在于,所述电子设备中保存有所述第一用户的人脸模型;
    所述显示屏还用于,模拟所述第一用户与所述第二用户进行语音对话的表情,显示所述人脸模型;其中,所述人脸模型中所述第一用户的表情动态变化。
  15. 根据权利要求12-14任一项所述的电子设备,其特征在于,
    所述麦克风,还用于获取第二语音信息,所述第二语音信息是所述第一用户与所述第二用户进行语音对话时的语音信息;
    所述处理器,还用于分析所述第二语音信息,得到所述第一用户与所述第二用户进行语音对话时的语音特征,并保存所述语音特征;
    其中,所述语音特征包括声纹特征、语气特征和用词特征,所述语气特征用于指示所述第一用户与所述第二用户进行语音对话时的语气,所述用词特征用于指示所述第一用户与所述第二用户进行语音对话时的惯用词汇。
  16. 根据权利要求15所述的电子设备,其特征在于,所述处理器还用于在所述第二语音信息中,保存所述电子设备模拟所述第一用户与所述第二用户的语音对话记录。
  17. 根据权利要求12-16中任一项所述的电子设备,其特征在于,
    所述麦克风还用于,接收第三语音信息;
    所述处理器还用于,响应于所述第三语音信息,识别所述第三语音信息;
    基于所述第三语音信息被识别出是所述第二用户的语音信息,所述电子设备模拟所述第一用户的声音,按照所述第一用户与所述第二用户进行语音对话的对话方式,所述扬声器还用于发出所述第三语音信息的语音响应信息。
  18. 根据权利要求17所述的电子设备,其特征在于,所述处理器还用于获取所述第一用户的日程信息,所述日程信息用于指示所述第一用户的日程安排;
    其中,所述发出所述第三语音信息的语音响应信息,包括:
    所述电子设备参考所述日程信息,发出所述第三语音信息的语音响应信息。
  19. 根据权利要求12-18任一项所述的电子设备,其特征在于,所述处理器还用于,保存所述电子设备模拟所述第一用户的声音,与所述第二用户的语音对话记录;
    向所述第一用户的电子设备发送所述语音对话记录。
  20. 根据权利要求12-19任一项所述的电子设备,其特征在于,所述处理器还用于,
    保存所述电子设备模拟所述第一用户与所述第二用户的语音对话记录;
    从所述语音对话记录提取所述电子设备模拟所述第一用户与所述第二用户的语音对话的关键字;
    向所述第一用户的电子设备发送所述关键字。
  21. 根据权利要求12-20任一项所述的电子设备,其特征在于,所述电子设备还包括摄像头,所述摄像头与所述处理器耦合;
    所述摄像头用于获取第二用户的图像信息和动作信息,所述处理器还用于保存所述第二用户的图像信息和动作信息。
  22. 一种服务器,其特征在于,包括存储器和一个或多个处理器;所述存储器和一个或多个所述处理器耦合;
    其中,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述处理器执行所述计算机指令时,使所述服务器执行如权利要求1-11中任一项所述的方法。
  23. 一种计算机可读存储介质,其特征在于,包括计算机指令,当所述计算机指令在电子设备上运行时,使得所述电子设备执行如权利要求1-11任一项所述的方法。
PCT/CN2021/077514 2020-03-27 2021-02-23 一种语音交互方法及电子设备 WO2021190225A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21774325.1A EP4116839A4 (en) 2020-03-27 2021-02-23 VOICE INTERACTION METHOD AND ELECTRONIC DEVICE
US17/952,401 US20230017274A1 (en) 2020-03-27 2022-09-26 Voice interaction method and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010232268.3 2020-03-27
CN202010232268.3A CN113449068A (zh) 2020-03-27 2020-03-27 一种语音交互方法及电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/952,401 Continuation US20230017274A1 (en) 2020-03-27 2022-09-26 Voice interaction method and electronic device

Publications (1)

Publication Number Publication Date
WO2021190225A1 true WO2021190225A1 (zh) 2021-09-30

Family

ID=77808191

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077514 WO2021190225A1 (zh) 2020-03-27 2021-02-23 一种语音交互方法及电子设备

Country Status (4)

Country Link
US (1) US20230017274A1 (zh)
EP (1) EP4116839A4 (zh)
CN (1) CN113449068A (zh)
WO (1) WO2021190225A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500419A (zh) * 2022-02-11 2022-05-13 阿里巴巴(中国)有限公司 信息交互方法、设备以及系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11893985B2 (en) * 2021-01-15 2024-02-06 Harman International Industries, Incorporated Systems and methods for voice exchange beacon devices

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107026943A (zh) * 2017-03-30 2017-08-08 联想(北京)有限公司 语音交互方法及系统
WO2018195276A1 (en) * 2017-04-19 2018-10-25 Cyara Solutions Pty Ltd Automated contact center agent workstation testing
CN110633357A (zh) * 2019-09-24 2019-12-31 百度在线网络技术(北京)有限公司 语音交互方法、装置、设备和介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9634855B2 (en) * 2010-05-13 2017-04-25 Alexander Poltorak Electronic personal interactive device that determines topics of interest using a conversational agent
CN108962217B (zh) * 2018-07-28 2021-07-16 华为技术有限公司 语音合成方法及相关设备
TW202009924A (zh) * 2018-08-16 2020-03-01 國立臺灣科技大學 音色可選之人聲播放系統、其播放方法及電腦可讀取記錄媒體

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107026943A (zh) * 2017-03-30 2017-08-08 联想(北京)有限公司 语音交互方法及系统
WO2018195276A1 (en) * 2017-04-19 2018-10-25 Cyara Solutions Pty Ltd Automated contact center agent workstation testing
CN110633357A (zh) * 2019-09-24 2019-12-31 百度在线网络技术(北京)有限公司 语音交互方法、装置、设备和介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4116839A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500419A (zh) * 2022-02-11 2022-05-13 阿里巴巴(中国)有限公司 信息交互方法、设备以及系统

Also Published As

Publication number Publication date
EP4116839A4 (en) 2023-03-22
CN113449068A (zh) 2021-09-28
EP4116839A1 (en) 2023-01-11
US20230017274A1 (en) 2023-01-19

Similar Documents

Publication Publication Date Title
WO2021063343A1 (zh) 语音交互方法及装置
WO2020192456A1 (zh) 一种语音交互方法及电子设备
RU2766255C1 (ru) Способ голосового управления и электронное устройство
WO2021027267A1 (zh) 语音交互方法、装置、终端及存储介质
WO2021036714A1 (zh) 一种语音控制的分屏显示方法及电子设备
US20230017274A1 (en) Voice interaction method and electronic device
WO2020006711A1 (zh) 一种消息的播放方法及终端
CN113778663A (zh) 一种多核处理器的调度方法及电子设备
US11991253B2 (en) Intelligent layer to power cross platform, edge-cloud hybrid artificial intelligence services
US11495223B2 (en) Electronic device for executing application by using phoneme information included in audio data and operation method therefor
WO2022161077A1 (zh) 语音控制方法和电子设备
WO2020259514A1 (zh) 一种调用服务的方法及装置
WO2023015961A1 (zh) 一种播放界面的显示方法及电子设备
WO2022143258A1 (zh) 一种语音交互处理方法及相关装置
WO2022194190A1 (zh) 调整触摸手势的识别参数的数值范围的方法和装置
KR102369309B1 (ko) 파셜 랜딩 후 사용자 입력에 따른 동작을 수행하는 전자 장치
WO2021238371A1 (zh) 生成虚拟角色的方法及装置
CN115083401A (zh) 语音控制方法及装置
EP4293664A1 (en) Voiceprint recognition method, graphical interface, and electronic device
WO2022188551A1 (zh) 信息处理方法与装置、主控设备和受控设备
WO2023006033A1 (zh) 语音交互方法、电子设备及介质
WO2022135254A1 (zh) 一种编辑文本的方法、电子设备和系统
WO2020253694A1 (zh) 一种用于识别音乐的方法、芯片和终端
WO2023207149A1 (zh) 一种语音识别方法和电子设备
WO2023125514A1 (zh) 设备控制方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21774325

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021774325

Country of ref document: EP

Effective date: 20221007

NENP Non-entry into the national phase

Ref country code: DE