WO2023202635A1 - Voice interaction method, electronic device, and storage medium - Google Patents

Voice interaction method, electronic device, and storage medium - Download PDF

Info

Publication number
WO2023202635A1
WO2023202635A1 (PCT/CN2023/089278; CN2023089278W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
electronic device
voice signal
signal
Prior art date
Application number
PCT/CN2023/089278
Other languages
English (en)
French (fr)
Inventor
吴鹏扬
曾俊飞
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023202635A1 publication Critical patent/WO2023202635A1/zh


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the field of software technology, and in particular, to a voice interaction method, electronic device, and computer-readable storage medium.
  • With the development of audio processing technology and artificial intelligence (AI), more and more electronic devices (such as smart speakers and smart robots) have voice interaction functions. During voice interaction, the electronic device needs to collect the user's voice. To improve the accuracy of voice collection, the electronic device collects sound in the target direction (that is, the direction of the current interacting person) and suppresses sound from directions other than the target direction, so as to reduce the interference of environmental noise with the user's voice signal.
  • the electronic device may be in a multi-user interaction scenario. That is, in addition to the current interacting person, there are other users (called “potential interacting persons") who may perform voice interaction with the electronic device around the electronic device. Since the electronic device only collects the voice in the direction of the current interacting person, the voice of the potential interacting person will be suppressed. When a potential interactor speaks, the electronic device cannot perceive the content of his or her voice and therefore cannot respond to it.
  • Some embodiments of the present application provide a voice interaction method, an electronic device, and a computer-readable storage medium.
  • The present application is introduced below from multiple aspects; the implementations and beneficial effects of these aspects may be cross-referenced.
  • Embodiments of the present application provide a voice interaction method for an electronic device.
  • The method includes: performing voice interaction with a first user, and, during a voice collection period of the voice interaction, collecting a first audio signal in the angle range of the first user and a second audio signal in the angle range of a second user, the second user being a user who performed voice interaction with the electronic device within a set historical period; determining whether the start time of a first voice signal in the first audio signal falls within a first period, and determining a target voice signal from the first voice signal and a second voice signal according to the determination result, the second voice signal being the voice signal contained in the second audio signal, and the first period being the period of a first duration after the start time of the voice collection period; and responding to the target voice signal.
  • In this way, the target interactor can be determined between the first user and the second user based on the first duration, the voice interaction needs of the first user and the second user can be reasonably balanced, and the target interactor can be accurately determined, thereby improving the user experience in multi-person interaction scenarios.
  • In some embodiments, at least one of the start time of the first voice signal and the start time of the second voice signal falls within the first period. Determining the target voice signal from the first voice signal and the second voice signal according to the determination result includes: if the start time of the first voice signal is within the first period, determining the first voice signal as the target voice signal; otherwise, determining the target voice signal based on the temporal overlap state of the second voice signal and the first voice signal.
  • In that case, the electronic device determines the first user as the target interactor (that is, keeps the first user as the current interactor), so as to give priority to satisfying the first user's voice interaction needs.
  • If the first user does not speak within the first period, the first user's willingness to interact is considered low, so the second user may be determined as the target interactor to take the second user's voice interaction needs into account.
  • In some embodiments, determining the target voice signal according to the temporal overlap state of the second voice signal and the first voice signal includes: if the second voice signal and the first voice signal overlap in time, determining the first voice signal as the target voice signal; if the second voice signal and the first voice signal do not overlap in time, determining the second voice signal as the target voice signal.
  • In some embodiments, the first duration is determined based on the first user's interaction willingness value P and/or the number of interaction rounds M between the first user and the electronic device within a set time period, where the interaction willingness value P is used to characterize the first user's willingness to interact with the electronic device by voice.
  • the interaction willingness value P is determined according to the facial angle of the first user and/or the distance between the first user and the electronic device.
  • The first duration is T = k1 × P + k2 × min{M, n}, where k1 and k2 are preset constants, and n is an integer between 3 and 6.
  • the second voice signal does not include the wake word of the electronic device.
  • In some embodiments, the set historical period is the period of a second duration before the start time of the voice collection period.
  • Embodiments of the present application provide an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor which, when executing the instructions in the memory, causes the electronic device to execute the voice interaction method provided by any embodiment of the first aspect of this application.
  • the beneficial effects that can be achieved in the second aspect can be referred to the beneficial effects of any embodiment of the first aspect of the application, and will not be described again here.
  • Embodiments of the present application provide a computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to execute the voice interaction method provided by any embodiment of the first aspect of this application. For the beneficial effects achievable in the third aspect, refer to the beneficial effects of any embodiment of the first aspect of this application; details are not repeated here.
  • Figure 1 is an exemplary application scenario of an embodiment of the present application
  • Figure 2 is an exemplary structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 3 is an exemplary flow chart of the voice interaction method provided by the embodiment of the present application.
  • Figure 4 is a time sequence diagram of voice interaction between an electronic device and a current interacting person provided by an embodiment of the present application
  • Figure 5 is an exemplary flow chart of the user voice collection process provided by the embodiment of the present application.
  • Figure 6 is a schematic diagram of the angle range of the user provided by the embodiment of the present application.
  • Figure 7 is an exemplary flow chart of a target speech signal determination method provided by an embodiment of the present application.
  • Figure 8A is a schematic diagram 1 of the target speech signal determination rules provided by the embodiment of the present application.
  • Figure 8B is a schematic diagram 2 of the target speech signal determination rules provided by the embodiment of the present application.
  • Figure 9 is a schematic diagram 3 of the target speech signal determination rules provided by the embodiment of the present application.
  • Figure 10A is a schematic diagram 4 of the target speech signal determination rules provided by the embodiment of the present application.
  • Figure 10B is a schematic diagram 5 of the target speech signal determination rules provided by the embodiment of the present application.
  • Figure 11 is another exemplary application scenario of the embodiment of the present application.
  • Figure 12 is a schematic diagram of a voice interaction method in some embodiments.
  • Figure 13 is a schematic diagram of a voice interaction method in other embodiments.
  • Figure 14 shows a block diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 15 shows a schematic structural diagram of a system on chip (SOC) provided by an embodiment of the present application.
  • Beamforming technology can determine the direction of a sound source. Beamforming relies on a microphone array: when a sound source makes a sound, the sound signal received by each microphone in the array (i.e., each sound collection channel) arrives with a different delay. Beamforming can use the delay information of each channel to locate the sound source (for example, determine its direction, elevation, and distance).
  • Beamforming technology can also collect sound within a target angle. By applying phase shifting, weighting, and other processing to the sound signals of each channel in the microphone array, beamforming enhances the sound signal within the target angle and suppresses sound signals from other directions (for example, collecting sound only within ±30° of directly in front).
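  • As a rough illustration of the delay-and-sum idea described above, the following sketch time-aligns each microphone channel toward a target direction and averages the channels. The array geometry, sampling rate, and all parameter values are assumptions made for this example, not details from the patent.

```python
import numpy as np

def delay_and_sum(channels, mic_positions, target_angle_deg, fs, c=343.0):
    """Enhance sound arriving from target_angle_deg using delay-and-sum beamforming.

    channels:         array of shape (num_mics, num_samples), one row per microphone
    mic_positions:    array of shape (num_mics, 2), microphone x/y coordinates in meters
    target_angle_deg: look direction (0 degrees = directly in front of the device)
    fs:               sampling rate in Hz
    c:                speed of sound in m/s
    """
    theta = np.deg2rad(target_angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Per-microphone arrival delay (in samples) for a wave from the target direction.
    delays = mic_positions @ direction / c * fs
    delays -= delays.min()  # make all delays non-negative
    aligned = np.empty_like(channels, dtype=float)
    for i, ch in enumerate(channels):
        # Shift each channel so that signals from the target direction line up in time.
        aligned[i] = np.roll(ch, -int(round(delays[i])))
    # Averaging reinforces the aligned (target-direction) signal; sounds from other
    # directions remain misaligned across channels and are attenuated.
    return aligned.mean(axis=0)
```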
  • Voice activity detection (VAD), also called endpoint detection, can distinguish speech signals from non-speech signals in an audio signal and can determine the start point and end point of the speech signal, thereby separating the speech signal from the audio signal. In this way, subsequent speech recognition can be performed only on the speech signal, improving the accuracy of speech recognition.
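  • A minimal energy-based endpoint detector in the spirit of the description above might look like the sketch below. The frame length and energy threshold are illustrative assumptions; practical VAD algorithms are considerably more sophisticated.

```python
import numpy as np

def detect_speech_endpoints(audio, fs, frame_ms=20, energy_threshold=0.01):
    """Return (start_sample, end_sample) of the detected speech segment, or None.

    audio: 1-D float array of samples, roughly in [-1, 1]
    fs:    sampling rate in Hz
    """
    frame_len = int(fs * frame_ms / 1000)
    num_frames = len(audio) // frame_len
    voiced = []
    for k in range(num_frames):
        frame = audio[k * frame_len:(k + 1) * frame_len]
        energy = float(np.mean(frame ** 2))       # short-time energy of this frame
        voiced.append(energy > energy_threshold)  # True if the frame looks like speech
    if not any(voiced):
        return None
    start_frame = voiced.index(True)
    end_frame = len(voiced) - 1 - voiced[::-1].index(True)
    # Convert frame indices back to sample indices: the speech start and end points.
    return start_frame * frame_len, (end_frame + 1) * frame_len
```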
  • the embodiments of the present application are used to provide a voice interaction method, which is used to determine the appropriate target interactive person in a multi-person interaction scenario to meet the user's voice interaction needs.
  • The electronic device can be a smart speaker, an in-vehicle head unit, a large-screen device, a mobile phone, a tablet, a wearable device, a camera, or a device of any other form, as long as it has a voice interaction function.
  • In the following, an intelligent robot (e.g., the Xiaoyi Elf) is taken as an example of the electronic device.
  • FIG. 1 shows an exemplary application scenario of the embodiment of the present application.
  • electronic device 100 (specifically, an intelligent robot) is performing voice interaction (referred to as "interaction") with user A. That is, User A is the current interactor of the electronic device 100 . To facilitate interaction, user A is located directly in front of the electronic device 100 .
  • the interactive content between the electronic device 100 and user A is, for example:
  • the above example is a voice interaction initiated by the user. That is, after User A actively speaks the wake-up word "Xiaoyi Xiaoyi" of the electronic device 100, the electronic device 100 is awakened and starts interacting with User A.
  • The voice interaction may also be an interaction actively initiated by the electronic device 100. For example, when the electronic device 100 observes that User A has been looking at it for a long time (for example, for 10 consecutive seconds), it can play a preset voice (for example, "Do you have any questions to ask me?") to actively initiate voice interaction with User A.
  • user B also exists around the electronic device 100 .
  • User B is a user who has just finished voice interaction with the electronic device 100 (for example, half a minute ago). This application does not limit the reason why User B ended the voice interaction with the electronic device 100. For example, User B stopped replying to the voice played by the electronic device 100, thereby actively ending the voice interaction; alternatively, during the voice interaction with User B, the electronic device 100 heard the wake-up word (for example, "Xiaoyi Xiaoyi") from User A, thereby ending the voice interaction with User B and starting the voice interaction with User A.
  • Since User B has just performed voice interaction with the electronic device 100, User B may still continue to interact with it by voice. That is, User B is a potential interactor of the electronic device 100.
  • When the electronic device 100 performs voice interaction with User A, in order to improve the accuracy of voice collection, the electronic device 100 suppresses sounds from directions other than the direction in which User A is located. For example, the electronic device only collects sounds within ±30° of directly in front while suppressing sounds from other directions. In this way, when User B speaks, the electronic device 100 cannot collect User B's voice and thus cannot perceive User B's interaction needs.
  • embodiments of the present application provide a voice interaction method for determining the target interactive person in a multi-person interaction scenario, so as to improve the user experience in a multi-person interaction scenario.
  • In the embodiments of the present application, the electronic device 100 not only collects the audio signal in the direction of User A (as the "current interactor" or "first user"), but also collects the audio signal in the direction of User B (as the "potential interactor" or "second user").
  • Then, based on the collected audio signals, the electronic device 100 determines whether the time at which User A starts speaking falls within a set priority waiting period (also called the "first period"), and determines the target interactor (the interactor to which the electronic device 100 will respond) between User A and User B according to the determination result. For example, when the time at which User A starts speaking is within the priority waiting period, User A is determined as the target interactor; otherwise, User B may be used as the target interactor.
  • In this way, the target interactor can be determined between User A and User B according to the priority waiting period, the voice interaction needs of the current interactor (for example, User A) and the potential interactor (for example, User B) can be reasonably balanced, and the target interactor can be accurately determined.
  • an intelligent robot is taken as an example of the electronic device 100 .
  • the present application is not limited to this.
  • FIG. 2 shows an exemplary structural diagram of the electronic device 100 provided in this embodiment.
  • the electronic device 100 includes a processor 110 , a camera 120 , a microphone 130 , a speaker 140 , a communication module 150 , a memory 160 and a sensor 170 .
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processing unit (GPU), and an image signal processor. (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural network processor (neural-network processing unit, NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the processor can generate operation control signals based on the instruction opcode and timing signals to complete the control of fetching and executing instructions.
  • Camera 120 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other format image signals.
  • the electronic device 100 may include 1 or N cameras 120, where N is a positive integer greater than 1.
  • The microphone 130, also referred to as a "mike" or "mic", is used to convert sound signals into electrical signals.
  • The electronic device 100 may be provided with multiple (e.g., three, four, or more) microphones 130 to form a microphone array.
  • The microphone array can not only collect sound signals, but also reduce noise, identify the sound source, and implement a directional recording function.
  • The speaker 140, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 140, or listen to hands-free calls.
  • the communication module 150 can provide applications on the electronic device 100 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (bluetooth, BT), and global navigation satellite systems. (global navigation satellite system, GNSS), frequency modulation (FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • Memory 160 may be used to store computer executable program code, which includes instructions.
  • the memory 160 may include a program storage area and a data storage area.
  • the stored program area can store an operating system, at least one application program required for a function (such as a sound playback function, an image playback function, etc.).
  • the storage data area may store data created during use of the electronic device 100 (such as audio data, phone book, etc.).
  • the memory 160 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the memory 160 and/or instructions stored in the memory provided in the processor.
  • The instructions stored in the memory 160 may include instructions that, when executed by at least one of the processors 110, cause the electronic device 100 to implement the voice interaction method provided by the embodiments of the present application.
  • Sensors 170 may include distance sensors and proximity light sensors.
  • Distance sensors are used to measure distance.
  • distance sensors can measure distance via infrared or laser.
  • Proximity light sensors may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diode may be an infrared light emitting diode.
  • the electronic device 100 emits infrared light outwardly through the light emitting diode.
  • Electronic device 100 uses photodiodes to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the electronic device 100 . When insufficient reflected light is detected, the electronic device 100 may determine that there is no object near the electronic device 100 .
  • For example, the electronic device 100 may use the proximity light sensor to detect whether there is a user nearby.
  • In some embodiments, the electronic device may also include a rotating mechanism to implement a turning function. For example, through the rotating mechanism, the electronic device can rotate from an angle facing User A to an angle facing User B.
  • the electronic device may include several functional units.
  • For example, speech-related units such as an ASR algorithm unit, a sound source localization algorithm unit (for example, a beamforming algorithm unit), and a sound source suppression algorithm unit, as well as visual recognition units such as a face recognition algorithm unit.
  • the voice interaction method provided by this embodiment includes the following steps:
  • the electronic device performs voice interaction with user A (also known as the "current interacting person” or “first user”).
  • The electronic device collects the audio signal Audio_A in the angular range of User A and the audio signal Audio_B in the angular range of User B (also called the "potential interactor" or "second user").
  • Figure 4 shows a sequence diagram of voice interaction between the electronic device and user A.
  • the white box is the content spoken by user A
  • the gray box is the content played by the electronic device.
  • the voice interaction process between the electronic device and user A includes an alternating voice playback stage and a voice collection stage.
  • the electronic device plays the device voice; after playing the device voice, the electronic device enters the voice collection stage to monitor the user's voice.
  • When the electronic device determines that the user has finished speaking, it ends the voice collection stage and enters the next voice playback stage.
  • the voice collection phase may be ended.
  • the electronic device can turn off the microphone and not collect external sounds.
  • the electronic device determines whether there is a potential interactive person. If there is a potential interactor, the audio signal in the direction range of the current interactor and the audio signal in the direction range of the potential interactor are collected; if there is no potential interactor, only the audio signal in the direction range of the current interactor is collected.
  • the following takes the current voice collection stage (ie, voice collection stage 3 in Figure 4) as an example to introduce the process of the electronic device collecting the user's voice.
  • the starting time of the current speech collection phase is TS .
  • the process of collecting the user's voice by an electronic device includes the following steps:
  • S111 The electronic device determines that there is a potential interacting person.
  • The set historical period P1 is the period of a second duration T_2 before time T_S. That is, the start time of the historical period P1 is T_S - T_2 and its end time is T_S. In other words, potential interactors are users who have interacted with the electronic device within the most recent second duration; such users have a relatively high probability of performing voice interaction with the electronic device again.
  • This embodiment does not limit the specific value of the second duration.
  • For example, the second duration is 0.5 to 2 minutes, e.g., 0.5 minutes, 1 minute, or 1.3 minutes. It should be noted that numerical ranges in this application include their end values; for example, the range 0.5 to 2 min includes 0.5 min and 2 min.
  • the following uses user B as an example to introduce the method for electronic devices to determine whether there is a potential interactive person.
  • the electronic device determines that user B exists around it.
  • the electronic device may determine that user B exists around it through an image recognition method, or determine that user B exists around it through a sensor (for example, a proximity light sensor).
  • the electronic device determines users within a set distance (eg, within 3m) as users located around it.
  • the electronic device identifies User B's identity. For example, the electronic device identifies User B's identity through face recognition, voiceprint recognition, etc. Then, the electronic device can determine whether user B is a potential interactor by querying its stored voice interaction records. For example, the voice interaction record stores the identities of users who have conducted voice interaction with the electronic device within a recent period of time (for example, within 2 days), as well as the start time, end time, etc. of each voice interaction.
  • For example, the historical period P1 is set to [T_S - 60s, T_S].
  • Suppose the period in which User B most recently interacted with the electronic device is [T_S - 80s, T_S - 40s]. User B has therefore conducted voice interaction with the electronic device within the set historical period P1, so the electronic device determines User B as a potential interactor.
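  • The check just described can be expressed as a small lookup over stored interaction records, as in the sketch below. The record structure and helper names are assumptions made for this illustration; the patent does not define a data format for the voice interaction records.

```python
from dataclasses import dataclass

@dataclass
class InteractionRecord:
    user_id: str
    start_time: float  # seconds (e.g., device uptime)
    end_time: float

def is_potential_interactor(user_id, records, t_s, second_duration=60.0):
    """Return True if user_id interacted with the device within the set
    historical period P1 = [t_s - second_duration, t_s]."""
    p1_start = t_s - second_duration
    for rec in records:
        if rec.user_id != user_id:
            continue
        # The interaction counts if any part of it overlaps the historical period.
        if rec.end_time >= p1_start and rec.start_time <= t_s:
            return True
    return False

# Example matching the text: P1 = [T_S - 60s, T_S] and User B interacted
# during [T_S - 80s, T_S - 40s], so User B is a potential interactor.
t_s = 1000.0
records = [InteractionRecord("user_B", t_s - 80, t_s - 40)]
print(is_potential_interactor("user_B", records, t_s))  # True
```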
  • the electronic device determines the angle range where user A is located (denoted as angle_A) and the angle range where user B is located (denoted as angle_B).
  • FIG. 6 shows a schematic diagram of the angle range angle_A and the angle range angle_B.
  • In Figure 6, the direction directly in front of the electronic device is 0°, and the clockwise direction around the electronic device is the positive direction.
  • In order to collect the voice signals of User A and User B more reliably, the electronic device collects not only the audio signals at the exact locations of User A and User B, but also the audio signals around them. That is, the electronic device collects the audio signals of the angle range angle_A where User A is located and the angle range angle_B where User B is located.
  • This embodiment does not limit the values of c1 and c2.
  • the electronic device collects the audio signal in the angle range angle_A (denoted as Audio_A, as the first audio signal) and the audio signal in the angle range angle_B (denoted as Audio_B, as the second audio signal).
  • the electronic device turns on the microphone and collects audio signals through the microphone array. Afterwards, the electronic device processes the audio signal collected by the microphone array through beamforming algorithm A to obtain audio signal Audio_A; and processes the audio signal collected by the microphone array through beamforming algorithm B to obtain audio signal Audio_B.
  • The target angle of beamforming algorithm A is set to angle_A. Therefore, the audio signal Audio_A is an audio signal in which sound signals outside angle_A have been suppressed; that is, Audio_A is the audio signal within the angle range angle_A.
  • The target angle of beamforming algorithm B is set to angle_B. Therefore, the audio signal Audio_B is an audio signal in which sound signals outside angle_B have been suppressed; that is, Audio_B is the audio signal within the angle range angle_B.
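  • Conceptually, the same microphone-array frames are fed to two beamformers with different look directions, as sketched below using the hypothetical delay_and_sum helper from the beamforming example above. The array layout, sampling rate, and angles are illustrative assumptions only.

```python
import numpy as np

# One mic-array capture processed by two beamformers: one steered at angle_A
# (toward User A) and one at angle_B (toward User B).
fs = 16000
mic_positions = np.array([[0.00, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])  # 4-mic line array
frames = np.random.randn(4, fs)  # stand-in for one second of captured audio

audio_a = delay_and_sum(frames, mic_positions, target_angle_deg=0, fs=fs)    # Audio_A: angle_A
audio_b = delay_and_sum(frames, mic_positions, target_angle_deg=120, fs=fs)  # Audio_B: angle_B
# audio_a suppresses sound outside angle_A; audio_b suppresses sound outside angle_B.
```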
  • The electronic device determines whether the start time of the voice signal in the audio signal Audio_A (denoted Voice_A, as the first voice signal) is within the priority waiting period (also called the "first period"), and based on the determination result determines the target voice signal (the voice signal to which the electronic device will reply) from Voice_A and the voice signal Voice_B (as the second voice signal), where Voice_B is the voice signal contained in the audio signal Audio_B.
  • If User A speaks, the audio signal Audio_A will include the voice signal Voice_A; if User B speaks, the audio signal Audio_B will include the voice signal Voice_B.
  • The electronic device can detect the voice signals Voice_A and Voice_B in the audio signals Audio_A and Audio_B through a voice activity detection (VAD) algorithm, and determine the endpoints (start point and end point) of Voice_A and Voice_B.
  • The electronic device then determines whether the start point of the voice signal Voice_A (also called the start time of Voice_A) is within the priority waiting period, and determines one of the voice signal Voice_A and the voice signal Voice_B as the target voice signal based on the determination result.
  • The first duration T is based on the time within which User A is most likely to speak (for example, 3 s). That is, if User A continues to interact with the electronic device, there is a high probability that User A will speak within the priority waiting period.
  • This embodiment determines the target voice signal according to the priority waiting period, which can be more in line with the user's interaction intention and improve the user experience. For the sake of narrative coherence, the specific method of determining the first duration T will be introduced later.
  • the process of determining the target speech signal by the electronic device includes the following steps:
  • S121: The electronic device determines whether the start time T_AS of the voice signal Voice_A is within the priority waiting period.
  • The electronic device can detect the start time T_AS of the voice signal Voice_A through the VAD algorithm. It can be understood that the start time T_AS of Voice_A is the time when User A starts speaking.
  • After determining the time T_AS, the electronic device compares it with the end time T_E of the priority waiting period. If T_AS is earlier than or equal to T_E, the electronic device determines that the start time T_AS of the voice signal Voice_A is within the priority waiting period and executes step S122; otherwise, the electronic device executes step S123.
  • S122 The electronic device determines the voice signal Voice_A as the target voice signal.
  • Voice_A is determined as the target voice signal (the voice signal to which the electronic device is about to respond). In other words, as long as User A speaks within the priority waiting period, the electronic device determines User A as the target interactor (i.e., keeps User A as the current interactor), so as to give priority to satisfying User A's voice interaction needs.
  • FIG. 8A shows an example of determining the voice signal Voice_A as the target voice signal.
  • The start time T_AS of Voice_A and the start time T_BS of Voice_B are both within the priority waiting period, and T_AS is earlier than T_BS (that is, User A speaks earlier than User B).
  • the electronic device determines the voice signal Voice_A as the target voice signal.
  • the electronic device may stop collecting the audio signal Audio_B (or the voice signal Voice_B).
  • FIG. 8B gives another example of determining the voice signal Voice_A as the target voice signal.
  • The start time T_AS of Voice_A and the start time T_BS of Voice_B are both within the priority waiting period, but T_BS is earlier than T_AS (that is, User B speaks earlier than User A).
  • the electronic device still determines the voice signal Voice_A as the target voice signal.
  • In this case, the electronic device can stop collecting the audio signal Audio_B (or the voice signal Voice_B), and discard the voice signal Voice_B collected before the time T_AS.
  • S123 The electronic device determines whether the voice signal Voice_A and the voice signal Voice_B overlap in time.
  • At least one of the start time T_AS of the voice signal Voice_A and the start time T_BS of the voice signal Voice_B is within the priority waiting period.
  • the electronic device determines the target voice signal based on the temporal overlap state of the voice signal Voice_A and the voice signal Voice_B.
  • If the voice signal Voice_A and the voice signal Voice_B overlap in time, step S124 is executed; if the electronic device determines that the voice signal Voice_A and the voice signal Voice_B do not overlap in time, step S125 is executed.
  • The endpoints of each voice signal (for example, the start time T_AS of the voice signal Voice_A and the end time T_BE of the voice signal Voice_B) can be determined by the voice activity detection (VAD) algorithm.
  • S124 The electronic device determines the voice signal Voice_A as the target voice signal.
  • If the voice signal Voice_A and the voice signal Voice_B overlap in time, the electronic device determines the voice signal Voice_A as the target voice signal. That is to say, when User A starts speaking before User B has finished speaking, User A is still regarded as the target interactor (User A is maintained as the current interactor).
  • Figure 9 gives another example of determining the voice signal Voice_A as the target voice signal.
  • The start time T_BS of Voice_B is within the priority waiting period, and its end time T_BE is outside the priority waiting period.
  • The start time T_AS of Voice_A is outside the priority waiting period, but T_AS is earlier than the end time T_BE of Voice_B (that is, User A starts speaking before User B finishes speaking).
  • the electronic device determines the voice signal Voice_A as the target voice signal.
  • When the electronic device detects the start point of the voice signal Voice_A (i.e., after time T_AS), it can stop collecting the audio signal Audio_B (or the voice signal Voice_B) and discard the data collected before time T_AS.
  • After detecting the start point of the voice signal Voice_A, if it is determined that the voice signal Voice_B has not yet ended (that is, the end point T_BE of Voice_B has not been detected), it can be concluded that the start time T_AS of Voice_A is earlier than the end time T_BE of Voice_B.
  • S125 The electronic device determines the voice signal Voice_B as the target voice signal.
  • the electronic device determines the voice signal Voice_B as the target voice signal. That is to say, when User B has finished speaking and User A has not yet started speaking, the electronic device determines User B as the target interactor (that is, the electronic device switches the current interactor from User A to User B).
  • FIG. 10A shows an example of determining the voice signal Voice_B as the target voice signal.
  • The start time T_BS of Voice_B is within the priority waiting period, and its end time T_BE is outside the priority waiting period.
  • The start time T_AS of Voice_A is later than the end time T_BE of Voice_B (that is, when User B finishes speaking, User A has not yet started speaking).
  • the electronic device determines the voice signal Voice_B as the target voice signal.
  • After the electronic device detects the end of the voice signal Voice_B (i.e., after time T_BE), it can stop collecting audio signals (for example, turn off the microphone and stop collecting the audio signal Audio_A) and respond to the voice signal Voice_B (i.e., execute step S130).
  • FIG. 10B shows an example of determining the voice signal Voice_B as the target voice signal.
  • The start time T_BS and the end time T_BE of Voice_B are both within the priority waiting period.
  • The start time T_AS of Voice_A is outside the priority waiting period.
  • In this case, the electronic device continues to wait until time T_E. If the electronic device determines that User A has still not spoken when time T_E is reached, it determines the voice signal Voice_B as the target voice signal.
  • Afterwards, the electronic device can stop collecting audio signals (for example, turn off the microphone and stop collecting the audio signal Audio_A) and respond to the voice signal Voice_B (i.e., execute step S130).
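  • Taken together, steps S121 to S125 amount to a small decision rule, sketched below. The tuple representation of the voice segments and the example times are assumptions made for illustration, not a data format defined by the patent.

```python
def choose_target(voice_a, voice_b, t_e):
    """Sketch of steps S121-S125: pick the target voice signal.

    voice_a, voice_b: (start_time, end_time) tuples for Voice_A / Voice_B,
                      or None if the corresponding user has not spoken.
    t_e:              end time T_E of the priority waiting period.
    Assumes (as in the text) that at least one start time lies within the period.
    """
    # S121/S122: if User A starts speaking within the priority waiting period,
    # Voice_A is the target signal (User A keeps priority).
    if voice_a is not None and voice_a[0] <= t_e:
        return "Voice_A"
    # S123: otherwise, decide by whether Voice_A and Voice_B overlap in time.
    if voice_a is not None and voice_b is not None:
        a_start, a_end = voice_a
        b_start, b_end = voice_b
        if a_start <= b_end and b_start <= a_end:  # S124: overlap, keep User A
            return "Voice_A"
    # S125: Voice_B ended without overlap (or User A never spoke): switch to User B.
    return "Voice_B"

# Figure 9 case: A starts after T_E but before B finishes, so Voice_A wins.
print(choose_target(voice_a=(5.0, 8.0), voice_b=(2.0, 6.0), t_e=3.0))  # Voice_A
# Figure 10A case: A starts only after B has finished, so Voice_B wins.
print(choose_target(voice_a=(7.0, 9.0), voice_b=(2.0, 6.0), t_e=3.0))  # Voice_B
```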
  • S130 The electronic device responds to the target voice signal.
  • After determining the target voice signal, the electronic device responds to it (that is, replies to the target interactor). Illustratively, the electronic device uploads the target voice signal to a cloud server.
  • The cloud server uses an Automatic Speech Recognition (ASR) algorithm and a Natural Language Processing (NLP) algorithm to perform semantic recognition on the target speech signal and determine the reply text content.
  • the ASR algorithm is a technology used to convert speech into text
  • the NLP algorithm is a technology used to enable electronic devices to "read" human language.
  • After determining the reply text, the cloud server sends the reply text to the electronic device.
  • After receiving the reply text, the electronic device converts it into a voice stream through a text-to-speech (TTS) algorithm and outputs (for example, plays) the voice stream, thereby replying to the target voice signal (i.e., to the target interactor).
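  • The reply flow just described (upload, semantic recognition in the cloud, TTS playback on the device) could be orchestrated roughly as follows. All object and method names are placeholders for this sketch; the patent does not define a specific API.

```python
def respond_to_target_voice(target_voice_signal, cloud, device):
    """Sketch of step S130: reply to the target interactor.

    cloud:  placeholder object with asr() and nlp_reply() methods (the cloud server)
    device: placeholder object with tts() and play() methods (the electronic device)
    """
    # 1. Upload the target voice signal and convert it to text (ASR).
    text = cloud.asr(target_voice_signal)
    # 2. Determine the reply text from the recognized text (NLP / dialogue logic).
    reply_text = cloud.nlp_reply(text)
    # 3. Convert the reply text to a voice stream (TTS) and play it.
    voice_stream = device.tts(reply_text)
    device.play(voice_stream)
```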
  • the electronic device can turn off the microphone and not collect audio signals.
  • the electronic device can also determine the reply text content through the local ASR algorithm and NLP algorithm.
  • the response methods of electronic devices can also include expressions, actions, etc. For example, after the electronic device determines that the target interactor is user B, it turns to face user B to improve the intelligence and anthropomorphism of the electronic device.
  • In summary, this embodiment provides a voice interaction method that determines the target interactor according to the priority waiting period; it can reasonably take into account the voice interaction needs of the current interactor (for example, User A) and the potential interactor (for example, User B), improving the user experience in multi-person interaction scenarios.
  • As long as User A (i.e., the current interactor) speaks within the priority waiting period, the electronic device determines User A as the target interactor, so as to satisfy User A's voice interaction needs first. If User A does not speak within the priority waiting period, User A's willingness to interact is considered low, so User B (i.e., the potential interactor) may be determined as the target interactor to take User B's voice interaction needs into account.
  • In addition, if User A does not speak within the priority waiting period, the target interactor is determined according to the temporal overlap state of the voice signal Voice_A and the voice signal Voice_B; for example, when the two do not overlap, User B is determined as the target interactor. In this way, the target interactor can be determined relatively quickly.
  • this embodiment can also improve the priority of potential interactors.
  • The above description assumes that at least one of the start time T_AS of the voice signal Voice_A and the start time T_BS of the voice signal Voice_B falls within the priority waiting period.
  • Otherwise, the electronic device can continue to monitor the voice signal Voice_A. If the start point of the voice signal Voice_A is detected within a set time (for example, before T_S + 8s), the voice signal Voice_A is used as the target voice signal; otherwise, the electronic device ends the current voice interaction.
  • It should be noted that the wake-up word may directly cause switching of the target interactor; the above discussion therefore concerns the case in which User B's voice (i.e., the voice signal Voice_B) does not include the wake-up word of the electronic device.
  • When there are multiple potential interactors (for example, User B and User C), the electronic device can select one of them as a tentative interactor, and determine the target interactor between the tentative interactor and User A according to the method described in step S120.
  • This application does not limit the method of selecting the tentative interactor.
  • For example, the one of User B and User C who most recently performed voice interaction with the electronic device is selected as the tentative interactor, or the one who speaks first in the current voice collection stage is selected as the tentative interactor.
  • the following describes the method for determining the first duration T provided in this embodiment.
  • The first duration T is determined based on the time within which User A is most likely to speak. That is, assuming that User A is willing to continue the interaction, User A will speak within the first duration at the latest (that is, before time T_E).
  • the first duration is determined based on User A's interaction willingness value P and User A's interaction rounds M with the electronic device within the set time period.
  • the interaction willingness value P is used to characterize user A's interaction willingness with the electronic device. The greater the interaction willingness value P, the greater the possibility that user A will engage in voice interaction with the electronic device.
  • Illustratively, the first duration T = k1 × P + k2 × min{M, n}, where k1 and k2 are preset constants, and n is an integer between 3 and 6. A detailed introduction is given below.
  • Interaction willingness value P is a value between 0 and 1.
  • the interaction willingness value P can also be other values, for example, a value between 1 and 5.
  • The interaction willingness value P is determined based on User A's facial angle and/or the distance D between User A and the electronic device.
  • User A's facial angle indicates the degree to which User A's face is aligned with the electronic device. When the user's face directly faces the front of the electronic device, the facial angle is 0°; the more User A's face is turned away from the electronic device, the larger the facial angle. Generally, the smaller User A's facial angle, the greater User A's interaction willingness value P.
  • User A's facial angle can be determined through image recognition.
  • The distance D between User A and the electronic device can be determined through image recognition, distance-sensor ranging, sound source localization, or the like. For example, when the distance D between User A and the electronic device is within a set range (for example, 0.5 to 1 times the height of the electronic device), User A is considered to have a higher interaction willingness value P; the more D deviates from this range, the smaller User A's interaction willingness value P.
  • For example, the interaction willingness value P is a weighted sum of the facial angle and the distance D.
  • the weights of the facial angle ⁇ and the distance D may be constants determined based on experience.
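  • As an illustration of this weighted-sum option, the sketch below maps the two measurements to a value in [0, 1]. The weights, the normalization, and the preferred distance range are all assumptions made for this example; the patent only states that P is determined from the facial angle and/or the distance.

```python
def interaction_willingness(face_angle_deg, distance_m, device_height_m=1.0,
                            w_angle=0.6, w_distance=0.4):
    """Map facial angle and distance to a willingness value P in [0, 1].

    face_angle_deg: 0 means the user directly faces the device; larger means turned away.
    distance_m:     distance D between the user and the device.
    """
    # Facial-angle term: 1.0 when facing the device, decreasing toward 0 at 90 degrees.
    angle_score = max(0.0, 1.0 - abs(face_angle_deg) / 90.0)
    # Distance term: 1.0 inside the assumed preferred range
    # (0.5 to 1 times the device height), decreasing as D deviates from it.
    low, high = 0.5 * device_height_m, 1.0 * device_height_m
    if low <= distance_m <= high:
        distance_score = 1.0
    else:
        deviation = min(abs(distance_m - low), abs(distance_m - high))
        distance_score = max(0.0, 1.0 - deviation / device_height_m)
    return w_angle * angle_score + w_distance * distance_score

print(round(interaction_willingness(face_angle_deg=10, distance_m=0.8), 2))  # 0.93
```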
  • the interaction willingness value P may be determined through an AI algorithm.
  • A pre-trained AI model can be stored in the electronic device; the AI model represents the mapping relationship between the facial angle, the distance D, and the interaction willingness value P.
  • After the electronic device measures the facial angle and the distance D, it can calculate the interaction willingness value P through the AI model.
  • The interaction round count M is the number of rounds of interaction between User A and the electronic device within the set time period P2.
  • completing a question and answer session between the electronic device and the user is counted as one interaction round. For example, user A asks "Do you have any favorite animals?" and the electronic device replies "I like furry animals, they look so warm.” This is counted as one interaction round.
  • the interaction round M can reflect the frequency of interaction between user A and the electronic device.
  • The higher the interaction frequency (that is, the larger the value of M), the greater the possibility that User A will continue to interact.
  • The set time period P2 is the period of a third duration before time T_S. That is, M is the number of rounds of interaction between User A and the electronic device within the most recent third duration. In this way, M can more accurately characterize the possibility of User A continuing to interact with the electronic device.
  • the third duration is 0.5-2 minutes, for example, 1 minute.
  • M is the number of rounds of continuous interaction between User A and the electronic device within the set time period P2. Continuous interaction means that no other user intervenes during the interaction between User A and the electronic device. If other users intervene, the interaction round M is recalculated from 0.
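  • A trivial way to maintain such a continuous-interaction counter is sketched below; this is purely illustrative, as the patent does not prescribe how the count is stored or updated.

```python
def update_interaction_rounds(m, speaker_id, current_user_id):
    """Update the continuous-interaction round count M for the current user.

    m:               current round count for current_user_id
    speaker_id:      the user who just completed a question-and-answer round
    current_user_id: the user whose rounds are being counted (e.g., User A)
    """
    if speaker_id == current_user_id:
        return m + 1   # one more completed round of continuous interaction
    return 0           # another user intervened: restart counting from 0

m = 0
for speaker in ["user_A", "user_A", "user_C", "user_A"]:
    m = update_interaction_rounds(m, speaker, "user_A")
print(m)  # 1: the count was reset when user_C intervened
```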
  • k1 and k2 are the weights of the interaction willingness value P and the interaction round count M, respectively, and are used to adjust the contributions of P and M to the first duration T.
  • Since the interaction willingness value P is a value between 0 and 1 while M is a value greater than 1, k1 is set greater than k2. For example, k1 is 3 to 5 times k2.
  • n is used to define the upper limit of the first duration T.
  • The interaction-round term is set to min{M, n} to limit the upper bound of the first duration T.
  • the method for determining the first duration T is introduced above.
  • the first duration is determined based on the interaction willingness value P and the interaction round M, so that user A's interaction possibility can be accurately predicted.
  • The greater the possibility that User A will interact, the longer the electronic device's priority waiting time for User A, so that User A's voice interaction needs can be reasonably met.
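  • Putting the pieces together, the first duration can be computed directly from the formula given earlier. The concrete values of k1, k2, and n below are illustrative assumptions chosen to satisfy the constraints stated in the text (k1 several times k2, n between 3 and 6).

```python
def first_duration(p, m, k1=3.0, k2=1.0, n=4):
    """First duration T = k1 * P + k2 * min(M, n), in seconds.

    p: interaction willingness value, assumed to lie in [0, 1]
    m: number of continuous interaction rounds in the set time period P2
    n: caps the contribution of the round count, bounding T from above
    """
    return k1 * p + k2 * min(m, n)

# A user facing the device who has already completed several rounds gets a longer
# priority waiting period than a hesitant newcomer.
print(first_duration(p=0.93, m=5))   # 3.0 * 0.93 + 1.0 * 4 = 6.79 seconds
print(first_duration(p=0.20, m=1))   # 3.0 * 0.20 + 1.0 * 1 = 1.6 seconds
```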
  • the method for determining the first duration T is introduced above. However, this application is not limited to this.
  • the first duration T is dynamically adjusted according to the interaction willingness value P and the interaction round M.
  • the first duration T may be a fixed value determined based on experience, for example, the first duration is 3 seconds. This embodiment can simplify the determination process of the first duration T and reduce calculation overhead.
  • the voice interaction method provided by the embodiments of the present application can reasonably select the target interactive person in a multi-person interaction scenario.
  • The following compares this method with the sound pickup approaches provided by other embodiments.
  • Figure 12 illustrates one embodiment. Specifically, Figure 12 provides a sound pickup method and device.
  • the initial pickup beam points to the target sound source.
  • The direction of the pickup beam is dynamically adjusted to ensure that the target sound source and the pickup beam point in the same direction, thereby attenuating or shielding sound signals from other noise sources.
  • The implementation shown in Figure 12 dynamically adjusts the pickup beam direction according to the detected direction once the target sound source has been identified. However, it does not solve the problem of how a robot should choose its pickup direction in a multi-person interaction scenario, or how to select the target interactor among multiple users.
  • Figure 13 shows another embodiment. Specifically, Figure 13 provides a sound pickup method including the following steps: in step S1001, when the received sound-source angle is within a preset angle, the faces currently captured by the camera are obtained; in step S1002, among these faces, the face whose angle is closest to the sound source is selected as the speaker, i.e., that face is tracked so as to track the current speaker; finally, in step S1003, the robot's angle is adjusted so that the center of the speaker's face falls at the robot's front-center position, to facilitate responding to the speaker's voice signal.
  • In contrast, the voice interaction method in this embodiment can reasonably take into account the voice interaction needs of the current interactor and potential interactors in a multi-person interaction scenario and accurately determine the target interactor, thereby improving the user experience in multi-person interaction scenarios.
  • Electronic device 400 may include one or more processors 401 coupled to controller hub 403 .
  • The controller hub 403 communicates with the processor 401 via a multi-drop bus such as a Front Side Bus (FSB), a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection 406.
  • Processor 401 executes instructions that control general types of data processing operations.
  • the controller hub 403 includes, but is not limited to, a Graphics & Memory Controller Hub (GMCH) (not shown) and an Input Output Hub (IOH) (which can be on a separate chip) (not shown), where the GMCH includes memory and graphics controller and is coupled to the IOH.
  • Electronic device 400 may also include a coprocessor 402 and memory 404 coupled to controller hub 403 .
  • Alternatively, one or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 directly coupled to the processor 401, and the controller hub 403 and the IOH in a single chip.
  • the memory 404 may be, for example, a dynamic random access memory (DRAM), a phase change memory (PCM), or a combination of the two.
  • Memory 404 may include one or more tangible, non-transitory computer-readable media for storage of data and/or instructions.
  • the computer-readable storage medium stores instructions, and specifically, temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the methods shown in FIG. 3 , FIG. 5 , and FIG. 7 . When the instructions are run on the computer, the computer is caused to perform the methods disclosed in the above embodiments.
  • In one embodiment, the coprocessor 402 is a special-purpose processor, such as a high-throughput Many Integrated Core (MIC) processor, a network or communications processor, a compression engine, a graphics processor, a general-purpose computing on graphics processing units (GPGPU) processor, or an embedded processor. The optional nature of the coprocessor 402 is represented by dashed lines in Figure 14.
  • the electronic device 400 may further include a network interface (Network Interface Controller, NIC) 406.
  • Network interface 406 may include a transceiver for providing a radio interface for electronic device 400 to communicate with any other suitable devices (such as front-end modules, antennas, etc.).
  • network interface 406 may be integrated with other components of electronic device 400 .
  • the network interface 406 can implement the functions of the communication unit in the above embodiment.
  • the electronic device 400 may further include an input/output (I/O) device 405.
  • The I/O device 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information related to the electronic device 400.
  • It should be understood that Figure 14 is only exemplary. That is, although Figure 14 shows the electronic device 400 as including multiple components such as the processor 401, the controller hub 403, and the memory 404, in actual applications a device using the methods of the present application may include only some of these components, for example, only the processor 401 and the network interface 406. The optional nature of components is shown with dashed lines in Figure 14.
  • SoC 500 includes: interconnect unit 550, which is coupled to processor 510; system agent unit 580; bus controller unit 590; integrated memory controller unit 540; one or more co-processors 520 , which may include integrated graphics logic, image processor, audio processor and video processor; static random access memory (Static Random access Memory, SRAM) unit 530; direct memory access (Direct Memory Access, DMA) unit 560.
  • coprocessor 520 includes a special purpose processor such as, for example, a network or communications processor, a compression engine, general-purpose computing on graphics processing units (GPGPU), a high-throughput MIC processor, or embedded processor, etc.
  • Static random access memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, and in particular, temporary and permanent copies of those instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the methods shown in Figures 3, 5, and 7. When the instructions are run on a computer, the computer is caused to perform the methods disclosed in the above embodiments.
  • "A and/or B" can mean any of three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • Program code may be applied to input instructions to perform the functions described herein and to generate output information.
  • Output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
  • Program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • assembly language or machine language can also be used to implement program code.
  • the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
  • an instruction converter may be used to convert instructions from a source instruction set to a target instruction set.
  • the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core.
  • the instruction converter can be implemented in software, hardware, firmware, or a combination thereof.
  • the instruction converter may be on processor, off processor, or partly on and partly off processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application provides a voice interaction method, an electronic device, and a computer-readable storage medium. The voice interaction method includes: performing voice interaction with a first user and, during a voice collection period of the voice interaction, collecting a first audio signal from the angular range in which the first user is located and a second audio signal from the angular range in which a second user is located, the second user being a user who performed voice interaction with the electronic device within a set historical period; determining whether the start time of a first voice signal in the first audio signal falls within a first period, and determining a target voice signal from the first voice signal and a second voice signal according to the result, the second voice signal being the voice signal contained in the second audio signal, and the first period being the period of a first duration after the start time of the voice collection period; and responding to the target voice signal. This application can accurately determine the target interacting person in multi-user interaction scenarios.

Description

Voice interaction method, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202210430483.3, entitled "Voice interaction method, electronic device, and storage medium" and filed with the Chinese Patent Office on April 22, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of software technology, and in particular, to a voice interaction method, an electronic device, and a computer-readable storage medium.
Background
With the development of audio processing technology and artificial intelligence (AI), more and more electronic devices (for example, smart speakers and smart robots) provide voice interaction functions. During voice interaction, the electronic device needs to collect the user's voice. To improve collection accuracy, the electronic device picks up sound in a target direction (i.e., the direction of the current interacting person) and suppresses sound from other directions, so as to reduce interference from environmental noise in the user's voice signal.
In some cases, the electronic device may be in a multi-user interaction scenario. That is, besides the current interacting person, there are other users around the device (referred to as "potential interacting persons") who may interact with it by voice. Because the device collects voice only from the direction of the current interacting person, the voice of a potential interacting person is suppressed. When a potential interacting person speaks, the device cannot perceive the speech and therefore cannot respond to it.
Summary of the Invention
Some embodiments of the present application provide a voice interaction method, an electronic device, and a computer-readable storage medium. The present application is described below from several aspects; the embodiments and beneficial effects of these aspects may be cross-referenced.
In a first aspect, an embodiment of the present application provides a voice interaction method for an electronic device. The method includes: performing voice interaction with a first user and, during a voice collection period of the voice interaction, collecting a first audio signal from the angular range in which the first user is located and a second audio signal from the angular range in which a second user is located, the second user being a user who performed voice interaction with the electronic device within a set historical period; determining whether the start time of a first voice signal in the first audio signal falls within a first period, and determining a target voice signal from the first voice signal and a second voice signal according to the determination result, where the second voice signal is the voice signal contained in the second audio signal and the first period is the period of a first duration after the start time of the voice collection period; and responding to the target voice signal.
According to the embodiments of the present application, the target interacting person can be determined between the first user and the second user based on the first duration, which reasonably balances the voice interaction needs of both users, accurately determines the target interacting person, and thus improves the user experience in multi-user interaction scenarios.
In some embodiments, at least one of the start time of the first voice signal and the start time of the second voice signal falls within the first period. Determining the target voice signal from the first and second voice signals according to the determination result includes: if the start time of the first voice signal falls within the first period, determining the first voice signal as the target voice signal; otherwise, determining the target voice signal according to the temporal overlap between the second voice signal and the first voice signal.
According to the embodiments of the present application, as long as the first user starts speaking within the first period, the electronic device determines the first user as the target interacting person (i.e., keeps the first user as the current interacting person), giving priority to the first user's voice interaction needs.
If the first user does not start speaking within the first period, the first user's willingness to interact is considered low, and the second user may be determined as the target interacting person, so that the second user's voice interaction needs are also taken into account.
In some embodiments, determining the target voice signal according to the temporal overlap between the second voice signal and the first voice signal includes: if the second voice signal and the first voice signal overlap in time, determining the first voice signal as the target voice signal; if they do not overlap in time, determining the second voice signal as the target voice signal.
In some embodiments, the first duration is determined according to an interaction willingness value P of the first user and/or the number of interactions M between the first user and the electronic device within a set time period, where the interaction willingness value P characterizes the first user's willingness to interact with the electronic device by voice.
In some embodiments, the interaction willingness value P is determined according to the facial angle of the first user and/or the distance between the first user and the electronic device.
In some embodiments, the first duration is k1×P+k2×min{M,n}, where k1 and k2 are preset constants and n is an integer between 3 and 6.
In some embodiments, the second voice signal does not contain the wake-up word of the electronic device.
In some embodiments, the set historical period is the period of a second duration before the start time of the voice collection period.
In a second aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor which, when executing the instructions in the memory, causes the electronic device to perform the voice interaction method provided by any embodiment of the first aspect. For the beneficial effects achievable by the second aspect, reference may be made to those of any embodiment of the first aspect, which are not repeated here.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the voice interaction method provided by any embodiment of the first aspect. For the beneficial effects achievable by the third aspect, reference may be made to those of any embodiment of the first aspect, which are not repeated here.
Brief Description of the Drawings
FIG. 1 is an exemplary application scenario of an embodiment of the present application;
FIG. 2 is an exemplary structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 3 is an exemplary flowchart of a voice interaction method provided by an embodiment of the present application;
FIG. 4 is a schematic timing diagram of voice interaction between an electronic device and the current interacting person according to an embodiment of the present application;
FIG. 5 is an exemplary flowchart of a user voice collection process provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the angular ranges in which users are located according to an embodiment of the present application;
FIG. 7 is an exemplary flowchart of a method for determining a target voice signal provided by an embodiment of the present application;
FIG. 8A is a first schematic diagram of the rules for determining the target voice signal according to an embodiment of the present application;
FIG. 8B is a second schematic diagram of the rules for determining the target voice signal according to an embodiment of the present application;
FIG. 9 is a third schematic diagram of the rules for determining the target voice signal according to an embodiment of the present application;
FIG. 10A is a fourth schematic diagram of the rules for determining the target voice signal according to an embodiment of the present application;
FIG. 10B is a fifth schematic diagram of the rules for determining the target voice signal according to an embodiment of the present application;
FIG. 11 is another exemplary application scenario of an embodiment of the present application;
FIG. 12 is a schematic diagram of a voice interaction method in some embodiments;
FIG. 13 is a schematic diagram of a voice interaction method in other embodiments;
FIG. 14 shows a block diagram of an electronic device provided by an embodiment of the present application;
FIG. 15 shows a schematic structural diagram of a system on chip (SoC) provided by an embodiment of the present application.
Detailed Description
Specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
For ease of understanding, the audio processing techniques that may be involved in the present application are introduced first.
(1) Beamforming: beamforming can determine the direction of a sound source. It relies on a microphone array. When a sound source emits sound, the sound signals received by the individual microphones of the array (i.e., the individual sound collection channels) arrive with different delays, and beamforming can locate the sound source from the per-channel delay information (for example, determine the azimuth, elevation, and distance of the source).
Beamforming can also collect sound within a target angle. By phase-shifting and weighting the signals of the individual channels of the microphone array, beamforming enhances the sound signal within the target angle and suppresses sound signals from other directions, so that sound is collected within the target angle (for example, within ±30° directly in front of the electronic device).
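As an illustration of delay-and-sum beamforming, the following Python sketch steers a uniform linear microphone array toward a target angle. The array geometry, microphone spacing, sampling rate, and the integer-sample delay approximation are assumptions made for the example; a real device would typically use a more elaborate (for example, adaptive) beamformer.

```python
import numpy as np

def delay_and_sum(mics: np.ndarray, angle_deg: float,
                  spacing_m: float = 0.04, fs: int = 16000,
                  speed_of_sound: float = 343.0) -> np.ndarray:
    """Steer a uniform linear microphone array toward `angle_deg`.

    `mics` has shape (n_channels, n_samples). Signals arriving from the
    steering angle are time-aligned and summed (enhanced); signals from
    other directions add incoherently and are attenuated.
    """
    n_ch, n_samp = mics.shape
    theta = np.deg2rad(angle_deg)
    out = np.zeros(n_samp)
    for ch in range(n_ch):
        # Propagation delay of this channel relative to the first microphone,
        # rounded to a whole number of samples (a coarse approximation).
        delay = ch * spacing_m * np.sin(theta) / speed_of_sound
        shift = int(round(delay * fs))
        aligned = np.zeros(n_samp)
        if shift >= 0:
            aligned[:n_samp - shift] = mics[ch, shift:]
        else:
            aligned[-shift:] = mics[ch, :n_samp + shift]
        out += aligned
    return out / n_ch
```

With two calls, one steered to angle_A and one to angle_B, the same array frame yields the two direction-selective signals used later in this description.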
(2) Voice activity detection (VAD), also called "speech boundary detection" or "endpoint detection". VAD distinguishes voice signals from non-voice signals in an audio signal and determines the start and end points of a voice signal, so that the voice signal can be separated from the audio signal. Subsequent speech recognition can then be performed only on the voice signal, which improves recognition accuracy.
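The following is a minimal, energy-threshold VAD sketch in Python. The frame length and threshold are illustrative assumptions; practical VAD algorithms usually rely on spectral features or learned models rather than a single energy threshold.

```python
import numpy as np

def energy_vad(audio: np.ndarray, fs: int = 16000,
               frame_ms: int = 20, threshold: float = 0.02):
    """Return (start_time, end_time) in seconds of detected speech, or None.

    A frame is marked as speech when its RMS energy exceeds `threshold`;
    the voice segment spans from the first to the last speech frame.
    """
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    speech_frames = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > threshold:
            speech_frames.append(i)
    if not speech_frames:
        return None
    start = speech_frames[0] * frame_len / fs
    end = (speech_frames[-1] + 1) * frame_len / fs
    return start, end
```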
The embodiments of the present application provide a voice interaction method for determining an appropriate target interacting person in multi-user interaction scenarios, so as to meet users' voice interaction needs.
In the present application, the electronic device may be a device of any form, such as a smart speaker, an in-vehicle head unit, a large-screen device, a mobile phone, a tablet, a wearable device, or a camera, as long as it has a voice interaction function. In the following, a smart robot (for example, Xiaoyi Genie) is used as an example of the electronic device.
FIG. 1 shows an exemplary application scenario of an embodiment of the present application. In FIG. 1, the electronic device 100 (specifically, a smart robot) is conducting voice interaction ("interaction" for short) with user A. That is, user A is the current interacting person of the electronic device 100. For ease of interaction, user A is located directly in front of the electronic device 100. The interaction between the electronic device 100 and user A is, for example:
User A: "Xiaoyi, Xiaoyi";
Electronic device: "I'm here";
User A: "Do you have a favorite animal?";
Electronic device: "I like furry animals, they look so warm";
User A: "Then you must like this panda toy";
Electronic device: "Well, I like it even more than I like you";
User A: "xxxxx...".
The above example is a voice interaction initiated by the user. That is, after user A says the wake-up word "Xiaoyi, Xiaoyi" of the electronic device 100, the electronic device 100 is woken up and starts interacting with user A. In other examples, the voice interaction may also be initiated by the electronic device 100. For example, when the electronic device 100 observes that user A has been gazing at it for a long time (for example, 10 s continuously), it may play a preset voice prompt (for example, "Is there anything you would like to ask me?") to actively initiate voice interaction with user A.
Still referring to FIG. 1, there is also a user B around the electronic device 100. User B is a user who has just (for example, half a minute ago) ended a voice interaction with the electronic device 100. The present application does not limit the reason why user B's interaction with the electronic device 100 ended. For example, user B may not have replied to the voice played by the electronic device 100, thereby actively ending the interaction; or, while interacting with user B, the electronic device 100 may have detected the wake-up word (for example, "Xiaoyi, Xiaoyi") from user A, thereby ending the interaction with user B and starting the interaction with user A.
Since user B has just interacted with the electronic device 100 by voice, user B may still continue to interact with it. That is, user B is a potential interacting person of the electronic device 100. However, in some embodiments, while interacting with user A, the electronic device 100 suppresses sound from directions other than that of user A in order to improve collection accuracy. For example, the device collects only sound within ±30° directly in front of it and suppresses sound from other directions. As a result, when user B speaks, the electronic device 100 cannot collect user B's voice and therefore cannot perceive user B's interaction needs.
To this end, an embodiment of the present application provides a voice interaction method for determining the target interacting person in multi-user interaction scenarios, so as to improve the user experience in such scenarios. Specifically, during the voice interaction with user A, the electronic device 100 collects not only the audio signal from the direction of user A (the "current interacting person" or "first user") but also the audio signal from the direction of user B (the "potential interacting person" or "second user"). Based on the collected audio signals, the electronic device 100 determines whether the time at which user A starts speaking falls within a set priority waiting period (also called the "first period"), and determines the target interacting person (the interacting person the device is about to answer) between user A and user B according to the result. For example, if the time at which user A starts speaking falls within the priority waiting period, user A is determined as the target interacting person; otherwise, user B may become the target interacting person.
In the present application, the target interacting person can be determined between user A and user B according to the priority waiting period, which reasonably balances the voice interaction needs of the current interacting person (for example, user A) and the potential interacting person (for example, user B), accurately determines the target interacting person, and thus improves the user experience in multi-user interaction scenarios.
Specific embodiments of the present application are described below. In the following embodiments, a smart robot is used as an example of the electronic device 100. It should be understood, however, that the present application is not limited thereto.
FIG. 2 shows an exemplary structural diagram of the electronic device 100 provided in this embodiment. Referring to FIG. 2, the electronic device 100 includes a processor 110, a camera 120, a microphone 130, a speaker 140, a communication module 150, a memory 160, and a sensor 170.
It should be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or may be integrated into one or more processors. The processor may generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and execution.
The camera 120 is used to capture still images or video. An object is projected through the lens as an optical image onto a photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the electronic device 100 may include 1 or N cameras 120, where N is a positive integer greater than 1.
The microphone 130, also called a "mic", converts sound signals into electrical signals. The electronic device 100 may be provided with multiple (for example, three, four, or more) microphones 130 to form a microphone array. As a voice front-end device, the microphone array can, in addition to collecting sound signals, perform noise reduction, identify the sound source, and implement directional recording, among other functions.
The speaker 140, also called a "loudspeaker", converts audio electrical signals into sound signals. The electronic device 100 can play music or hands-free calls through the speaker 140.
The communication module 150 can provide wireless communication solutions applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. Through the communication module 150, the electronic device can communicate with other devices (for example, a cloud server).
The memory 160 may be used to store computer-executable program code, which includes instructions. The memory 160 may include a program storage area and a data storage area. The program storage area may store an operating system and applications required by at least one function (for example, a sound playback function or an image playback function). The data storage area may store data created during use of the electronic device 100 (for example, audio data and a phone book). In addition, the memory 160 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or universal flash storage (UFS). By running the instructions stored in the memory 160 and/or instructions stored in a memory provided in the processor, the processor 110 performs the various functional applications and data processing of the electronic device 100. The instructions stored in the memory 160 may include instructions that, when executed by at least one of the processors 110, cause the electronic device 100 to implement the voice interaction method provided by the embodiments of the present application.
The sensor 170 may include a distance sensor and a proximity light sensor. The distance sensor is used to measure distance, for example by infrared or laser. The proximity light sensor may include, for example, a light-emitting diode (LED) and a photodetector such as a photodiode. The light-emitting diode may be an infrared LED. The electronic device 100 emits infrared light outward through the LED and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the electronic device 100 can determine that there is an object nearby; when insufficient reflected light is detected, it can determine that there is no object nearby. The electronic device 100 can use the proximity light sensor to detect whether there is a user nearby.
In addition, the electronic device may include a rotation mechanism to implement a servo turning function. For example, through the rotation mechanism, the electronic device can turn from facing user A to facing user B.
The electronic device may also include several functional units, for example, speech recognition units such as an ASR algorithm unit, a sound source localization algorithm unit (for example, a beamforming algorithm unit), and a sound source suppression algorithm unit, as well as visual recognition units such as a face recognition algorithm unit.
The specific flow of the voice interaction method provided by this embodiment is described below with reference to the scenario shown in FIG. 1. Referring to FIG. 3, the voice interaction method provided by this embodiment includes the following steps:
S110: The electronic device conducts voice interaction with user A (also called the "current interacting person" or "first user"). In the voice collection stage of this interaction, the electronic device collects the audio signal Audio_A from the angular range in which user A is located and the audio signal Audio_B from the angular range in which user B (also called the "potential interacting person" or "second user") is located.
FIG. 4 shows a timing diagram of the voice interaction between the electronic device and user A, where the white boxes contain what user A says and the grey boxes contain what the electronic device plays. Referring to FIG. 4, the interaction consists of alternating voice playback stages and voice collection stages. In a playback stage, the electronic device plays the device voice; after playback finishes, the device enters a collection stage to monitor the user's voice. When the device determines that the user has finished speaking, it ends the collection stage and enters the next playback stage. In other embodiments, the device may end the collection stage when no user speech is detected within a set time (for example, 8 s). Typically, during the playback stage, the electronic device may turn off the microphone and not collect external sound.
In this embodiment, in each voice collection stage the electronic device determines whether a potential interacting person exists. If one exists, the device collects the audio signal from the direction range of the current interacting person and the audio signal from the direction range of the potential interacting person; if not, it collects only the audio signal from the direction range of the current interacting person.
The process by which the electronic device collects user voice is described below, taking the current voice collection stage (i.e., voice collection stage 3 in FIG. 4) as an example. The start time of the current voice collection stage is TS. Referring to FIG. 5, the process includes the following steps:
S111: The electronic device determines that a potential interacting person exists.
A potential interacting person is a user who interacted with the electronic device by voice within a set historical period P1. Having already interacted with the device, such a user may interact with it again. In this embodiment, the set historical period P1 is the period of a second duration T2 before time TS; that is, P1 starts at TS-T2 and ends at TS. In other words, a potential interacting person is a user who interacted with the device within the most recent second duration, and such a user has a relatively high probability of interacting with the device again. This embodiment does not limit the specific value of the second duration. In some examples, the second duration is 0.5 to 2 min, for example, 0.5 min, 1 min, or 1.3 min. It should be noted that numerical ranges in this application include their endpoints; for example, the range 0.5 to 2 min includes 0.5 min and 2 min.
The method by which the electronic device determines whether a potential interacting person exists is described below using user B as an example. While interacting with user A, the electronic device determines that user B is present nearby. For example, the device may determine this through image recognition or through a sensor (for example, the proximity light sensor). In some examples, the electronic device regards users within a set distance (for example, within 3 m) as users in its vicinity.
Further, the electronic device identifies user B, for example, by face recognition or voiceprint recognition. The device can then determine whether user B is a potential interacting person by consulting its stored voice interaction records. Illustratively, the records store the identities of users who interacted with the device within a recent period (for example, within 2 days), as well as the start and end times of each interaction.
In this embodiment, the set historical period P1 is [TS-60s, TS]. According to the interaction records, the most recent period in which user B interacted with the device is [TS-80s, TS-40s]. User B therefore interacted with the device within the set historical period P1, and the electronic device determines that user B is a potential interacting person.
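A possible sketch of the record lookup described above is given below. The record format and the relative time scale (times measured in seconds with TS = 0) are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class InteractionRecord:
    user_id: str
    start_s: float  # interaction start, on the device clock
    end_s: float    # interaction end

def is_potential_interactor(records, user_id: str, ts: float, t2: float = 60.0) -> bool:
    """True if `user_id` interacted with the device inside [ts - t2, ts]."""
    window_start = ts - t2
    return any(r.user_id == user_id and r.end_s >= window_start and r.start_s <= ts
               for r in records)

# Example from the description: user B interacted during [TS-80s, TS-40s],
# which overlaps the window [TS-60s, TS], so user B is a potential interactor.
records = [InteractionRecord("user_B", start_s=-80.0, end_s=-40.0)]
print(is_potential_interactor(records, "user_B", ts=0.0, t2=60.0))  # True
```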
S112: The electronic device determines the angular range in which user A is located (denoted angle_A) and the angular range in which user B is located (denoted angle_B).
FIG. 6 shows a schematic diagram of the angular ranges angle_A and angle_B. Illustratively, the direction directly in front of the electronic device is 0°, and the clockwise direction around the device is positive. The device can locate user A and user B by visual positioning, distance-sensor ranging, sound source localization, and the like. After locating them, the device can determine that the angle of user A is α (user A is directly in front of the device, so α = 0°) and the angle of user B is β (for example, 75°). In other embodiments, user A may not face the device directly, in which case α may take other values, for example, α = -12°.
Considering positioning errors, sound diffusion, and similar factors, the electronic device collects audio signals not only at the exact positions of user A and user B but also around them, so as to capture their voice signals more reliably. That is, the device collects the audio signals of the angular range angle_A in which user A is located and the angular range angle_B in which user B is located. In this embodiment, angle_A = α ± c1 and angle_B = β ± c2. The values of c1 and c2 are not limited in this embodiment; illustratively, c1 = c2 = 20° to 40°, for example, c1 = c2 = 30°.
S113: The electronic device collects the audio signal of the angular range angle_A (denoted Audio_A, as the first audio signal) and the audio signal of the angular range angle_B (denoted Audio_B, as the second audio signal).
Specifically, at time TS the electronic device turns on the microphones and collects audio signals through the microphone array. The device then processes the array signals with beamforming algorithm A to obtain the audio signal Audio_A, and with beamforming algorithm B to obtain the audio signal Audio_B.
The target angle of beamforming algorithm A is set to angle_A, so Audio_A is the audio signal obtained after suppressing sound signals outside angle_A, i.e., the audio signal within the angular range angle_A.
The target angle of beamforming algorithm B is set to angle_B, so Audio_B is the audio signal obtained after suppressing sound signals outside angle_B, i.e., the audio signal within the angular range angle_B.
The above describes an exemplary method by which the electronic device collects user voice. Returning to FIG. 3, the subsequent steps of the voice interaction method provided by this embodiment are described below.
S120: The electronic device determines whether the start time of the voice signal contained in Audio_A (denoted Voice_A, as the first voice signal) falls within the priority waiting period (also called the "first period"), and determines the target voice signal (the voice signal the device is about to answer) from Voice_A and Voice_B (as the second voice signal) according to the result, where Voice_B is the voice signal contained in Audio_B.
If user A speaks during the current voice collection period, Audio_A will contain the voice signal Voice_A. Likewise, if user B speaks during the current voice collection period, Audio_B will contain the voice signal Voice_B. The electronic device can detect Voice_A and Voice_B in Audio_A and Audio_B using a voice activity detection (VAD) algorithm and determine the endpoints (start and end points) of Voice_A and Voice_B.
In this embodiment, the electronic device determines whether the start point of Voice_A (also called the start time of Voice_A) falls within the priority waiting period, and accordingly determines one of Voice_A and Voice_B as the target voice signal. The priority waiting period is the period of a first duration T after time TS; that is, it starts at time TS and ends at time TE (TE = TS + T).
In some embodiments, the first duration T is the time within which user A is most likely to start speaking (for example, 3 s); that is, if user A intends to continue interacting with the device, user A will most probably start speaking within the priority waiting period. Determining the target voice signal according to the priority waiting period therefore better matches the user's interaction intent and improves the user experience. For coherence of presentation, the specific way the first duration T is determined is described later.
Referring to FIG. 7, the process by which the electronic device of this embodiment determines the target voice signal includes the following steps:
S121: The electronic device determines whether the start time TAS of the voice signal Voice_A falls within the priority waiting period.
The electronic device can detect the start time TAS of Voice_A using the VAD algorithm. It can be understood that TAS is the moment at which user A starts speaking.
After determining time TAS, the electronic device compares it with the end time TE of the priority waiting period. If TAS is earlier than or equal to TE, the device determines that the start time TAS of Voice_A falls within the priority waiting period and performs step S122; otherwise, it performs step S123.
S122: The electronic device determines the voice signal Voice_A as the target voice signal.
In this embodiment, as long as the start time TAS of Voice_A falls within the priority waiting period, Voice_A is determined as the target voice signal (the voice signal the device is about to answer). That is, as long as user A starts speaking within the priority waiting period, the electronic device determines user A as the target interacting person (i.e., keeps user A as the current interacting person), giving priority to user A's voice interaction needs.
FIG. 8A gives an example in which Voice_A is determined as the target voice signal. Referring to FIG. 8A, both the start time TAS of Voice_A and the start time TBS of Voice_B fall within the priority waiting period, and TAS is earlier than TBS (i.e., user A starts speaking before user B). In this example, because TAS falls within the priority waiting period, the electronic device determines Voice_A as the target voice signal.
Optionally, after the electronic device detects the start point of Voice_A (i.e., after time TAS), it may stop collecting the audio signal Audio_B (or the voice signal Voice_B).
FIG. 8B gives another example in which Voice_A is determined as the target voice signal. Referring to FIG. 8B, both TAS and TBS fall within the priority waiting period, but TBS is earlier than TAS (i.e., user B starts speaking before user A). In this example, although user B speaks first, the electronic device still determines Voice_A as the target voice signal because TAS falls within the priority waiting period. Optionally, after detecting the start point of Voice_A (i.e., after time TAS), the device may stop collecting Audio_B (or Voice_B) and discard the Voice_B collected before time TAS.
If user A does not start speaking within the priority waiting period, user A's willingness to interact is considered low, and user B may be determined as the target interacting person so that user B's voice interaction needs are also taken into account. This is described below with reference to the subsequent steps in FIG. 7.
S123: The electronic device determines whether the voice signals Voice_A and Voice_B overlap in time.
In this embodiment, at least one of the start time TAS of Voice_A and the start time TBS of Voice_B falls within the priority waiting period. When the start time TAS of Voice_A falls outside the priority waiting period, the electronic device determines the target voice signal according to the temporal overlap between Voice_A and Voice_B.
Specifically, if the start time TAS of Voice_A is earlier than or equal to the end time TBE of Voice_B (also called the "end point TBE of Voice_B"), the electronic device determines that Voice_A and Voice_B overlap in time and performs step S124; otherwise, it determines that they do not overlap in time and performs step S125. As described above, the endpoints of each voice signal, for example the start time TAS of Voice_A and the end time TBE of Voice_B, can all be determined by the VAD algorithm.
S124: The electronic device determines the voice signal Voice_A as the target voice signal.
When Voice_A and Voice_B overlap in time, the electronic device determines Voice_A as the target voice signal. That is, if user A starts speaking before user B has finished, user A remains the target interacting person (user A remains the current interacting person).
FIG. 9 gives a further example in which Voice_A is determined as the target voice signal. Referring to FIG. 9, the start time TBS of Voice_B falls within the priority waiting period and its end time TBE falls outside it. The start time TAS of Voice_A falls outside the priority waiting period, but TAS is earlier than the end time TBE of Voice_B (i.e., user A starts speaking before user B finishes). In this example, because Voice_A and Voice_B overlap in time, the electronic device determines Voice_A as the target voice signal.
Optionally, in some embodiments, after the electronic device detects the start point of Voice_A (i.e., after time TAS), it may stop collecting Audio_B (or Voice_B) and discard the Voice_B collected before time TAS. In such embodiments, if, after detecting the start point of Voice_A, the device determines that Voice_B has not yet ended (i.e., the end point TBE of Voice_B has not yet been detected), it can conclude that the start time TAS of Voice_A is earlier than the end time TBE of Voice_B.
S125: The electronic device determines the voice signal Voice_B as the target voice signal.
When Voice_A and Voice_B do not overlap in time, the electronic device determines Voice_B as the target voice signal. That is, when user B has finished speaking and user A has not yet started, the electronic device determines user B as the target interacting person (i.e., switches the current interacting person from user A to user B).
FIG. 10A gives an example in which Voice_B is determined as the target voice signal. Referring to FIG. 10A, the start time TBS of Voice_B falls within the priority waiting period and its end time TBE falls outside it. The start time TAS of Voice_A is later than TBE (i.e., when user B finishes speaking, user A has not yet started). In this example, because Voice_A and Voice_B do not overlap in time, the electronic device determines Voice_B as the target voice signal. Optionally, after the device detects the end point of Voice_B (i.e., after time TBE), it may stop collecting audio signals (for example, turn off the microphone and stop collecting Audio_A) and respond to Voice_B (i.e., perform step S130).
FIG. 10B gives another example in which Voice_B is determined as the target voice signal. Referring to FIG. 10B, both the start time TBS and the end time TBE of Voice_B fall within the priority waiting period, while the start time TAS of Voice_A falls outside it. In this example, after detecting the end point TBE of Voice_B, the electronic device keeps waiting until time TE. When the device determines that user A has still not started speaking by time TE, it determines Voice_B as the target voice signal. Optionally, after waiting until time TE, the device may stop collecting audio signals (for example, turn off the microphone and stop collecting Audio_A) and respond to Voice_B (i.e., perform step S130).
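The selection rules of steps S121 to S125 can be summarized in a short function. The sketch below assumes, as in this embodiment, that at least one of the two start times falls within the priority waiting period, and it does not model the fallback monitoring described later; the time values in the examples are hypothetical.

```python
def pick_target(tas, tbs, tbe, te):
    """Select the target speaker per the decision flow of FIG. 7.

    tas: start of Voice_A (current interactor), or None if user A never spoke
    tbs, tbe: start/end of Voice_B (potential interactor), or None if absent
    te: end of the priority waiting period (TE = TS + T)
    Returns "A" or "B".
    """
    # S121/S122: user A spoke inside the priority waiting period -> keep user A.
    if tas is not None and tas <= te:
        return "A"
    # S123: user A spoke late (or not at all); compare against Voice_B.
    if tbs is None:
        return "A"                      # nothing from user B to switch to
    if tas is not None and tas <= tbe:
        return "A"                      # S124: the two utterances overlap
    return "B"                          # S125: B finished before A started

# Examples mirroring FIG. 8A and FIG. 10A (seconds after TS, with TE = 3.0):
print(pick_target(tas=1.0, tbs=2.0, tbe=4.0, te=3.0))   # "A"
print(pick_target(tas=5.0, tbs=2.0, tbe=4.0, te=3.0))   # "B"
```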
S130: The electronic device responds to the target voice signal.
After determining the target voice signal, the electronic device responds to it (i.e., replies to the target interacting person). Illustratively, the device uploads the target voice signal to a cloud server. The cloud server performs semantic recognition on the target voice signal using an automatic speech recognition (ASR) algorithm and a natural language processing (NLP) algorithm to determine the reply text. ASR is a technique for converting speech to text, and NLP is a technique for enabling electronic devices to "understand" human language.
After determining the reply text, the cloud server sends it to the electronic device. Upon receiving the reply text, the device converts it into a speech stream using a text-to-speech (TTS) algorithm and outputs (for example, plays) the stream to respond to the target voice signal (or the target interacting person). While playing the speech stream, the device may turn off the microphone and stop collecting audio signals. In other embodiments, the electronic device may also determine the reply text using local ASR and NLP algorithms.
In addition to a voice answer, the electronic device may also respond with expressions, movements, and so on. For example, after determining that the target interacting person is user B, the device may turn to face user B, which makes the device appear more intelligent and lifelike.
In summary, this embodiment provides a voice interaction method that determines the target interacting person according to the priority waiting period, which reasonably balances the voice interaction needs of the current interacting person (for example, user A) and the potential interacting person (for example, user B) and improves the user experience in multi-user interaction scenarios.
For example, if user A (the current interacting person) starts speaking within the priority waiting period, the electronic device determines user A as the target interacting person, giving priority to user A's voice interaction needs. If user A does not start speaking within the priority waiting period, user A's willingness to interact is considered low, and user B (the potential interacting person) may be determined as the target interacting person so that user B's voice interaction needs are also taken into account.
This embodiment is an exemplary illustration of the technical solution of the present application; those skilled in the art may make other variations.
For example, in this embodiment, if user A does not start speaking within the priority waiting period, the target interacting person is determined according to the temporal overlap between Voice_A and Voice_B. In other embodiments, as long as user A does not start speaking within the priority waiting period while user B does, user B is determined as the target interacting person. Such embodiments can determine the target interacting person relatively more quickly and can also raise the priority of the potential interacting person.
As another example, in this embodiment at least one of the start time TAS of Voice_A and the start time TBS of Voice_B falls within the priority waiting period. In other embodiments, when both TAS and TBS fall outside the priority waiting period, the electronic device may continue to monitor Voice_A: if the start point of Voice_A is detected before a set time (for example, TS + 8 s), Voice_A is taken as the target voice signal; otherwise, the device ends the current voice interaction.
As another example, in some embodiments, considering that a wake-up word may directly cause a switch of the target interacting person, the voice of user B (i.e., the voice signal Voice_B) does not contain the wake-up word of the electronic device.
As another example, in this embodiment there is one potential interacting person around the electronic device, namely user B. In other scenarios there may be multiple potential interacting persons around the device. Referring to FIG. 11, there are two potential interacting persons around the electronic device, namely user B and user C. In such embodiments, the electronic device may select one of user B and user C as a tentative interacting person and determine the target interacting person between the tentative interacting person and user A according to the method described in step S120.
The present application does not limit the way the tentative interacting person is selected. For example, the one of user B and user C who interacted with the electronic device most recently may be selected as the tentative interacting person; or the one who starts speaking first in the current voice collection stage may be selected, as sketched below.
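A minimal sketch of the first selection rule mentioned above (the most recent interaction wins) might look as follows; the data structure holding the candidates' last-interaction times is an assumption made for illustration.

```python
def pick_tentative(candidates):
    """Pick the tentative interacting person among several potential ones.

    `candidates` maps a user id to the end time of that user's last
    interaction with the device; the most recent one wins.
    """
    return max(candidates, key=candidates.get)

# User C finished interacting more recently than user B, so user C is chosen.
print(pick_tentative({"user_B": -40.0, "user_C": -10.0}))  # "user_C"
```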
The method for determining the first duration T provided by this embodiment is described below.
In this embodiment, the first duration T is determined according to the time by which user A is most likely to start speaking; that is, assuming user A is willing to continue interacting, user A will start speaking at the latest within the first duration (i.e., before time TE).
In this embodiment, the first duration is determined from user A's interaction willingness value P and the number of interaction rounds M between user A and the electronic device within a set time period. The interaction willingness value P characterizes user A's willingness to interact with the device: the larger P is, the more likely user A is to interact with the device by voice.
Specifically, the first duration T = k1×P + k2×min{M, n}, where k1 and k2 are preset constants and n is an integer between 3 and 6. The terms are described in detail below.
(1) Interaction willingness value P. In this embodiment, the interaction willingness value is a number between 0 and 1, but the present application is not limited to this; in other embodiments, P may take other values, for example, a number between 1 and 5.
In this embodiment, the interaction willingness value P is determined from user A's facial angle and/or the distance D between user A and the electronic device. The facial angle indicates how directly user A's face is oriented toward the device: when the user's face squarely faces the front of the device, the facial angle is 0°, and the more user A turns away from the device, the larger the facial angle. Generally, the more directly user A faces the device, the larger the interaction willingness value P. The facial angle can be determined by image recognition.
The distance D between user A and the electronic device can be determined by image recognition, distance-sensor ranging, sound source localization, and the like. Illustratively, when D lies within a set range (for example, 0.5 to 1 times the height of the device), user A is considered to have a relatively high interaction willingness value P; the larger the deviation of D from this range, the smaller P is considered to be.
In some embodiments, the interaction willingness value P is a weighted sum of the facial angle and the distance D. In such embodiments, the weights of the facial angle φ and the distance D may be constants determined empirically.
In other embodiments, the interaction willingness value P may be determined by an AI algorithm. The electronic device may store a pre-trained AI model representing the mapping from the facial angle and the distance D to the interaction willingness value P. After measuring the facial angle and the distance D, the device can compute P with this model.
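For illustration, the weighted-combination variant might be sketched as follows. The mapping of each cue to a [0, 1] sub-score, the weights, and the distance range tied to the device height are assumptions, since the embodiment only states that the weights are determined empirically.

```python
def willingness(face_angle_deg: float, distance_m: float,
                device_height_m: float = 1.0,
                w_angle: float = 0.6, w_dist: float = 0.4) -> float:
    """Combine facial angle and distance into a willingness value P in [0, 1].

    Each cue is first mapped to a [0, 1] sub-score (facing the device squarely
    and standing 0.5-1.0 device-heights away score highest), then the two
    sub-scores are combined with empirical weights.
    """
    angle_score = max(0.0, 1.0 - abs(face_angle_deg) / 90.0)
    lo, hi = 0.5 * device_height_m, 1.0 * device_height_m
    if lo <= distance_m <= hi:
        dist_score = 1.0
    else:
        gap = (lo - distance_m) if distance_m < lo else (distance_m - hi)
        dist_score = max(0.0, 1.0 - gap / device_height_m)
    return w_angle * angle_score + w_dist * dist_score

print(round(willingness(face_angle_deg=0.0, distance_m=0.8), 2))   # close to 1
print(round(willingness(face_angle_deg=60.0, distance_m=2.5), 2))  # much lower
```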
(2) Interaction rounds M. M is the number of interaction rounds between user A and the electronic device within a set time period P2. In this application, one completed question-and-answer between the device and the user counts as one interaction round. For example, user A asking "Do you have a favorite animal?" and the device answering "I like furry animals, they look so warm" counts as one interaction round.
M reflects how frequently user A interacts with the device: the higher the frequency (i.e., the larger M), the more likely user A is to interact again. In this embodiment, the set time period P2 is the period of a third duration before time TS; that is, M is the number of rounds user A has interacted with the device within the most recent third duration. In this way, M more accurately characterizes the likelihood that user A will continue interacting with the device. Illustratively, the third duration is 0.5 to 2 min, for example, 1 min.
In some embodiments, M is the number of consecutive interaction rounds between user A and the device within the set time period P2, where "consecutive" means that no other user intervened during the interaction between user A and the device. If another user intervenes, the count of interaction rounds M restarts from 0.
(3) k1 and k2. k1 and k2 are the weights of the interaction willingness value P and the interaction rounds M, respectively, and adjust their contributions to the first duration T. In this embodiment, since P is a number between 0 and 1 while M is a number greater than 1, k1 is larger than k2 to balance the contributions of P and M to T; for example, k1 is 3 to 5 times k2. Illustratively, k1 = 2 and k2 = 0.5.
(4) n. n caps the first duration T. In some scenarios user A may have had many rounds of interaction with the device, for example M = 15, in which case T could take an excessively large value. To prevent T from growing without bound, the interaction-rounds term is set to min{M, n}, which limits the upper bound of T.
The above describes how the first duration T is determined. For example, in one example, with interaction willingness value P = 0.9, interaction rounds M = 3, k1 = 2, k2 = 0.5, and n = 5, T = 3.3 s; in another example, with P = 0.3, M = 1, k1 = 2, k2 = 0.5, and n = 5, T = 1.1 s.
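The formula and the two worked examples can be checked with a few lines of Python; the default constants below are simply the illustrative values given above.

```python
def first_duration(p: float, m: int, k1: float = 2.0, k2: float = 0.5, n: int = 5) -> float:
    """First duration T = k1*P + k2*min(M, n)."""
    return k1 * p + k2 * min(m, n)

# The two worked examples from the description:
print(first_duration(p=0.9, m=3))   # 2*0.9 + 0.5*3 = 3.3
print(first_duration(p=0.3, m=1))   # 2*0.3 + 0.5*1 = 1.1
# min(M, n) caps the waiting period even after many rounds:
print(first_duration(p=0.9, m=15))  # 2*0.9 + 0.5*5 = 4.3
```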
In this embodiment, determining the first duration from the interaction willingness value P and the interaction rounds M makes it possible to predict user A's likelihood of interacting accurately: the more likely user A is to interact, the longer the device waits for user A with priority, so that user A's voice interaction needs are reasonably satisfied.
The above describes a method for determining the first duration T, but the present application is not limited to it.
For example, in other embodiments the first duration T may be determined from only one of the interaction willingness value P and the interaction rounds M, for example, T = 3×P.
As another example, in this embodiment the first duration T is adjusted dynamically according to P and M. In other embodiments, T may be a fixed value determined empirically, for example, 3 s. Such embodiments simplify the determination of T and reduce computational overhead.
Compared with the voice interaction methods provided by other embodiments, the voice interaction method provided by the embodiments of the present application can reasonably select the target interacting person in multi-user interaction scenarios. Comparisons with the methods of other embodiments are given below.
FIG. 12 shows one such implementation. Specifically, FIG. 12 provides a sound pickup method and apparatus in which the initial pickup beam points at the target sound source; when a change in the orientation of the recording device is detected, the pickup beam direction is adjusted dynamically so that it stays aligned with the target sound source, thereby attenuating or shielding the sound signals of other noise sources.
The implementation shown in FIG. 12 dynamically adjusts the pickup beam direction according to the detected orientation of the recording device after the target sound source has been identified, but it does not solve the problems of how a robot should choose its pickup direction in multi-user interaction scenarios and how to select the target interacting person among multiple users.
FIG. 13 shows another implementation. Specifically, FIG. 13 provides a sound pickup method including the following steps: in step S1001, when the received sound source angle is within a preset angle, the faces in the current camera view are obtained; in step S1002, among these faces, the face closest to the sound source angle is selected as the speaker, i.e., the face closest in angle is tracked so as to track the current speaker; finally, in step S1003, the robot's angle is adjusted so that the center of the speaker's face falls at the center position in front of the robot, so that the robot can respond to the speaker's sound signal.
The implementation shown in FIG. 13 merely selects, from all the faces in front of the device, the face closest to the current sound source angle and tracks it, so as to track the current speaker; it does not solve the problems of how a robot determines its pickup direction in multi-user interaction scenarios and how to select the target interacting person among multiple users.
Compared with the approaches shown in FIG. 12 and FIG. 13, the voice interaction method of this embodiment can, in multi-user interaction scenarios, reasonably balance the voice interaction needs of the current interacting person and the potential interacting person and accurately determine the target interacting person, thereby improving the user experience in such scenarios.
Referring now to FIG. 14, shown is a block diagram of an electronic device 400 according to an embodiment of the present application. The electronic device 400 may include one or more processors 401 coupled to a controller hub 403. For at least one embodiment, the controller hub 403 communicates with the processor 401 via a multi-drop bus such as a front side bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 406. The processor 401 executes instructions that control general types of data processing operations. In one embodiment, the controller hub 403 includes, but is not limited to, a graphics & memory controller hub (GMCH) (not shown) and an input/output hub (IOH) (which may be on a separate chip) (not shown), where the GMCH includes the memory and graphics controllers and is coupled to the IOH.
The electronic device 400 may also include a coprocessor 402 and a memory 404 coupled to the controller hub 403. Alternatively, one or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 coupled directly to the processor 401 and the controller hub 403, and with the controller hub 403 and the IOH residing in a single chip.
The memory 404 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. The memory 404 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. The computer-readable storage medium stores instructions, and in particular, temporary and permanent copies of those instructions. The instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the methods shown in FIG. 3, FIG. 5, and FIG. 7. When the instructions are run on a computer, the computer is caused to perform the methods disclosed in the above embodiments.
In one embodiment, the coprocessor 402 is a special-purpose processor, such as, for example, a high-throughput many integrated core (MIC) processor, a network or communications processor, a compression engine, a graphics processor, a general-purpose computing on graphics processing units (GPGPU) processor, or an embedded processor. The optional nature of the coprocessor 402 is represented by dashed lines in FIG. 14.
In one embodiment, the electronic device 400 may further include a network interface controller (NIC) 406. The network interface 406 may include a transceiver that provides a radio interface for the electronic device 400 to communicate with any other suitable devices (such as front-end modules, antennas, and the like). In various embodiments, the network interface 406 may be integrated with other components of the electronic device 400. The network interface 406 can implement the functions of the communication unit in the above embodiments.
The electronic device 400 may further include input/output (I/O) devices 405. The I/O devices 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information related to the electronic device 400.
It is worth noting that FIG. 14 is merely exemplary. That is, although FIG. 14 shows the electronic device 400 as including multiple components such as the processor 401, the controller hub 403, and the memory 404, in practical applications a device using the methods of the present application may include only some of the components of the electronic device 400, for example, only the processor 401 and the network interface 406. The optional nature of components in FIG. 14 is shown with dashed lines.
Referring now to FIG. 15, shown is a block diagram of a system on chip (SoC) 500 according to an embodiment of the present application. In FIG. 15, similar components bear the same reference numerals, and dashed boxes are optional features of more advanced SoCs. In FIG. 15, the SoC 500 includes: an interconnect unit 550 coupled to a processor 510; a system agent unit 580; a bus controller unit 590; an integrated memory controller unit 540; a set of one or more coprocessors 520, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 530; and a direct memory access (DMA) unit 560. In one embodiment, the coprocessor 520 includes a special-purpose processor, such as, for example, a network or communications processor, a compression engine, a general-purpose computing on graphics processing units (GPGPU) processor, a high-throughput MIC processor, or an embedded processor.
The static random access memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. The computer-readable storage medium stores instructions, and in particular, temporary and permanent copies of those instructions. The instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the methods shown in FIG. 3, FIG. 5, and FIG. 7. When the instructions are run on a computer, the computer is caused to perform the methods disclosed in the above embodiments.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone.
The method embodiments of the present application may each be implemented in software, magnetic components, firmware, or the like.
Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
Program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language; in any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. These representations, known as "Intellectual Property (IP) cores", may be stored on a tangible computer-readable storage medium and supplied to customers or production facilities to be loaded into the fabrication machines that actually make the logic or processor.
In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may translate (for example, using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or partly on and partly off processor.

Claims (11)

  1. A voice interaction method for an electronic device, characterized in that the method comprises:
    conducting voice interaction with a first user and, during a voice collection period of the voice interaction, collecting a first audio signal of the angular range in which the first user is located and a second audio signal of the angular range in which a second user is located, the second user being a user who conducted voice interaction with the electronic device within a set historical period;
    determining whether the start time of a first voice signal in the first audio signal falls within a first period, and determining a target voice signal from the first voice signal and a second voice signal according to the determination result, the second voice signal being the voice signal contained in the second audio signal, and the first period being the period of a first duration after the start time of the voice collection period;
    responding to the target voice signal.
  2. The method according to claim 1, characterized in that at least one of the start time of the first voice signal and the start time of the second voice signal falls within the first period;
    the determining a target voice signal from the first voice signal and the second voice signal according to the determination result comprises:
    if the start time of the first voice signal falls within the first period, determining the first voice signal as the target voice signal; otherwise, determining the target voice signal according to the temporal overlap between the second voice signal and the first voice signal.
  3. The method according to claim 2, characterized in that the determining the target voice signal according to the temporal overlap between the second voice signal and the first voice signal comprises:
    if the second voice signal and the first voice signal overlap in time, determining the first voice signal as the target voice signal;
    if the second voice signal and the first voice signal do not overlap in time, determining the second voice signal as the target voice signal.
  4. The method according to claim 1, characterized in that the first duration is determined according to an interaction willingness value P of the first user and/or a number of interactions M between the first user and the electronic device within a set time period, wherein the interaction willingness value P characterizes the first user's willingness to conduct voice interaction with the electronic device.
  5. The method according to claim 4, characterized in that the interaction willingness value P is determined according to the facial angle of the first user and/or the distance between the first user and the electronic device.
  6. The method according to claim 4, characterized in that the first duration is k1×P+k2×min{M,n}, wherein k1 and k2 are preset constants and n is an integer between 3 and 6.
  7. The method according to claim 1, characterized in that the second voice signal does not contain a wake-up word of the electronic device.
  8. The method according to claim 1, characterized in that the set historical period is the period of a second duration before the start time of the voice collection period.
  9. The method according to claim 8, characterized in that the second duration is 0.5 to 2 minutes.
  10. An electronic device, characterized by comprising:
    a memory for storing instructions to be executed by one or more processors of the electronic device;
    a processor which, when executing the instructions in the memory, causes the electronic device to perform the method according to any one of claims 1 to 9.
  11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 9.
PCT/CN2023/089278 2022-04-22 2023-04-19 语音交互方法、电子设备以及存储介质 WO2023202635A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210430483.3 2022-04-22
CN202210430483.3A CN116978372A (zh) 2022-04-22 2022-04-22 语音交互方法、电子设备以及存储介质

Publications (1)

Publication Number Publication Date
WO2023202635A1 true WO2023202635A1 (zh) 2023-10-26

Family

ID=88419276

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/089278 WO2023202635A1 (zh) 2022-04-22 2023-04-19 语音交互方法、电子设备以及存储介质

Country Status (2)

Country Link
CN (1) CN116978372A (zh)
WO (1) WO2023202635A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110875060A (zh) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 语音信号处理方法、装置、系统、设备和存储介质
CN111243585A (zh) * 2020-01-07 2020-06-05 百度在线网络技术(北京)有限公司 多人场景下的控制方法、装置、设备及存储介质
CN111694433A (zh) * 2020-06-11 2020-09-22 北京百度网讯科技有限公司 语音交互的方法、装置、电子设备及存储介质
CN112739507A (zh) * 2020-04-22 2021-04-30 南京阿凡达机器人科技有限公司 一种交互沟通实现方法、设备和存储介质


Also Published As

Publication number Publication date
CN116978372A (zh) 2023-10-31

Similar Documents

Publication Publication Date Title
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
EP3353677B1 (en) Device selection for providing a response
US11514917B2 (en) Method, device, and system of selectively using multiple voice data receiving devices for intelligent service
WO2021136037A1 (zh) 语音唤醒方法、设备及系统
JP2018180523A (ja) マン・マシン・ダイアログにおけるエージェント係属の管理
US9263044B1 (en) Noise reduction based on mouth area movement recognition
EP3274988A1 (en) Controlling electronic device based on direction of speech
CN105793923A (zh) 本地和远程语音处理
CN109166575A (zh) 智能设备的交互方法、装置、智能设备和存储介质
TW202008352A (zh) 方位角估計的方法、設備、語音交互系統及儲存介質
GB2573173A (en) Processing audio signals
CN113611318A (zh) 一种音频数据增强方法及相关设备
CN115206306A (zh) 语音交互方法、装置、设备及系统
KR102629796B1 (ko) 음성 인식의 향상을 지원하는 전자 장치
KR20240017404A (ko) 탠덤 네트워크들을 사용한 잡음 억제
CN112585675A (zh) 选择地使用多个语音数据接收装置进行智能服务的方法、装置和系统
WO2023202635A1 (zh) 语音交互方法、电子设备以及存储介质
JPWO2019163247A1 (ja) 情報処理装置、情報処理方法、および、プログラム
WO2019069529A1 (ja) 情報処理装置、情報処理方法、および、プログラム
CN116582382B (zh) 智能设备控制方法、装置、存储介质及电子设备
WO2022052691A1 (zh) 基于多设备的语音处理方法、介质、电子设备及系统
US11743588B1 (en) Object selection in computer vision
JP2023545981A (ja) 動的分類器を使用したユーザ音声アクティビティ検出
US11562741B2 (en) Electronic device and controlling method using non-speech audio signal in the electronic device
WO2023231936A1 (zh) Voice interaction method and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791290

Country of ref document: EP

Kind code of ref document: A1