WO2022052691A1 - Multi-device voice processing method, medium, electronic device, and system - Google Patents

Multi-device voice processing method, medium, electronic device, and system

Info

Publication number
WO2022052691A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
voice
information
audio
electronic
Application number
PCT/CN2021/110865
Other languages
French (fr)
Chinese (zh)
Inventor
潘邵武
万柯
谷岳
印文帅
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022052691A1

Classifications

    • G PHYSICS
        • G05 CONTROLLING; REGULATING
            • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
                • G05B15/00 Systems controlled by a computer
                    • G05B15/02 Systems controlled by a computer electric
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                • G10L17/00 Speaker identification or verification techniques
                    • G10L17/22 Interactive procedures; Man-machine interfaces
                • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L21/0208 Noise filtering
                            • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
                • H04L12/00 Data switching networks
                    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]

Definitions

  • the present application relates to speech processing technology in the field of artificial intelligence, and in particular, to a multi-device-based speech processing method, medium, electronic device and system.
  • a voice assistant is an application (APP) based on artificial intelligence (AI).
  • Smart devices such as mobile phones receive and recognize voice commands spoken by users through voice assistants, providing users with voice control functions such as interactive dialogue, information query, and device control.
  • With the widespread popularity of smart devices equipped with voice assistants, there are usually multiple such devices in the user's environment (such as the user's home). In this multi-device scenario, if several devices share the same wake-up word, then after the user speaks the wake-up word, the voice assistants of all of those devices will be woken up and will all recognize and respond to the user's subsequent voice commands.
  • To avoid this, multiple devices can cooperate to select, from the devices sharing the same wake-up word, the device closest to the user to wake up its voice assistant, so that this device picks up, recognizes, and responds to the user's voice command. However, if there is strong external noise near the selected device, or if the device has poor sound pickup capability, the accuracy of its voice command recognition result in the automatic speech recognition process is low, and the operation indicated by the voice instruction therefore cannot be performed accurately.
  • Embodiments of the present application provide a multi-device-based voice processing method, medium, electronic device, and system.
  • the sound pickup device selected from the multiple devices may have one or more favorable factors, such as being closest to the user, being farthest from an external noise source, or having an internal noise reduction capability. This alleviates the influence of the deployment location of electronic devices, internal noise interference, or external noise interference on the voice pickup effect and speech recognition accuracy of the voice assistant in multi-device scenarios, and improves the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
  • according to a first aspect, an embodiment of the present application provides a multi-device-based voice processing method. The method includes: a first electronic device among a plurality of electronic devices picks up sound to obtain a first to-be-recognized voice; the first electronic device receives, from a second electronic device that is playing audio among the plurality of electronic devices, audio information related to the audio played by the second electronic device; and the first electronic device performs noise reduction processing on the first to-be-recognized voice according to the audio information to obtain a second to-be-recognized voice.
  • the electronic device used for sound pickup (i.e., the first electronic device) is the sound pickup device described hereinafter, for example an electronic device with a better sound pickup effect selected from the multiple devices; the above-mentioned electronic device that is playing audio (i.e., the second electronic device) is the internal noise device among the multiple devices, and the audio information of the audio played by the second electronic device is the noise reduction information of the internal noise device described hereinafter.
  • because the first electronic device performs noise reduction processing on the first to-be-recognized voice, using the audio information of the audio played by the second electronic device, to obtain the second to-be-recognized voice, the influence of the internal noise of an electronic device that is playing audio on the voice pickup effect of the voice assistant in a multi-device scenario can be alleviated. This ensures the multi-device-based voice pickup effect of the voice assistant, which in turn helps to ensure the speech recognition accuracy of the voice assistant and improves the environmental robustness of speech recognition in multi-device scenarios.
  • the above-mentioned audio information includes at least one of the following: audio data of the played audio, and voice activity detection (VAD) information corresponding to the audio.
  • the audio information can reflect the played audio itself. By performing noise reduction processing on the internal noise generated by the played audio, the influence of this internal noise on other picked-up voice data (such as the voice data corresponding to the user's to-be-recognized voice) can be eliminated, thereby improving the quality of the picked-up voice data.
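As a minimal illustration of how the shared audio information could drive such noise reduction, the sketch below subtracts an adaptively filtered copy of the reported playback audio from the picked-up signal using a normalized least-mean-squares (NLMS) filter. The function name, filter length, and step size are assumptions of this sketch, not details specified by the application, and the reference signal is assumed to already be time-aligned with the microphone signal.

```python
import numpy as np

def nlms_cancel(mic: np.ndarray, reference: np.ndarray,
                taps: int = 256, mu: float = 0.1, eps: float = 1e-6) -> np.ndarray:
    """Subtract an adaptively filtered copy of `reference` (the audio the second device
    reports it is playing) from `mic` (the first device's picked-up signal).
    Assumes `reference` covers at least the same span as `mic`."""
    mic = mic.astype(float)
    reference = reference.astype(float)
    w = np.zeros(taps)                               # adaptive filter weights
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = reference[n - taps:n][::-1]              # most recent reference samples
        echo_est = float(w @ x)                      # estimated internal-noise component
        e = mic[n] - echo_est                        # error signal = noise-reduced sample
        w += (mu / (eps + float(x @ x))) * e * x     # NLMS weight update
        out[n] = e
    return out
```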
  • the above-mentioned method further includes: the first electronic device sends the second to-be-recognized voice to a third electronic device used for recognizing voice among the plurality of electronic devices; or, the first electronic device recognizes the second to-be-recognized voice.
  • the electronic device used for recognizing voice (i.e., the third electronic device) is the answering device described hereinafter. The electronic device used for recognizing voice and the electronic device used for picking up sound may be the same or different; that is, the first electronic device (or the microphone module of the first electronic device) may be used as a peripheral of the third electronic device to pick up the user's voice command, so that the peripheral resources of multiple electronic devices equipped with a microphone module and a voice assistant can be effectively aggregated.
  • the above-mentioned method further includes: the first electronic device sends the sound pickup election information of the first electronic device to the third electronic device, where the sound pickup election information of the first electronic device is used to indicate the sound pickup situation of the first electronic device; the first electronic device is an electronic device that the third electronic device elects for sound pickup from the plurality of electronic devices based on the acquired sound pickup election information of the multiple electronic devices.
  • in this way, after the user speaks a voice command, the user does not need to operate a specific electronic device to pick up the voice command to be recognized (such as the voice command corresponding to the second voice data below); instead, the answering device (i.e., the third electronic device) automatically uses the pickup device (i.e., the first electronic device) to pick it up.
  • the above-mentioned method further includes: the first electronic device receives a sound pickup instruction (that is, the pickup instruction described hereinafter) sent by the third electronic device, where the sound pickup instruction is used to instruct the first electronic device to pick up sound and send the noise-reduced to-be-recognized voice to the third electronic device.
  • in this way, the first electronic device knows that it needs to send the to-be-recognized voice obtained by pickup (such as the second to-be-recognized voice) to the third electronic device, and that it will not itself recognize the to-be-recognized voice or perform other subsequent processing.
  • the above-mentioned sound pickup election information includes at least one of the following: acoustic echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to the wake-up word obtained by pickup, and voice information corresponding to the voice command obtained by pickup, where the voice command is obtained by pickup after the wake-up word is obtained. The device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and profile information. It can be understood that the different items in the sound pickup election information represent different factors that affect the sound pickup effect of an electronic device.
  • therefore, the embodiment of the present application can comprehensively consider the different factors affecting the sound pickup effect of the electronic devices to elect the sound pickup device, for example electing the electronic device with the best sound pickup effect for sound pickup, that is, as the pickup device among the multiple devices (an illustrative election-information payload is sketched below).
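For concreteness, the sound pickup election information listed above could be carried in a small structured payload such as the one below; the field names and types are illustrative assumptions, not a message format defined by the application.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PickupElectionInfo:
    """Illustrative payload a candidate device might send to the answering device."""
    device_id: str
    aec_capable: bool                       # echo cancellation (AEC) capability information
    aec_in_effect: bool                     # whether AEC is currently in effect
    mic_module: str                         # microphone module info, e.g. "single", "linear_array", "ring_array"
    mic_cutoff_hz: int                      # cutoff frequency of the microphone module
    network_ok: bool                        # device status: network connection status
    headset_connected: bool                 # device status: headset connection status
    mic_occupied: bool                      # device status: microphone occupancy status
    scene_mode: str                         # device status: profile / scene mode information
    wake_word_snr_db: float                 # voice information of the picked-up wake-up word
    command_snr_db: Optional[float] = None  # voice information of a picked-up voice command, if any
```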
  • according to a second aspect, an embodiment of the present application provides a multi-device-based voice processing method. The method includes: a second electronic device among a plurality of electronic devices plays audio; the second electronic device sends, to a first electronic device used for sound pickup among the plurality of electronic devices, audio information related to the audio, where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-recognized voice picked up by the first electronic device. Because the second electronic device that is playing audio can provide the audio information of that audio, the first electronic device used for sound pickup can perform noise reduction processing on the first to-be-recognized voice obtained by pickup according to the audio information, thereby eliminating the influence of the internal noise generated by the audio on the pickup and improving the sound pickup effect of the first electronic device, that is, improving the quality of the picked-up voice data (i.e., the voice data of the second to-be-recognized voice). Therefore, the influence of the internal noise of an electronic device that is playing audio on the sound pickup effect of the voice assistant in a multi-device scenario can be alleviated, ensuring the multi-device-based sound pickup effect of the voice assistant, which in turn helps to ensure the speech recognition accuracy of the voice assistant and improves the environmental robustness of speech recognition in multi-device scenarios.
  • the audio information includes at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio (a minimal VAD sketch is given below).
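A playing device could derive the VAD information it shares with a simple frame-energy detector like the sketch below; the frame size, sample rate, and threshold are arbitrary assumptions for illustration.

```python
import numpy as np

def frame_vad(played_audio: np.ndarray, frame_len: int = 320, threshold: float = 1e-3) -> list:
    """Mark each 20 ms frame (assuming 16 kHz audio) of the played audio as active or silent,
    so the pickup device knows which frames actually contribute internal noise."""
    flags = []
    for start in range(0, len(played_audio) - frame_len + 1, frame_len):
        frame = played_audio[start:start + frame_len].astype(float)
        flags.append(bool(np.mean(frame ** 2) > threshold))  # frame energy vs. threshold
    return flags
```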
  • the above-mentioned method further includes: the second electronic device receives a sharing instruction (i.e., the noise reduction instruction described hereinafter) from a third electronic device used for recognizing voice among the plurality of electronic devices; or the second electronic device receives a sharing instruction from the first electronic device; where the sharing instruction is used to instruct the second electronic device to send the audio information to the first electronic device. In this way, the electronic device that sends the sharing instruction (such as the first electronic device or the third electronic device) can monitor whether the second electronic device is playing audio, and send the sharing instruction to the second electronic device only when the second electronic device is playing audio.
  • the method further includes: the second electronic device sends the sound pickup election information of the second electronic device to the third electronic device, where the sound pickup election information of the second electronic device is used to indicate the sound pickup situation of the second electronic device; the first electronic device is an electronic device that the third electronic device elects for sound pickup from the plurality of electronic devices based on the acquired sound pickup election information of the multiple electronic devices.
  • in this way, the third electronic device, as the answering device described hereinafter, can elect the electronic device that picks up the voice command with the best audio quality (that is, the electronic device with the best sound pickup effect) as the sound pickup device (such as the first electronic device), to support the answering device in completing the voice interaction process with the user through the voice assistant. For example, the pickup device may be an electronic device that is closest to the user and has better SE processing capability.
  • according to a third aspect, an embodiment of the present application provides a multi-device-based voice processing method. The method includes: a third electronic device among a plurality of electronic devices detects that there is a second electronic device among the plurality of electronic devices that is playing audio; in the case that the second electronic device is different from the third electronic device, the third electronic device sends a sharing instruction to the second electronic device, where the sharing instruction is used to instruct the second electronic device to send, to a first electronic device used for sound pickup among the plurality of electronic devices, audio information related to the audio played by the second electronic device; in the case that the second electronic device is the same as the third electronic device, the third electronic device sends the audio information to the first electronic device. The audio information can be used by the first electronic device to perform noise reduction processing on the first to-be-recognized voice picked up by the first electronic device to obtain a second to-be-recognized voice. Specifically, because the second electronic device that is playing audio can, under the instruction of the third electronic device, provide the audio information of that audio, the first electronic device used for sound pickup can perform noise reduction processing on the first to-be-recognized voice obtained by pickup according to the audio information, thereby eliminating the influence of the internal noise generated by the audio on the pickup and improving the sound pickup effect of the first electronic device, that is, improving the quality of the picked-up voice data (i.e., the voice data of the second to-be-recognized voice). Therefore, the influence of the internal noise of an electronic device that is playing audio on the sound pickup effect of the voice assistant in a multi-device scenario can be alleviated, ensuring the multi-device-based sound pickup effect of the voice assistant, which in turn helps to ensure the speech recognition accuracy of the voice assistant and improves the environmental robustness of speech recognition in multi-device scenarios.
  • the audio information includes at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio.
  • in a case where the first electronic device is different from the third electronic device, the above method further includes: the third electronic device obtains, from the first electronic device, the second to-be-recognized voice obtained by the first electronic device through pickup; and the third electronic device recognizes the second to-be-recognized voice. In this way, even if the answering device elected in the multi-device scenario (such as the third electronic device closest to the user) has a poor sound pickup effect, or there is noise generated by an electronic device that is playing audio, multiple devices can cooperate to pick up and recognize voice data with better audio quality, without requiring the user to move or to manually control a specific electronic device to pick up sound. Further, this helps to improve the accuracy of speech recognition during the voice control process and to improve the user experience.
  • the above method further includes: the third electronic device acquires the sound pickup election information of the multiple electronic devices, where the sound pickup election information of the multiple electronic devices is used to indicate the sound pickup situation of the multiple electronic devices; and the third electronic device elects at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple devices. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, so as to alleviate the influence of factors such as the deployment location of electronic devices, internal noise interference, and external noise interference on the recognition accuracy of the voice assistant in multi-device scenarios, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
  • the above method further includes: the third electronic device sends a sound pickup instruction to the first electronic device, where the sound pickup instruction is used to instruct the first electronic device to pick up sound and send, to the third electronic device, the second to-be-recognized voice obtained by pickup. It can be understood that, under the above sound pickup instruction, the first electronic device knows that it needs to send the to-be-recognized voice obtained by pickup to the third electronic device, without performing subsequent processing such as recognizing the to-be-recognized voice.
  • the above-mentioned sound pickup election information includes at least one of the following: acoustic echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to the wake-up word obtained by pickup, and voice information corresponding to the voice command obtained by pickup, where the voice command is obtained by pickup after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and profile information.
  • the above-mentioned electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to a headset, the third electronic device determines the third electronic device as the first electronic device; the third electronic device determines, as the first electronic device, at least one of the electronic devices that is in the preset scene mode among the multiple electronic devices.
  • if an electronic device is in a device state that is not conducive to its sound pickup, for example its network connection status is poor, a wired or wireless headset is connected, the microphone is already occupied, or it is in airplane mode, this means that the sound pickup effect of the electronic device is difficult to guarantee, or that the electronic device cannot normally cooperate with other devices to pick up sound, for example it cannot normally send the voice data obtained by pickup to other electronic devices. In this way, a sound pickup device with a better sound pickup effect (such as the above-mentioned first electronic device) can be selected according to the above selection steps.
  • the above-mentioned electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices that meet a predetermined noise reduction condition among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the user is less than a first predetermined distance among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the external noise source is greater than a second predetermined distance among the multiple electronic devices as the first electronic device.
  • here, the predetermined noise reduction condition indicates that the SE processing effect of the electronic device is better, for example that AEC is in effect or that the device has an internal noise reduction capability; the first predetermined distance is, for example, 0.5 m, and the second predetermined distance is, for example, 3 m. It can be understood that the sound pickup effect of an electronic device closer to the user is better, the sound pickup effect of an electronic device farther from external noise is better, and an electronic device whose microphone module has better noise reduction performance or whose AEC is in effect also has a better sound pickup effect. In this way, a sound pickup device with a better sound pickup effect (i.e., the above-mentioned first electronic device) can be selected from the multiple devices, for example by a rule-based filter like the sketch below.
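The criteria just listed could be combined into a simple rule-based filter such as the sketch below; the dictionary keys and the equal weighting of the criteria are assumptions made for illustration, while the 0.5 m and 3 m thresholds come from the example values above.

```python
FIRST_PREDETERMINED_DISTANCE_M = 0.5   # maximum distance to the user
SECOND_PREDETERMINED_DISTANCE_M = 3.0  # minimum distance to the external noise source

def elect_pickup_devices(candidates: list) -> list:
    """Return the candidate device(s) that best satisfy the election criteria.
    Each candidate is a dict with illustrative keys."""
    def score(c: dict) -> int:
        s = 0
        if c.get("aec_in_effect"):
            s += 1                                                            # AEC is in effect
        if c.get("meets_noise_reduction_condition"):
            s += 1                                                            # predetermined noise reduction condition
        if c.get("user_distance_m", float("inf")) < FIRST_PREDETERMINED_DISTANCE_M:
            s += 1                                                            # closer than the first predetermined distance
        if c.get("noise_source_distance_m", 0.0) > SECOND_PREDETERMINED_DISTANCE_M:
            s += 1                                                            # farther than the second predetermined distance
        return s

    best = max(score(c) for c in candidates)
    return [c for c in candidates if score(c) == best]
```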
  • the preset network state includes at least one of the following: a network whose communication rate is less than or equal to a predetermined rate, and a network whose disconnection frequency is greater than or equal to a predetermined frequency; the preset scene mode includes at least one of the following: subway mode, airplane mode, driving mode, and travel mode.
  • when the network communication rate is less than or equal to the predetermined rate, or the network disconnection frequency is greater than or equal to the predetermined frequency, the network condition of the electronic device is poor; the specific values of the predetermined rate and the predetermined frequency can be determined according to actual needs. Therefore, an electronic device in the preset network state is generally not suitable for participating in the election of a sound pickup device or for acting as the sound pickup device (e.g., the first electronic device used for sound pickup).
  • the third electronic device uses a neural network algorithm or a decision tree algorithm to elect the first electronic device from the multiple electronic devices. It can be understood that the sound pickup election information of the multiple devices can be used as the input of the neural network algorithm or the decision tree algorithm, which outputs the decision that the first electronic device is the sound pickup device (see the sketch below).
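Purely as an illustration of the decision-tree variant, the sketch below trains scikit-learn's DecisionTreeClassifier on a handful of synthetic election records; the feature encoding, training data, and labels are invented for this example and are not specified by the application.

```python
from sklearn.tree import DecisionTreeClassifier

# Feature vector per device: [aec_in_effect, headset_connected, mic_occupied, wake_word_snr_db]
X_train = [
    [1, 0, 0, 25.0],
    [0, 0, 0, 12.0],
    [0, 0, 0, 18.0],
    [1, 0, 1, 8.0],
]
y_train = [1, 0, 1, 0]  # 1 = was chosen as pickup device, 0 = was not (synthetic labels)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Election: pick the candidate the tree scores as most likely to be a good pickup device.
candidates = {"phone_101a": [1, 0, 0, 27.0], "tablet_102a": [0, 0, 0, 15.0]}
probs = clf.predict_proba(list(candidates.values()))[:, 1]
pickup_device = max(zip(candidates, probs), key=lambda kv: kv[1])[0]
print(pickup_device)  # "phone_101a" under this synthetic data
```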
  • according to a fourth aspect, the present application provides a multi-device-based voice processing method. The method includes: a third electronic device among a plurality of electronic devices obtains sound pickup election information of the multiple electronic devices, where the sound pickup election information is used to indicate the sound pickup situation of the multiple electronic devices; the third electronic device elects at least one electronic device from the multiple electronic devices as a first electronic device used for sound pickup based on the sound pickup election information of the multiple devices, where the first electronic device is the same as or different from the third electronic device; the third electronic device acquires, from the first electronic device, the to-be-recognized voice obtained by the first electronic device; and the third electronic device recognizes the acquired to-be-recognized voice.
  • in this way, even if the elected third electronic device (such as the electronic device closest to the user) has a poor sound pickup effect, multiple devices can collaboratively pick up and recognize voice data with better audio quality, without requiring the user to move or to manually control a specific electronic device to pick up sound.
  • further, this helps to improve the accuracy of speech recognition during the voice control process and to improve the user experience. It also alleviates the influence of factors such as the deployment location of electronic devices and external noise interference on the voice pickup effect and speech recognition accuracy of the voice assistant in multi-device scenarios, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
  • the above-mentioned sound pickup election information includes at least one of the following: acoustic echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to the wake-up word obtained by pickup, and voice information corresponding to the voice command obtained by pickup, where the voice command is obtained by pickup after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and profile information.
  • the electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to a headset, the third electronic device determines the third electronic device as the first electronic device; the third electronic device determines, as the first electronic device, at least one of the electronic devices that is in the preset scene mode among the multiple electronic devices.
  • the electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices that meet the predetermined noise reduction condition among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the user is less than the first predetermined distance among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the external noise source is greater than the second predetermined distance among the multiple electronic devices as the first electronic device.
  • the preset network state includes at least one of the following: a network whose communication rate is less than or equal to a predetermined rate, and a network whose disconnection frequency is greater than or equal to a predetermined frequency; the preset scene mode includes at least one of the following: subway mode, airplane mode, driving mode, and travel mode.
  • the third electronic device uses a neural network algorithm or a decision tree algorithm to select the first electronic device from multiple electronic devices.
  • the above method further includes: the third electronic device detects that there is a second electronic device among the multiple electronic devices that is playing audio; the third electronic device sends a sharing instruction to the second electronic device, where the sharing instruction is used to instruct the second electronic device to send, to the first electronic device, audio information related to the audio played by the second electronic device, and where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-recognized voice picked up by the first electronic device.
  • in a case where the third electronic device is different from the first electronic device, the method further includes: the third electronic device plays audio; the third electronic device sends, to the first electronic device, audio information related to the audio played by the third electronic device, where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-recognized voice picked up by the first electronic device.
  • the audio information includes at least one of the following: audio data of the played audio, and voice activity detection (VAD) information corresponding to the audio.
  • the present application provides an apparatus. The apparatus is included in an electronic device and has the function of implementing the behavior of the electronic device in the above-mentioned aspects and their possible implementations.
  • the functions can be implemented by hardware, or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules or units corresponding to the above functions. For example, a pickup unit or module (such as a microphone or a microphone array), a receiving unit or module (such as a transceiver), a noise reduction module or unit (such as a processor with the function of the module or unit), and the like.
  • for example, the sound pickup unit or module is used to support the first electronic device among the multiple electronic devices in picking up sound to obtain the first to-be-recognized voice; the receiving unit or module (such as a transceiver) is used to support the first electronic device in receiving, from the second electronic device that is playing audio among the multiple electronic devices, audio information related to the audio played by the second electronic device; and the noise reduction module or unit is used to support the first electronic device in performing noise reduction processing on the first to-be-recognized voice obtained by pickup, according to the audio information received by the receiving unit or module, to obtain the second to-be-recognized voice. An illustrative sketch of these units follows.
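Purely for illustration, the units described above could be modelled as the following abstract interfaces; the class and method names are assumptions of this sketch, and a real apparatus would realize them in hardware (microphone array, transceiver) or in software on a processor.

```python
from abc import ABC, abstractmethod

class PickupUnit(ABC):
    """Supports the first electronic device in picking up sound."""
    @abstractmethod
    def pick_up(self) -> bytes:
        """Return the first to-be-recognized voice as raw audio."""

class ReceivingUnit(ABC):
    """Supports the first electronic device in receiving audio information."""
    @abstractmethod
    def receive_audio_info(self) -> dict:
        """Return audio information sent by the second electronic device that is playing audio."""

class NoiseReductionUnit(ABC):
    """Supports the first electronic device in performing noise reduction."""
    @abstractmethod
    def denoise(self, first_voice: bytes, audio_info: dict) -> bytes:
        """Return the second to-be-recognized voice after noise reduction using the audio information."""
```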
  • the present application provides a readable medium on which instructions are stored; when the instructions are executed on an electronic device, the electronic device is caused to perform the multi-device-based voice processing method in any of the above-mentioned first to fourth aspects.
  • the present application provides an electronic device, comprising: one or more processors and one or more memories; the one or more memories store one or more programs which, when executed by the one or more processors, cause the electronic device to execute the multi-device-based voice processing method in any of the above-mentioned first to fourth aspects.
  • the electronic device may further include a transceiver (which may be a separate or integrated receiver and transmitter) for receiving and transmitting signals or data.
  • the present application provides an electronic device, comprising: a processor, a memory, a communication interface, and a communication bus; the memory is used to store at least one instruction; the processor, the memory, and the communication interface are connected through the communication bus; and when the processor executes the at least one instruction stored in the memory, the electronic device executes the multi-device-based voice processing method in any of the above-mentioned first to fourth aspects.
  • FIG. 1 is a schematic diagram of a scenario of multi-device-based voice processing provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice assistant interaction session provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for multi-device-based voice processing provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of another method for multi-device-based voice processing provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of another method for multi-device-based voice processing provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of another method for multi-device-based voice processing provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of another method for voice processing based on multiple devices provided by an embodiment of the present application.
  • FIG. 12 shows a schematic structural diagram of an electronic device according to some embodiments of the present application.
  • Illustrative embodiments of the present application include, but are not limited to, multi-device based speech processing methods, media, and electronic devices.
  • the multi-device scenario of the multi-device-based speech processing application provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
  • FIG. 1 shows a multi-device scenario of a multi-device-based voice processing application provided by an embodiment of the present application.
  • the multi-device scenario 10 shows only three electronic devices, namely electronic device 101, electronic device 102, and electronic device 103, but it is understood that the technical solutions of the present application are applicable to a multi-device scenario that can include any number of electronic devices, not limited to three.
  • an answering device may be elected from a plurality of electronic devices, for example, the electronic device 101 may be elected as the answering device.
  • the answering device selects the sound pickup device with the best sound pickup effect (eg, the electronic device with the best voice enhancement effect) from the multiple devices.
  • the electronic device 101 elects the electronic device 103 as the sound pickup device.
  • the voice data picked up by the voice pickup device (such as the electronic device 103) is sent to the answering device, which can receive, recognize, and respond to the voice data, so that the quality of the voice data processed by the answering device is better.
  • in addition, if there is an internal noise device that is playing audio in the multi-device scenario, noise reduction processing can be performed on the voice data picked up by the sound pickup device according to the noise reduction information of the internal noise device, so as to further improve the quality of the voice data processed by the answering device. Therefore, even if the answering device elected in the multi-device scenario (such as the electronic device closest to the user) has a poor sound pickup effect, or there is noise generated by an electronic device that is playing audio, multiple devices can cooperate to pick up and recognize voice data with better audio quality, without requiring the user to move or to manually control a specific electronic device to pick up sound. Further, this helps to improve the accuracy of speech recognition during the voice control process and to improve the user experience.
  • the electronic devices 101-103 in the multi-device scenario 10 are interconnected via a wireless network, for example a wireless fidelity (Wi-Fi) network, Bluetooth (BT), near field communication (NFC), or another wireless network, but not limited thereto.
  • the same group of devices has identification information of each device, so that the group of devices can communicate with each other according to their respective identification information.
  • the types of wireless networks between different electronic devices in a multi-device scenario may be the same or different.
  • the electronic device 101 and the electronic device 102 are connected through a Wi-Fi network, and the electronic devices 101 and 103 are connected through Bluetooth.
  • the types of electronic devices in the multi-device scenario may be the same or different.
  • electronic devices suitable for use in the present application may include, but are not limited to, mobile phones, tablet computers, desktop computers, laptop computers, handheld computers, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), augmented reality (AR)/virtual reality (VR) devices, media players, smart TVs, smart speakers, smart watches, smart headphones, and the like.
  • the types of electronic devices 101-103 shown in FIG. 1 are all different, and are illustrated by taking a mobile phone, a tablet computer, and a smart TV as examples.
  • the embodiments of the present application do not specifically limit the specific form of the electronic device.
  • FIG. 12 For the specific structure of the electronic device, reference may be made to the description corresponding to FIG. 12 below, which will not be repeated here.
  • the electronic devices in the multi-device scenario all have a voice control function, for example, voice assistants with the same wake-up words are installed, for example, the wake-up words are all "Xiaoyi Xiaoyi".
  • the electronic devices in the multi-device scenario are all within the effective working range of the voice assistant.
  • for example, the distance from the user (that is, the pickup distance) does not exceed a preset distance (such as 5 m), the screen is in use (for example, the screen is placed face up, or the screen cover is not closed), Bluetooth is not turned off, the Bluetooth communication range is not exceeded, and so on, but not limited thereto.
  • a voice assistant is an application (APP) based on artificial intelligence. With the help of speech and semantic recognition algorithms, it helps users complete operations such as information query, device control, and text input through instant question-and-answer voice interaction.
  • the voice assistant can be a system application in an electronic device or a third-party application.
  • voice assistants usually use staged cascade processing, in which processes such as voice wake-up, speech enhancement (SE) processing (also called voice front-end processing), automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), text-to-speech (TTS), and response output are performed in sequence to realize the above functions, as sketched below.
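The staged cascade can be pictured, under heavy simplification, as a chain of functions in which each stage consumes the previous stage's output. All of the stage implementations below are stand-in stubs invented for this sketch; none of them reflect a real voice assistant implementation.

```python
def detect_wake_word(audio: str) -> bool:
    return audio.lower().startswith("xiaoyi xiaoyi")       # voice wake-up (stub)

def speech_enhancement(audio: str) -> str:
    return audio                                           # SE / voice front-end processing (stub)

def asr(audio: str) -> str:
    return audio.split(",", 1)[-1].strip()                 # ASR: audio -> text (stub)

def nlu(text: str) -> dict:
    return {"intent": "query", "text": text}               # NLU: text -> intent (stub)

def dialog_management(intent: dict) -> dict:
    return {"action": "answer", **intent}                  # DM: decide the response (stub)

def nlg(action: dict) -> str:
    return "Here is the information you asked about."      # NLG: response -> text (stub)

def tts(reply: str) -> bytes:
    return reply.encode()                                  # TTS: text -> response audio (stub)

def voice_assistant_cascade(audio: str) -> bytes:
    """Staged cascade: wake-up -> SE -> ASR -> NLU -> DM -> NLG -> TTS -> response output."""
    if not detect_wake_word(audio):
        return b""
    return tts(nlg(dialog_management(nlu(asr(speech_enhancement(audio))))))

# Example: voice_assistant_cascade("Xiaoyi Xiaoyi, what's the weather like in Beijing tomorrow?")
```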
  • the voice data picked up by the electronic device in the present application is the voice data directly collected by the microphone, or the voice data processed by SE after the collection, and is used for input to the ASR for processing.
  • the text result of the voice data output by the ASR is the basis for the voice assistant to accurately complete subsequent operations such as recognizing and responding to the voice data. Therefore, the quality of the voice data that the voice assistant obtains and inputs to the ASR affects the accuracy of the voice assistant's recognition of and response to the voice data.
  • in view of this, the embodiment of the present application comprehensively considers a variety of factors in a multi-device scenario when performing the multi-device-based speech processing flow.
  • the factors that affect the pickup effect of electronic equipment include environmental factors 1)-3) and equipment factors 4)-6), as follows:
  • the microphone module of the electronic device such as whether the microphone module is a single microphone or a microphone array, whether it is a near-field microphone array or a far-field microphone array, and the cutoff frequency of the microphone module.
  • the microphone array has a better sound pickup effect than a single microphone.
  • the far-field microphone array has a better sound pickup effect than the near-field microphone array, and the higher the cutoff frequency of the microphone module, the better the sound pickup effect.
  • the SE capability of the electronic device, such as the noise reduction performance of the microphone module of the electronic device and the AEC capability of the electronic device, for example whether the AEC of the electronic device is in effect. If the noise reduction performance of the microphone module is better, or the AEC of the electronic device is in effect, the SE processing effect of the electronic device is better, that is, the sound pickup effect of the electronic device is better.
  • the noise reduction performance of the microphone array is better than that of a single microphone.
  • the device status of the electronic device, such as one or more factors including the device's network connection status, headset connection status, microphone occupancy status, and profile information. For example, if the electronic device is in a device state that is not conducive to its sound pickup, such as a poor network connection status, a connected wired or wireless headset, an already occupied microphone, or airplane mode, the sound pickup effect of the electronic device is difficult to guarantee, or the electronic device cannot normally cooperate with other devices to pick up sound, for example it cannot normally send the voice data obtained by pickup to other electronic devices.
  • FIG. 3 to FIG. 11 propose various embodiments of co-processing speech among multiple electronic devices according to the above-mentioned different influencing factors.
  • FIG. 3 shows a scenario of co-processing voice among multiple electronic devices in different deployment locations.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a are interconnected through a wireless network and are respectively deployed at different distances from the user. For example, they are deployed at positions 0.3 meters (m), 1.5 m, and 3.0 m away from the user, respectively.
  • the mobile phone 101a is held by the user, the tablet computer 102a is placed on the desktop, and the smart TV 103a is wall mounted on the wall.
  • the ambient noise is less than or equal to 20 decibels (dB), and there is no internal noise generated by electronic devices that play external audio in this scenario. Therefore, it is not necessary to consider the influence of external noise and internal noise on the sound pickup effect of electronic devices, but mainly consider the deployment location of electronic devices, such as the influence of which electronic device is closest to the user on the multi-device-based voice processing.
  • FIG. 4 is a flowchart of a specific method for collaboratively processing speech in the scenario shown in FIG. 3 .
  • the process of the method for cooperatively processing speech by the mobile phone 101a, the tablet computer 102a and the smart TV 103a includes:
  • Step 401 The mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively pick up the first voice data corresponding to the wake-up word spoken by the user.
  • the pre-registered wake-up words in the mobile phone 101a, the tablet computer 102a and the smart TV 103a are all "Xiaoyi Xiaoyi”.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a can all detect the voice corresponding to "Xiaoyi Xiaoyi", and then determine whether to wake up the corresponding voice assistant.
  • the electronic device can monitor the corresponding voice data through the microphone and cache it.
  • electronic devices such as the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, in the absence of other software or hardware using the microphone to pick up voice data, can use the microphone to monitor in real time whether the user has voice input, and cache the picked-up voice data as the above-mentioned first voice data.
  • Step 402 The mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively check the picked up first voice data to determine whether the corresponding first voice data is a pre-registered wake-up word.
  • if the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all successfully verify the first voice data, it indicates that the picked-up first voice data is the wake-up word, and the following step 403 can be executed. If the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all fail to verify the first voice data, it indicates that the picked-up first voice data is not the wake-up word, and the following step 409 is performed.
  • electronic devices that successfully verify the first voice data corresponding to the wake-up word can be recorded in a list.
  • for example, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a are recorded in a list (e.g., a candidate answering device list).
  • the devices in the above candidate answering device list will be used to participate in the following multi-device answering election, so as to elect an electronic device that wakes up the voice assistant and recognizes the user's voice (ie, the answering device hereinafter).
  • the multi-device response election is performed among multiple devices that successfully detect the wake-up word, that is, among the electronic devices that successfully verify the first voice data.
  • Step 403 The mobile phone 101a, the tablet computer 102a and the smart TV 103a elect the smart TV 103a as the answering device.
  • the answering device is generally an electronic device that the user is used to or tends to use, or an electronic device that has a high probability of success in recognizing and responding to the user's voice data.
  • the answering device is used to recognize and respond to the user's voice data, such as performing processing steps such as ASR and NLU on the voice data.
  • there is usually only one answering device, that is, one electronic device elected from the list of candidate answering devices. After the elected electronic device (such as the smart TV 103a), as the answering device, wakes up its voice assistant, it can play a wake-up response tone, such as "I'm here".
  • electronic devices other than the answering device, such as the mobile phone 101a and the tablet computer 102a, do not respond, that is, do not output a wake-up response tone.
  • the answering device (such as the smart TV 103a) may perform a cooperative voice pickup election to elect a voice pickup device, and the following step 404 is specifically performed.
  • Step 404 The smart TV 103a obtains the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, respectively, and elects the mobile phone 101a as the sound pickup device according to the sound pickup election information.
  • the sound pickup election information may be a parameter used to determine whether the sound pickup effect of each electronic device is good or bad.
  • the sound pickup election information may include at least one of: the detected voice information of the user's voice (such as the voice information of the first voice data), the microphone module information of each electronic device, the device status information of each electronic device, and the AEC capability information of each electronic device.
  • the information used for the election of the sound pickup device may also include other information, as long as the information that can evaluate the sound pickup function of the electronic device is applicable, and no limitation is imposed herein.
  • the sound information may include a signal-to-noise ratio (Signal to Noise Ratio, SNR), sound intensity (or energy value), reverberation parameters (such as reverberation delay), and the like.
  • therefore, the voice information of the user's voice can be used to elect a sound pickup device, for example as in the sketch below.
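As an illustration of the voice information described above, a candidate device could estimate the SNR and energy of its picked-up wake-word audio as follows; treating the first 0.1 s of the buffer as the noise floor is an assumption of this sketch.

```python
import numpy as np

def wakeword_voice_info(wake_audio: np.ndarray, noise_samples: int = 1600) -> dict:
    """Return illustrative voice information (SNR and energy) for a picked-up wake word,
    assuming 16 kHz audio whose first 0.1 s contains only background noise."""
    noise_power = float(np.mean(wake_audio[:noise_samples].astype(float) ** 2)) + 1e-12
    signal_power = float(np.mean(wake_audio.astype(float) ** 2)) + 1e-12
    return {
        "snr_db": 10.0 * float(np.log10(signal_power / noise_power)),  # signal-to-noise ratio
        "energy": signal_power,                                        # sound intensity / energy value
    }
```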
  • the microphone module information is used to indicate whether the microphone module of the electronic device is a single microphone or a microphone array, whether it is a near-field microphone array or a far-field microphone array, and what the cutoff frequency of the microphone module is.
  • the noise reduction capability of the far-field microphone is higher than that of the near-field microphone, so the sound pickup effect of the far-field microphone is better than that of the near-field microphone.
  • the noise reduction capabilities of the single microphone, the line array microphone and the ring array microphone are successively improved, and the sound pickup effect of the corresponding electronic equipment is successively improved.
  • the higher the cutoff frequency of the microphone module, the better the noise reduction capability and the better the sound pickup effect of the corresponding electronic device. Therefore, the microphone module information can also be used to elect a pickup device.
  • the device status information refers to a device status that can affect the sound pickup effect of multiple electronic devices for cooperative sound pickup, such as network connection status, headphone connection status, microphone occupancy status, and scene mode information.
  • the scene modes include: driving mode, riding mode (such as bus mode, high-speed rail mode or airplane mode, etc.), walking mode, sports mode, home mode and other modes. These scene modes can be automatically determined by the electronic device reading and analyzing sensor information, short messages or emails, setting information or historical operation records and other information of the electronic device.
  • the sensor information is, for example, from a Global Positioning System (GPS), an inertial sensor, a camera, or a microphone.
  • if the headset connection state indicates that a headset is connected, it means that the electronic device is being used by the user and that the headset microphone, which is closer to the user, can be used; if the microphone occupancy state indicates that the microphone module is occupied, it means that the electronic device may not be able to pick up sound through the microphone module; if the network connection status indicates that the wireless network of the electronic device is poor, the success rate of the electronic device transmitting information through the wireless network, such as the success rate of sending the sound pickup election information to the answering device, is affected. If the scene mode is one of the above scene modes, such as driving mode or riding mode, it means that the stability and/or connection rate of the electronic device's wireless network connection may be low, which in turn affects the success rate of the electronic device participating in the pickup election process or the collaborative pickup process. Therefore, the above-mentioned device status information can also be used to elect a sound pickup device.
  • the AEC capability information is used to indicate whether the electronic device has the AEC capability and whether the AEC of the electronic device is valid.
  • the AEC capability is specifically the AEC capability of the microphone module in the electronic device. It can be understood that, compared with electronic devices that do not have AEC in effect or do not have AEC capabilities, electronic devices with AEC in effect have better SE processing capabilities, better noise reduction performance, and better sound pickup effects. Therefore, the above AEC capability information can also be used to elect a pickup device.
  • the electronic equipment for which AEC takes effect is usually the electronic equipment that is playing audio.
  • AEC is a speech enhancement technology that cancels the echo formed when audio played by the speaker travels through the air back to the microphone, thereby improving the quality of the voice data picked up by the electronic device.
  • SE is used to preprocess, by means of hardware or software, the user's voice data collected by the microphone of the electronic device, using audio signal processing algorithms such as reverberation cancellation, AEC, blind source separation and beamforming, so as to improve the quality of the obtained voice data.
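  • The source only names the SE algorithm families, so the following Python sketch shows just the shape of such a preprocessing chain with placeholder stages; the function names and the channel-averaging stand-in for beamforming are assumptions, not the actual algorithms used.
```python
# Sketch of an SE preprocessing chain: beamforming -> AEC -> dereverberation.
# Stage bodies are placeholders; only the data flow is illustrated.
from typing import Optional
import numpy as np

def beamform(frames: np.ndarray) -> np.ndarray:
    # Naive stand-in for beamforming: average the microphone channels.
    return frames.mean(axis=0)

def acoustic_echo_cancel(mono: np.ndarray, playback_ref: Optional[np.ndarray]) -> np.ndarray:
    # Placeholder: a real AEC would adaptively subtract the echo of `playback_ref`.
    return mono

def dereverberate(mono: np.ndarray) -> np.ndarray:
    # Placeholder for reverberation cancellation.
    return mono

def speech_enhance(mic_frames: np.ndarray, playback_ref: Optional[np.ndarray] = None) -> np.ndarray:
    """mic_frames: (channels, samples) raw capture; returns one enhanced channel."""
    return dereverberate(acoustic_echo_cancel(beamform(mic_frames), playback_ref))
```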
  • the smart TV 103a can elect a sound pickup device based on the sound pickup election information of each electronic device, and the specific election scheme will be described in detail below. For the convenience of description, it is assumed that the smart TV 103a elects the mobile phone 101a as the sound pickup device.
  • remote peripheral virtualization technology can be used, with the sound pickup device, or the microphone of the sound pickup device, serving as a virtual peripheral node of the answering device that is called by the voice assistant running on the answering device to complete the subsequent cross-device sound pickup process.
  • the answering device may send a voice pickup instruction to the electronic device to instruct the electronic device to pick up the user's voice data.
  • the answering device may send a voice-picking stop instruction to other electronic devices other than the voice-picking device in the multi-device scenario, so as to instruct these electronic devices to no longer pick up the user's voice data.
  • if the other electronic devices in the multi-device scenario, other than the sound pickup device, do not receive any indication within a period of time (such as 5 seconds) after sending the sound pickup election information to the answering device, these electronic devices determine that they are not the sound pickup device.
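  • The timeout behaviour described above can be pictured with the small Python sketch below; the queue-based messaging, the role strings and the use of the 5-second example value are illustrative assumptions.
```python
# Sketch: a non-elected device waits a bounded time for an instruction, then
# concludes that it is not the sound pickup device.
import queue
import threading

PICKUP_TIMEOUT_S = 5.0  # "a period of time (such as 5 seconds)"

def await_role(instructions: "queue.Queue[str]") -> str:
    """Returns 'pickup', 'stop', or 'not_pickup' if nothing arrives in time."""
    try:
        return instructions.get(timeout=PICKUP_TIMEOUT_S)
    except queue.Empty:
        return "not_pickup"

if __name__ == "__main__":
    q: "queue.Queue[str]" = queue.Queue()
    # Simulate the answering device sending a pickup instruction after one second.
    threading.Timer(1.0, lambda: q.put("pickup")).start()
    print(await_role(q))  # -> 'pickup'
```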
  • Step 405 The mobile phone 101a picks up the second voice data corresponding to the voice command spoken by the user.
  • the mobile phone 101a is used as the sound pickup device to pick up the various voice commands spoken by the user. For example, when the user speaks the voice command "What's the weather like in Beijing tomorrow?", the mobile phone 101a either collects the voice command directly through its microphone module to obtain the second voice data, or the microphone module in the mobile phone 101a collects the voice command and the second voice data is obtained after SE processing.
  • the "voice instruction” that appears alone in this embodiment of the present application may be a voice instruction corresponding to an event or operation received by the electronic device after waking up the voice assistant.
  • the user's voice instruction is the above-mentioned "how is the weather tomorrow?" or "play music”.
  • the terms "voice", "voice instruction" and "voice data" in this document are sometimes used interchangeably; it should be noted that, where the distinction is not emphasized, they express the same meaning.
  • Step 406 The mobile phone 101a sends the second voice data to the smart TV 103a.
  • the mobile phone 101a, as the sound pickup device, directly forwards the voice data of the voice command to the answering device after picking up the voice command issued by the user, and does not itself recognize or respond to the voice command.
  • if the answering device and the sound pickup device are the same device, this step is not required, and the device directly performs speech recognition on the voice data after picking up the user's voice command.
  • Step 407 The smart TV 103a recognizes the second voice data.
  • after the smart TV 103a, as the answering device, receives the voice data picked up by the mobile phone 101a, it can recognize the noise-reduced second voice data through the hierarchical processing procedures of ASR, NLU, DM, NLG and TTS.
  • ASR can convert the SE-processed second voice data into corresponding text, and perform text processing such as normalization, error correction and conversion of spoken language into written form, for example obtaining the text "How will the weather in Beijing be tomorrow?".
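  • The hierarchical ASR/NLU/DM/NLG/TTS processing can be pictured as a simple staged pipeline; the Python sketch below only illustrates the data flow, with stand-in stage bodies and an example utterance, none of which come from the source.
```python
# Sketch of the staged recognition pipeline: ASR -> NLU -> DM -> NLG -> TTS.
# Stage bodies are stand-ins; only the chaining is illustrated.

def asr(voice_data: bytes) -> str:
    return "How will the weather in Beijing be tomorrow?"  # stand-in transcription

def nlu(text: str) -> dict:
    return {"intent": "query_weather", "city": "Beijing", "date": "tomorrow"}

def dialog_manage(semantics: dict) -> dict:
    return {"action": "answer", "content": "Tomorrow will be sunny in Beijing"}

def nlg(decision: dict) -> str:
    return decision["content"]

def tts(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in synthesized audio

def recognize_and_respond(second_voice_data: bytes) -> bytes:
    return tts(nlg(dialog_manage(nlu(asr(second_voice_data)))))
```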
  • Step 408 The smart TV 103a responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
  • for the recognized voice command of the user, if the answering device can execute it, or it can only be executed by the answering device, the answering device makes the response corresponding to the voice command. For example, for the above-mentioned voice command "How will the weather in Beijing be tomorrow?", the smart TV 103a answers "Tomorrow will be sunny in Beijing", and for the voice command "Please turn off the TV", the smart TV 103a performs the shutdown operation.
  • the above-mentioned voice "Tomorrow will be sunny in Beijing” is the answer voice output by the answering device through the TTS.
  • the answering device can also control the system software, display screen, vibration motor and other software and hardware to perform answering operations, such as displaying the answer text generated by NLG through the display screen.
  • the answering device can send the voice commands to the corresponding electronic devices after recognizing the voice commands. For example, for the voice command "open the curtains", after the smart TV 103a recognizes that the response operation is to open the curtains, it can send the operation instruction to open the curtains to the smart curtains, so that the smart curtains can complete the action of opening the curtains through hardware.
  • the above-mentioned other electronic devices may be Internet of Things (The Internet of Things, IOT) devices, such as smart home devices such as smart refrigerators, smart water heaters, and smart curtains.
  • the above-mentioned other electronic devices may not have a voice control function, for example, no voice assistant is installed; in that case, the other electronic devices perform the operations corresponding to the user's voice commands when triggered by the answering device.
  • the user can continue to speak the subsequent voice command data stream, such as the voice command "What clothes should you wear tomorrow?".
  • Step 409 The mobile phone 101a, the tablet computer 102a and the smart TV 103a do not respond to the first voice data, and delete the cached first voice data.
  • the wake-up response voice "I am here" will not be output to the user.
  • the user continues to speak a voice command, such as "How is the weather in Beijing tomorrow?", these devices will not respond to the voice data corresponding to the voice command.
  • in this way, after the user speaks a voice command, the user does not need to specifically operate an electronic device to pick up the voice command (for example, the voice command corresponding to the second voice data); instead, the answering device automatically uses the sound pickup device as a peripheral to pick up the user's voice command, and the voice control function is realized through the answering device's response to the user's voice command.
  • the multi-device-based voice processing method provided by the embodiments of the present application can, through the interaction and cooperation of multiple electronic devices, select the electronic device with the best audio quality for picking up the voice command as the sound pickup device, so as to support the answering device in completing the voice interaction process with the user through the voice assistant.
  • the pickup device can be an electronic device that is closest to the user and has better SE processing capability.
  • the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, the impact of the deployment location of electronic devices on the recognition accuracy of voice assistants in multi-device scenarios is alleviated, and the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios are improved.
  • the electronic device in the multi-device scenario may perform multi-device response election according to at least one of the following response election strategies, and elect a response device:
  • Answering strategy 1 Elect the electronic device closest to the user as the answering device.
  • the mobile phone 101a can be elected as the answering device.
  • the distance between the electronic device and the user can be represented by the sound information of the voice data corresponding to the wake-up word picked up by the electronic device. For example, the higher the signal-to-noise ratio of the first voice data, the higher the sound intensity, and the lower the reverberation delay, it means that the electronic device is closer to the user.
  • Answering strategy 2 Elect the electronic device actively used by the user as the answering device.
  • if the electronic device is actively used by the user, for example, the user has recently picked it up and lit its screen, it means that the user may be using the electronic device and is more inclined to use it to recognize and respond to the user's voice data.
  • whether the electronic device is actively used by the user can be characterized by the device usage record information.
  • the device usage record information includes at least one of the following: screen-on duration, screen-on frequency, frequency of using the voice assistant, and the like. It can be understood that the longer the screen-on duration, the higher the screen-on frequency and the higher the frequency of using the voice assistant, the higher the degree to which the user actively uses the electronic device. For example, according to the device usage record information of the mobile phone 101a, the tablet computer 102a and the smart TV 103a, the smart TV 103a, which is actively used by the user, can be elected as the answering device.
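  • As a sketch of response strategy 2, the device usage record could be reduced to a single score and the highest scorer elected; the field names, weights and example numbers below are assumptions for illustration only.
```python
# Sketch: elect the answering device from device usage records.

def usage_score(record: dict) -> float:
    return (
        1.0 * record.get("screen_on_minutes", 0)
        + 5.0 * record.get("screen_on_count", 0)
        + 10.0 * record.get("voice_assistant_uses", 0)
    )

def elect_answering_device(usage_records: dict) -> str:
    return max(usage_records, key=lambda dev: usage_score(usage_records[dev]))

if __name__ == "__main__":
    records = {
        "phone_101a": {"screen_on_minutes": 30, "screen_on_count": 5, "voice_assistant_uses": 1},
        "tv_103a": {"screen_on_minutes": 180, "screen_on_count": 2, "voice_assistant_uses": 6},
    }
    print(elect_answering_device(records))  # -> tv_103a with these assumed numbers
```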
  • Answering strategy 3 Election of an electronic device equipped with a far-field microphone array as the answering device.
  • whether the electronic device is equipped with a far-field microphone array is characterized by the microphone module information.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a select the smart TV 103a equipped with the far-field microphone array as the answering device according to the microphone module information.
  • Answering strategy 4 Election of public equipment as answering equipment.
  • whether the electronic device is a public device can also be characterized by the public device indication information.
  • the public device indication information of the smart TV 103a indicates that the smart TV 103a is a public device, and the multi-device scenario 11 elects the smart TV 103a as the answering device.
  • for other details of response strategy 4, reference may be made to the relevant description of response strategy 3, which will not be repeated here.
  • any one of the electronic devices may be selected as the answering device.
  • any electronic device that successfully checks the first voice data may also be selected as the response device.
  • any electronic device in the multi-device scenario may act as a master device to perform the step of electing a response device.
  • the mobile phone 101a as the master device elects the smart TV 103a as the response device, and sends a response instruction to the smart TV 103a to instruct the smart TV 103a to subsequently recognize and respond to the voice data corresponding to the user's voice command.
  • the master device may also send instructions to the other electronic devices in the multi-device scenario, except the answering device, so as to instruct these electronic devices not to recognize the user's voice command.
  • if the other electronic devices in the multi-device scenario, other than the answering device, do not receive any indication within a preset time (for example, 10 seconds) after successfully verifying the first voice data, these electronic devices determine that they are not the answering device.
  • each electronic device in a multi-device scenario may perform the operation of an election answering device.
  • the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all perform multi-device response election, and respectively elect the smart TV 103a as the response device.
  • the smart TV 103a can determine that it is an answering device, and then wake up the voice assistant to recognize and respond to the voice data corresponding to the user's voice command.
  • the mobile phone 101a and the tablet computer 102a respectively determine that they are not answering devices, and do not recognize and respond to the user's voice command.
  • the electronic device performing the multi-device response election obtains the response election information of each electronic device in the multi-device scenario, and elects the response device according to the response election information.
  • the response election information of an electronic device includes at least one of the following: sound information of the first voice data, device usage record information, microphone module information, and public device indication information, but is not limited thereto.
  • after the answering device obtains the response election information of each electronic device, the information may be cached.
  • the smart TV 103a may receive the corresponding voice-picking election information respectively sent by the mobile phone 101a and the tablet computer 102a, and read its own voice-picking election information.
  • the sending order of the sound pickup election information corresponding to the mobile phone 101a and the tablet computer 102a, and the sending order of the different items within each piece of sound pickup election information, are not limited and can be any achievable sending order.
  • if the answering device has already calculated and cached some information of each electronic device in the above step 403, such as the sound information of the first voice data, then in step 404 the cached information can be read without recomputing it.
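  • The election information and the caching behaviour described above could be carried by a simple record type plus a cache, as in the Python sketch below; the field names and the get_or_compute interface are assumptions, not the actual data structures of the source.
```python
# Sketch: a record for per-device election information and a cache that avoids
# recomputing values already produced in an earlier step.
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class ElectionInfo:
    device_id: str
    sound_info: Optional[dict] = None    # e.g. SNR, intensity, reverberation delay
    usage_record: Optional[dict] = None
    mic_module: Optional[dict] = None
    is_public_device: Optional[bool] = None

class ElectionCache:
    def __init__(self) -> None:
        self._store: Dict[str, ElectionInfo] = {}

    def get_or_compute(self, device_id: str,
                       compute: Callable[[str], ElectionInfo]) -> ElectionInfo:
        if device_id not in self._store:   # reuse what an earlier step already produced
            self._store[device_id] = compute(device_id)
        return self._store[device_id]
```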
  • the embodiment of the present application can comprehensively consider different information in the sound pickup election information corresponding to the electronic device, that is, different factors affecting the sound pickup effect of the electronic device, and set a sound pickup election strategy to compare the sound pickup effect in a multi-device scenario.
  • the multi-device voice pickup election is performed between multiple devices that successfully detect the wake-up word, that is, between electronic devices that successfully verify the first voice data.
  • the devices in the above-mentioned candidate answering device list may be used to participate in a multi-device sound pickup election to elect a sound pickup device.
  • the candidate answering device list may be referred to as a candidate sound pickup device list.
  • the electronic devices in the candidate sound pickup device list can all be used as candidate sound pickup devices, that is, electronic devices that participate in the sound pickup election based on the sound pickup election information; for example, the above-mentioned mobile phone 101a, tablet computer 102a and smart TV 103a can all be used as candidate sound pickup devices.
  • an electronic device with a better sound pickup effect in the above-mentioned candidate sound pickup device list may be used as a sound pickup device through an end-to-end method such as an artificial neural network, an expert system, etc., using a sound pickup election strategy. Specifically, taking the sound pickup election information corresponding to each electronic device in the candidate sound pickup device list as the input of the artificial neural network or the expert system, the output result of the artificial neural network or the expert system is the sound pickup device.
  • for example, the output result of the artificial neural network or the expert system is the mobile phone 101a, that is, the mobile phone 101a is elected as the sound pickup device.
  • the above artificial neural network can be a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a long short term memory network (Long Short Term Memory, LSTM) or a recurrent neural network (Recurrent Neural Network, RNN), etc., which are not specifically limited in the embodiments of the present application.
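  • The end-to-end idea above can be pictured as feeding a quantified feature vector per device into a small network and electing the highest-scoring device; the toy network below uses untrained, randomly initialized weights and assumed features, purely to illustrate the interface.
```python
# Sketch: score each device's quantified election information with a tiny MLP
# and elect the device with the highest score. Weights are illustrative, not trained.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # assumes 4 input features per device
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def score(features: np.ndarray) -> float:
    hidden = np.tanh(W1 @ features + b1)
    return float((W2 @ hidden + b2)[0])

def elect_pickup(feature_vectors: dict) -> str:
    """feature_vectors maps device -> quantified election information vector."""
    return max(feature_vectors, key=lambda dev: score(feature_vectors[dev]))

if __name__ == "__main__":
    devices = {
        "phone_101a": np.array([0.9, 0.8, 0.1, 1.0]),   # e.g. SNR, intensity, reverb, AEC flag
        "tablet_102a": np.array([0.5, 0.4, 0.3, 0.0]),
    }
    print(elect_pickup(devices))
```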
  • the method of cascading processing in stages can be used to implement a sound pickup election strategy to use an electronic device with a better sound pickup effect in the candidate sound pickup device list as a sound pickup device.
  • feature extraction or numerical quantification may be performed on each parameter vector (that is, each item of sound pickup election information) corresponding to each electronic device in the candidate sound pickup device list, and then an algorithm such as a decision tree or logistic regression may be used to make the decision and output the selection result of the sound pickup device.
  • for example, feature extraction or numerical quantification can be performed on each parameter vector in the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a, and then an algorithm such as a decision tree or logistic regression decides and outputs the mobile phone 101a as the selection result, that is, the mobile phone 101a is elected as the sound pickup device.
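  • The cascaded approach can likewise be sketched as quantify-then-decide; the hand-written logistic scoring below merely stands in for a trained decision tree or logistic regression, and all feature names, weights and scalings are assumptions.
```python
# Sketch: quantify each election parameter, then decide with a logistic-style score.
import math

def quantify(info: dict) -> list:
    return [
        info.get("snr_db", 0.0) / 40.0,
        info.get("intensity_db", 0.0) / 90.0,
        1.0 - min(info.get("reverb_delay_ms", 0.0) / 500.0, 1.0),
        1.0 if info.get("aec_active") else 0.0,
    ]

def logistic_score(features: list, weights=(2.0, 1.0, 1.0, 1.5), bias=-1.0) -> float:
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def cascade_elect(election_info: dict) -> str:
    """election_info maps device -> raw sound pickup election information."""
    return max(election_info, key=lambda dev: logistic_score(quantify(election_info[dev])))
```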
  • the answering device may perform a multi-device pickup election process through at least one of the first type of coordinated pickup strategy and the second type of coordinated pickup strategy.
  • the process may include a two-part process.
  • the first part of the process is that the answering device first removes, through the first type of cooperative sound pickup strategy, some disadvantaged devices that are obviously not suitable for participating in the subsequent cooperative sound pickup from the list of candidate sound pickup devices, or directly decides to select the answering device as the most suitable sound pickup device.
  • the second part of the process is that the answering device selects the electronic device with better sound pickup effect as the sound pickup device according to the sound pickup election information corresponding to each electronic device in the candidate sound pickup device list through the second type of cooperative sound pickup strategy. It can be understood that if the first part of the process does not decide a sound pickup device, then the second part of the process will be executed to elect a sound pickup device.
  • the above-mentioned first type of cooperative voice pickup strategy may include at least one of the following strategies a1) to a6).
  • a1) Determine the electronic device that is connected to the headset and is not an answering device as a non-candidate pickup device.
  • the state that the electronic device has been connected to the earphone is indicated by the earphone connection state information. Specifically, if an electronic device is connected to a wired or wireless headset and is not an answering device, since the electronic device only supports close-range pickup through the headset microphone, the electronic device has a high probability of being far away from the user or not currently being used by the user, Selecting this device as a voice pickup device may cause speech recognition failure, so the electronic device is marked as a non-candidate voice pickup device that is not suitable for participating in the multi-device voice pickup election, and is removed from the candidate voice pickup device list. It can be understood that non-candidate pickup devices will not participate in the multi-device pickup election, that is, they will not be elected as pickup devices.
  • a state in which the electronic device is in a preset network state is indicated by the network connection state information. Specifically, if the network status of an electronic device is poor (such as a low network communication rate, a weak wireless network signal, or frequent recent network disconnections) and it is not the answering device, then, in order to prevent data from being lost or delayed when the electronic device is called by the answering device, which would affect the subsequent cooperative sound pickup and voice interaction process, the electronic device is marked as a non-candidate device that is not suitable for participating in the multi-device sound pickup election and is removed from the candidate sound pickup device list.
  • the fact that the microphone module is in an occupied state is indicated by the microphone occupancy information. If the microphone module of an electronic device is occupied by an application other than a voice assistant (such as a voice recorder), and it is not an answering device, it is regarded as a non-candidate pickup device and removed from the list of candidate pickup devices. Specifically, if the microphone module of the electronic device is occupied by other applications, indicating that the electronic device may not be able to use the microphone module for sound pickup, the electronic device is marked as a device that is not suitable for participating in collaborative sound pickup.
  • if the network connection status of the answering device is poor, then, in order to avoid failures when the answering device calls other candidate sound pickup devices, the answering device is directly decided to be the most suitable sound pickup device, and the answering device, as the sound pickup device, calls its local microphone module for the subsequent sound pickup.
  • the answering device has been connected to a wired or wireless headset, then the answering device has a high probability of being the device closest to the user or the device being used by the user, so it is directly decided to select the answering device as a sound pickup device.
  • if the current scene is a preset scene mode (such as subway mode, flight mode, driving mode or travel mode), the electronic device corresponding to that scene mode can be directly decided as the sound pickup device to ensure system performance.
  • the electronic device with better noise reduction capability of the microphone can be fixedly selected as the sound pickup device.
  • in the travel mode, in order to avoid increased communication power consumption and reduced battery life of the devices, the answering device can be fixedly selected as the sound pickup device.
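  • Taken together, the first type of cooperative sound pickup strategy behaves like a filtering pass, as in the Python sketch below; the status field names and the subset of strategies shown are assumptions used only to illustrate the two possible outcomes (a direct decision or a reduced candidate list).
```python
# Sketch: first-stage filtering over the candidate sound pickup device list.
# Returns either a directly decided pickup device or the pruned candidate list.
from typing import List, Optional, Tuple

def first_stage(status: dict, answering: str) -> Tuple[Optional[str], List[str]]:
    """status maps device -> device status flags (assumed names)."""
    answering_status = status[answering]
    # Strategies that decide the answering device directly (poor network, headset in use).
    if answering_status.get("network_poor") or answering_status.get("headset_connected"):
        return answering, []
    remaining = []
    for device, flags in status.items():
        if device != answering and (flags.get("headset_connected")
                                    or flags.get("network_poor")
                                    or flags.get("mic_busy")):
            continue  # unsuitable for cooperative pickup: drop from the candidate list
        remaining.append(device)
    return None, remaining
```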
  • the above-mentioned second type of cooperative sound pickup strategy may include at least one of strategies b1) to b4).
  • the electronic device whose AEC capability information in the candidate sound pickup device list indicates that the AEC is valid is used as the sound pickup device, and the sound pickup effect of the electronic device with the AEC valid is better.
  • the electronic devices for which AEC takes effect are usually electronic devices that are playing audio.
  • if an electronic device is playing audio but does not have the AEC capability or its AEC does not take effect, the played audio will seriously interfere with the electronic device's own sound pickup effect.
  • if the electronic device that is playing audio externally has the capability of reducing internal noise and its AEC is in effect, then the influence of the internal noise generated by the external audio on its sound pickup effect can be eliminated.
  • the electronic device whose microphone module information in the candidate sound pickup device list indicates a better noise reduction capability of the microphone module is used as the sound pickup device.
  • for example, the electronic device with a far-field microphone array acts as the sound pickup device. Specifically, by judging whether the microphone module of the electronic device is a near-field microphone or a far-field microphone, an electronic device with a far-field microphone, which has better noise reduction capability, can be selected as the sound pickup device.
  • the electronic device closest to the user in the candidate sound pickup device list is used as the sound pickup device.
  • if the voice data (such as the first voice data) corresponding to the user's voice picked up by an electronic device in the candidate sound pickup device list has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay, it indicates that the electronic device is closest to the user and has the best sound pickup effect.
  • the electronic device that is farthest from the external noise source in the candidate sound pickup device list is used as the sound pickup device.
  • if the voice data (such as the first voice data) corresponding to the user's voice picked up by an electronic device in the candidate sound pickup device list has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay, it indicates that the electronic device is farthest from the external noise source and has the best sound pickup effect.
  • the above-mentioned sound pickup election strategies (such as the first type of cooperative sound pickup strategy and the second type of cooperative sound pickup strategy) include, but are not limited to, the above examples. Specifically, for cases in which the sound pickup device satisfies multiple of strategies a1) to a6) and strategies b1) to b4), reference can be made to the above related descriptions in which the sound pickup device satisfies each sound pickup election strategy respectively, which will not be repeated here.
  • in addition, different priorities can be set in advance for different sound pickup election strategies, and the sound pickup device is elected preferentially according to the sound pickup election strategy with the higher priority.
  • the priority of a sound pickup election strategy may be the priority of a single sound pickup election strategy, or the priority of a combination of multiple sound pickup election strategies.
  • the priority of the combination of strategy b1) and strategy b3) is greater than the priority of strategy b3).
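  • The priority mechanism can be sketched as an ordered list of strategy functions, where the first strategy that yields a decision wins; the two strategy stand-ins and their ordering below are assumptions for illustration.
```python
# Sketch: apply second-type election strategies in priority order.

def by_aec(info: dict):      # stands in for strategy b1)
    live = [dev for dev, i in info.items() if i.get("aec_active")]
    return live[0] if len(live) == 1 else None

def by_closest(info: dict):  # stands in for strategy b3), using SNR as a proxy
    return max(info, key=lambda dev: info[dev].get("snr_db", 0.0))

PRIORITISED = [by_aec, by_closest]   # assumed priority order

def elect_with_priorities(info: dict) -> str:
    for strategy in PRIORITISED:
        winner = strategy(info)
        if winner:
            return winner
    return next(iter(info))          # fall back to any remaining candidate
```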
  • for example, in the multi-device scenario 11, the smart TV 103a is used as the answering device, and, according to the above strategies b2) and b3), the mobile phone 101a, which has better SE processing capability, stronger voice intensity or the highest signal-to-noise ratio, and the lowest reverberation delay, can be elected from the mobile phone 101a, the tablet computer 102a and the smart TV 103a as the sound pickup device.
  • the mobile phone 101a is the electronic device closest to the user, that is, 0.3 m away from the user. In this way, the influence of the deployment position of the electronic device on the sound pickup effect of the electronic device in the multi-device scenario can be avoided.
  • in some cases, the sound information of the voice data of the wake-up word, such as its sound intensity and signal-to-noise ratio, may not accurately reflect how well the electronic device will pick up the user's subsequent voice command. Therefore, the sound information of the voice data corresponding to the voice command uttered by the user after the wake-up word can be used as sound pickup election information to elect a sound pickup device.
  • Fig. 5 shows a flowchart of another multi-device-based voice processing method.
  • this method flow differs from the method flow shown in Fig. 4 in that the sound information of the voice data corresponding to the voice command spoken by the user is added to the sound pickup election information.
  • Steps 501 to 503 are the same as the above-mentioned steps 401 to 403, and are not repeated here.
  • Step 504 The smart TV 103a obtains the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
  • Step 505 The mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively pick up the voice command spoken by the user, and select the voice data corresponding to the voice within the first duration of the voice command as the third voice data.
  • the first duration is X seconds (eg, 3s).
  • the voice corresponding to the third voice data is any segment of the voice command spoken by the user whose duration is the first duration.
  • the voice within the above-mentioned first duration is the voice within the first X seconds in the voice command spoken by the user.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up only the third voice data, and do not pick up the voice data corresponding to the part of the voice command spoken by the user after the first X seconds.
  • the above-mentioned third voice data may be a piece of voice command uttered by the user before the voice command corresponding to the above-mentioned second voice data (for example, "How is the weather in Beijing tomorrow?").
  • in this way, each electronic device in the multi-device scenario can avoid the waste of device resources caused by performing the step of picking up the voice command spoken by the user for a long time.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively pick up the voice data corresponding to a complete voice command spoken by the user, such as the second voice data corresponding to "What's the weather in Beijing tomorrow?", and then select the third voice data of the first X seconds from the second voice data.
  • the above-mentioned third voice data may also be the voice data of the first X seconds counted from the beginning of the voice command corresponding to the above-mentioned second voice data (such as "How is the weather in Beijing tomorrow?") spoken by the user; for example, the third voice data corresponds to "Tomorrow ...".
  • Step 506 The smart TV 103a obtains the sound information of the third voice data obtained by the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
  • the sound information of the third speech data includes at least one of the following: a signal-to-noise ratio, a sound intensity (or energy value), a reverberation parameter, and the like.
  • the higher the signal-to-noise ratio, the higher the sound intensity and the lower the reverberation delay of the third voice data detected by the electronic device, the better the quality of the third voice data and the closer the electronic device is to the user.
  • the sound information of the third voice data can be used as sound pickup election information for electing the sound pickup device.
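  • Comparing the sound information of the third voice data across devices can be sketched as a simple lexicographic ranking; the field names, the ordering of criteria and the example figures are assumptions for illustration.
```python
# Sketch: prefer the device whose third voice data has the best SNR, intensity
# and (lowest) reverberation delay.

def sound_rank(sound: dict) -> tuple:
    return (
        sound.get("snr_db", 0.0),
        sound.get("intensity_db", 0.0),
        -sound.get("reverb_delay_ms", 0.0),   # lower delay ranks higher
    )

def closest_device(third_voice_sound: dict) -> str:
    return max(third_voice_sound, key=lambda dev: sound_rank(third_voice_sound[dev]))

if __name__ == "__main__":
    measured = {
        "phone_101a": {"snr_db": 28.0, "intensity_db": 72.0, "reverb_delay_ms": 40.0},
        "tv_103a": {"snr_db": 15.0, "intensity_db": 60.0, "reverb_delay_ms": 120.0},
    }
    print(closest_device(measured))  # -> phone_101a
```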
  • the mobile phone 101a and the tablet computer 102a can separately obtain the sound information of the third voice data, and then send the sound information of the third voice data to the smart TV 103a.
  • the mobile phone 101a and the tablet computer 102a can respectively send the detected third voice data to the smart TV 103a, and then the smart TV 103a calculates the sound information of the third voice data corresponding to the mobile phone 101a and the tablet computer 102a respectively.
  • Step 507 The smart TV 103a adds the voice information of the third voice data to the voice-picking election information, and elects the mobile phone 101a as the voice-picking device according to the voice-picking election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
  • steps 504 to 507 are similar to the above-mentioned step 404, and the similarities will not be repeated.
  • the difference is that the smart TV 103a additionally obtains the voice data corresponding to the first X seconds of the voice command spoken by the user (that is, the third voice data), so that the smart TV 103a can determine, according to the sound information of the third voice data detected by each electronic device, that the sound pickup device is the mobile phone 101a.
  • in some embodiments, in step 507, the smart TV 103a judges, according to the sound information of the third voice data corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively, whether the smart TV 103a satisfies the above election strategy b3) and/or b4). Specifically, if it is determined according to the sound information of the third voice data that the smart TV 103a is the electronic device closest to the user or the electronic device farthest from the noise, then the smart TV 103a is elected as the sound pickup device.
  • the third voice data detected in the candidate sound pickup device list has the highest sound intensity, the highest signal-to-noise ratio, and/or the lowest reverberation delay, indicating that the electronic device is closest to the user. At this time, the electronic device detects that the quality of the third voice data is the best, and the pickup effect is the best.
  • Steps 508-512 are similar to the above-mentioned steps 405-409, and are not repeated here.
  • in this way, when electing the sound pickup device in the multi-device scenario, the sound pickup device can be elected not only according to information such as the sound information of the first voice data corresponding to the user's wake-up word, but also according to the sound information of the voice data corresponding to the user's voice command (for example, the sound information of the third voice data corresponding to the voice of the first X seconds).
  • by using the sound information corresponding to the user's voice command as sound pickup election information to elect the sound pickup device, the accuracy of electing the sound pickup device can be further improved, thereby improving the accuracy of speech recognition in the multi-device scenario.
  • in some embodiments, when electing the sound pickup device, the distance between the electronic device and the external noise source, that is, the influence of external noise on the sound pickup effect of the electronic device, can be mainly considered. It can be understood that, if different electronic devices are of the same type and at the same distance from the user, the sound pickup effects of these electronic devices are otherwise the same.
  • FIG. 6 shows a multi-device scenario for multi-device-based speech processing under external noise interference. In this scenario, the mobile phone 101b, the mobile phone 102b and the smart TV 103b are interconnected through a wireless network and deployed at positions 1.5 m, 1.5 m and 3.0 m away from the user, respectively. The mobile phone 101b and the mobile phone 102b may be placed idle on a desktop, and the smart TV 103b may be wall-mounted.
  • the multi-device scenario 12 there is an external noise source 104 near the mobile phone 102b.
  • the external noise source can be a running air conditioner or other device that plays audio. Therefore, in this scenario, the influence of the external noise source 104 on the sound pickup effect of each electronic device is mainly considered, and the flow of voice processing based on multiple devices is performed.
  • FIG. 7 is a flowchart of a specific method for collaboratively processing speech based on FIG. 6 .
  • the process of the method for the mobile phone 101b, the mobile phone 102b, and the smart TV 103b to cooperatively process voice includes:
  • Steps 701 to 709 are similar to the above-mentioned steps 401 to 409, and the similarities are not repeated.
  • the execution subject changes.
  • the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a and smart TV 103a to mobile phone 101b, mobile phone 102b and smart TV 103b.
  • the answering device selected in step 703 is the smart TV 103b
  • the sound pickup device selected in step 704 is the mobile phone 101b.
  • in the multi-device scenario 12, the mobile phone 101b and the mobile phone 102b are both 1.5 m away from the user, so both are the electronic devices closest to the user.
  • the factor that distinguishes the sound pickup effects of the mobile phone 101b and the mobile phone 102b is only the distance from the external noise source 104 .
  • the distance between the mobile phone 101b and the external noise source 104 is farther.
  • for example, in the multi-device scenario 12, the smart TV 103b is used as the answering device, and, according to an election strategy that favors devices far away from external noise, such as strategy b4), the mobile phone 101b, which is farthest from the external noise source and has the highest voice sound intensity or signal-to-noise ratio and the lowest reverberation delay, is elected as the sound pickup device. In this way, the influence of external noise on the sound pickup effect of the electronic devices in the multi-device scenario can be avoided.
  • in some embodiments, the smart TV 103b in the multi-device scenario 12 can also acquire the sound information of the third voice data, corresponding to the voice within the first duration of the voice command spoken by the user, picked up by each electronic device, add that sound information to each item of sound pickup election information in step 705, and elect the mobile phone 101b, which is far away from the external noise source 104, as the sound pickup device, which will not be repeated here.
  • in this way, the sound pickup device may have one or more of the advantages of being closest to the user, having the capability of reducing internal noise (such as SE processing capability), and being far away from external noise sources.
  • the influence of external noise interference on the recognition accuracy of the voice assistant in the multi-device scenario can be alleviated, and the user interaction experience and the environmental robustness of speech recognition in the multi-device scenario can be improved.
  • the influence of the internal noise on the sound pickup effect of the multi-device cooperative sound pickup can be mainly considered. For example, by using an electronic device that emits audio as a sound pickup device, multi-device sound pickup selection can be realized.
  • FIG. 8 shows a multi-device-based speech processing scenario under the interference of internal noise.
  • in this multi-device scenario (referred to as multi-device scenario 13), the mobile phone 101c, the tablet computer 102c and the smart TV 103c are interconnected through a wireless network and deployed at positions 0.3 m, 1.5 m and 3.0 m away from the user, respectively.
  • the smart TV 103c is in the state of playing audio, and the smart TV 103c has internal noise reduction (ie, noise reduction capability) or AEC capability.
  • the volume of the audio played by the smart TV 103c is 60-80 dB, which will strongly interfere with the sound pickup effect of the mobile phone 101c and the tablet computer 102c. Therefore, in this scenario, the flow of voice processing based on multi-devices is performed mainly considering the influence of the internal noise of the smart TV 103c on the sound pickup effect of the electronic device.
  • FIG. 9 is a flow chart of a specific method for collaboratively processing speech in the multi-device scenario shown in FIG. 8 .
  • the process of the method for the mobile phone 101c, the tablet computer 102c, and the smart TV 103c to cooperatively process voice includes:
  • Steps 901-905 are similar to the above-mentioned steps 401-405, and the similarities are not repeated.
  • the execution subject changes.
  • the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a, and smart TV 103a to mobile phone 101c, tablet computer 102c, and smart TV 103c.
  • the answering device obtained by the cooperative answering election in step 903 is the smart TV 103c
  • the sound pickup device obtained by the cooperative voice pickup in step 904 is also the smart TV 103c, that is, the sound pickup device and the answering device are the same.
  • the difference is that the influence of internal noise on the sound pickup effect of the electronic devices is considered. Specifically, following the second type of cooperative sound pickup strategy in the above embodiment, the smart TV 103c, which has a relatively high voice signal-to-noise ratio and noise reduction capability, is elected as the sound pickup device.
  • this is because the smart TV 103c has the internal noise reduction capability or its AEC is in effect, while the mobile phone 101c and the tablet computer 102c do not have the internal noise reduction capability, have an internal noise reduction capability lower than that of the smart TV 103c, do not have the AEC capability, or have AEC that is not in effect.
  • it can be understood that an electronic device with SE processing capability, such as an electronic device with internal noise reduction capability or AEC capability, can use the noise reduction information of the audio it plays externally to eliminate the influence of the internal noise (that is, that audio) on its sound pickup effect, so that the picked-up voice data has better quality.
  • the answering device can also query which device is the internal noise device that is playing audio externally, so that the internal noise device can share its noise reduction information.
  • Step 906 The smart TV 103c queries, among the mobile phone 101c, the tablet computer 102c and the smart TV 103c, which device is playing audio externally, finds that it is the smart TV 103c itself, and the smart TV 103c, as the internal noise device, provides noise reduction information.
  • the smart TV 103c determines the electronic device that is playing audio by querying the speaker occupancy status of each device or the status of its audio/video software (such as whether the audio/video software is open and the volume of the electronic device). For example, if the smart TV 103c finds that its own speaker is in the occupied state, its volume is high (such as more than 60% of the maximum volume), or its audio/video software is open, it determines that the smart TV 103c itself is playing audio and will share noise reduction information.
  • the mobile phone 101c and the tablet computer 102c can report to the smart TV 103c through the wireless network information about whether they are in an external audio state, such as reporting information on speaker occupancy status, volume, and/or audio/video software status.
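  • The internal-noise query described above can be sketched as each device reporting a few status flags and the answering device applying a threshold rule; the field names and the use of the 60% volume example as a threshold are assumptions for illustration.
```python
# Sketch: decide which reported devices are playing audio externally.

VOLUME_THRESHOLD = 0.6  # "more than 60% of the maximum volume"

def is_playing_audio(report: dict) -> bool:
    return (report.get("speaker_occupied", False)
            or report.get("volume_ratio", 0.0) > VOLUME_THRESHOLD
            or report.get("av_software_open", False))

def find_internal_noise_devices(reports: dict) -> list:
    return [device for device, report in reports.items() if is_playing_audio(report)]

if __name__ == "__main__":
    reports = {
        "phone_101c": {"speaker_occupied": False, "volume_ratio": 0.2, "av_software_open": False},
        "tv_103c": {"speaker_occupied": True, "volume_ratio": 0.8, "av_software_open": True},
    }
    print(find_internal_noise_devices(reports))  # -> ['tv_103c']
```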
  • the smart TV 103c continues to play audio through the speaker while picking up the voice data corresponding to the voice command spoken by the user.
  • the smart TV 103c can perform noise reduction processing on the voice data corresponding to the subsequently picked up voice command after inquiring that it is an internal noise device.
  • Step 907 The smart TV 103c performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information.
  • Step 908 The smart TV 103c identifies the second voice data after noise reduction processing.
  • Step 909 the smart TV 103c responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
  • steps 908 and 909 are similar to the above steps 406 to 408, except that the voice data recognized by the answering device (ie, the smart TV 103c) is the voice data subjected to noise reduction processing through the noise reduction information.
  • the difference is that steps 906 and 907 are newly added, that is, the smart TV 103c, as the answering device, queries for the internal noise device; specifically, the smart TV 103c, which is playing audio, serves as the internal noise device to provide noise reduction information.
  • the noise reduction information supports the sound pickup device to perform noise reduction processing on the speech data corresponding to the subsequently picked up speech. Obviously, the sound pickup device and the internal noise device are the same in this scene.
  • it can be understood that an electronic device with internal noise reduction capability (that is, noise reduction capability) or with AEC in effect can introduce the audio data of the externally played audio into the noise reduction process, and alleviate the interference caused by the internal noise generated by playing that audio itself.
  • the above-mentioned internal noise device obtains noise reduction information based on the audio data of the external audio, such as the audio data itself (that is, the internal noise information), or, the voice activity detection (Voice Activity Detection, VAD) information corresponding to the audio (or called mute suppression information).
  • in this way, the electronic device (such as the smart TV 103c) can provide noise reduction information for the externally played audio and perform noise reduction processing on the internal noise using that information, so as to eliminate the influence of the internal noise on other voice data (such as the picked-up voice data of the user) and improve the quality of the picked-up voice data.
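  • The source does not specify the noise reduction algorithm, so the sketch below uses a generic textbook NLMS adaptive filter as a stand-in: the audio being played (the internal noise information) serves as the reference signal whose estimated leakage is subtracted from the microphone capture.
```python
# Sketch: cancel internally generated noise using the played audio as a reference
# via a basic NLMS adaptive filter (a stand-in for the unspecified processing).
import numpy as np

def nlms_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 64, mu: float = 0.5) -> np.ndarray:
    """Subtract the estimated leakage of `ref` from `mic`; `ref` must be at least as long as `mic`."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    padded_ref = np.concatenate([np.zeros(taps - 1), ref.astype(float)])
    for n in range(len(mic)):
        x = padded_ref[n:n + taps][::-1]       # most recent reference samples first
        echo_estimate = w @ x
        error = float(mic[n]) - echo_estimate  # enhanced output sample
        w += mu * error * x / (x @ x + 1e-8)   # normalized LMS weight update
        out[n] = error
    return out
```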
  • for example, if the smart TV 103c directly recognizes the second voice data it picks up, then, due to the influence of the externally played audio, the second voice data may be recognized as "How will the day be in Beijing tomorrow?", that is, the user's actual voice command "How will the weather in Beijing be tomorrow?" is not accurately recognized.
  • in contrast, the smart TV 103c can eliminate the influence of the externally played audio through the internal noise reduction information, so that the quality of the second voice data after noise reduction processing is higher and the accurate recognition result "What's the weather like in Beijing tomorrow?" can be obtained subsequently.
  • Step 910 is similar to the above-mentioned step 409 and will not be repeated here.
  • in other embodiments, the answering device can also obtain the noise reduction information of the externally played audio from the internal noise device, obtain the speech to be recognized that is directly picked up by the sound pickup device (that is, speech in which the internal noise of the external audio has not been eliminated), and then perform noise reduction processing on the acquired speech to be recognized according to the acquired noise reduction information.
  • in this way, the noise reduction processing can alleviate the influence of the internal noise of the electronic device that plays audio in the multi-device scenario on the sound pickup effect of multi-device cooperative sound pickup, thereby helping to ensure the speech recognition accuracy of the voice assistant. Further, the user experience in the speech recognition process is improved, and the environmental robustness of speech recognition in multi-device scenarios is improved.
  • when there is internal noise in the multi-device scenario, in order to avoid the influence of the internal noise on the sound pickup effect of cooperative sound pickup, the electronic device that plays audio externally can not only be used as the sound pickup device itself, but can also share the noise reduction information of the internal noise with another electronic device that serves as the sound pickup device, so that the sound pickup device can eliminate, across devices, the influence of the internal noise on the sound pickup effect according to the noise reduction information.
  • FIG. 10 shows another multi-device-based speech processing scenario under the interference of internal noise.
  • in this multi-device scenario (referred to as multi-device scenario 14), the mobile phone 101d and the tablet computer 102d are interconnected through a wireless network and deployed at positions 0.3 m and 0.6 m away from the user, respectively.
  • the mobile phone 101d is held by the user, and the tablet computer 102d is left idle on the desktop.
  • the tablet computer 102d is in the state of external audio playback, and has internal noise reduction (ie, noise reduction capability) or AEC capability. Therefore, in this scenario, the influence of the internal noise of the tablet computer 102d on the sound pickup effect of cooperative sound pickup in the multi-device scenario can be mainly considered.
  • FIG. 11 is a flowchart of a specific method for collaboratively processing speech in the multi-device scenario shown in FIG. 10 , including:
  • Steps 1101 to 1102 are similar to the above-mentioned steps 401 to 402, and the similarities are not repeated.
  • the difference is that the electronic devices interconnected through the wireless network in the multi-device scenario 14 are changed from a mobile phone 101c, a tablet computer 102c, and a smart TV 103c to a mobile phone 101d and a tablet computer 102d.
  • Step 1103 The mobile phone 101d and the tablet computer 102d elect the mobile phone 101d as the answering device and the sound pickup device.
  • Step 1104 The mobile phone 101d picks up the second voice data corresponding to the voice command spoken by the user.
  • the above steps 1103 to 1104 are similar to the above steps 403 to 404. The difference is that, after the answering device is obtained through the cooperative response election in step 1103, it can be directly decided that the answering device is the sound pickup device, without performing the sound pickup election strategy steps in the above embodiment to elect a sound pickup device; that is, the answering device and the sound pickup device are the same device, such as the mobile phone 101d.
  • Step 1105 The mobile phone 101d queries, among the mobile phone 101d and the tablet computer 102d, which device is playing audio externally, finds that it is the tablet computer 102d, and the tablet computer 102d, as the internal noise device, shares its noise reduction information.
  • step 1105 is similar to step 906, the difference is that in step 1105 the answering device finds out that the internal noise device (tablet computer 102d) sharing noise reduction information is different from the sound pickup device (mobile phone 101d). Therefore, in this embodiment, step 1106 is added to realize that the internal noise device shares the noise reduction information with the sound pickup device (ie, the mobile phone 101d).
  • specifically, after the mobile phone 101d, as the answering device, finds out that the tablet computer 102d is the internal noise device, it can send a noise reduction instruction to the tablet computer 102d, so that the tablet computer 102d shares noise reduction information with the sound pickup device (that is, the mobile phone 101d) according to the noise reduction instruction.
  • Step 1106 The tablet computer 102d sends the noise reduction information of the tablet computer 102d to the mobile phone 101d.
  • the tablet computer 102d can send the noise reduction information of the tablet computer 102d to the mobile phone 101d through the wireless network with the mobile phone 101d.
  • Step 1107 The mobile phone 101d performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information of the tablet computer 102d.
  • Step 1108 The mobile phone 101d identifies the second voice data after noise reduction processing.
  • Step 1109 The mobile phone 101d responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
  • the above steps 1107 to 1109 are similar to the above steps 907 to 909. The difference is that, in step 1107, the sound pickup device (that is, the mobile phone 101d) uses the noise reduction information of another device (that is, the tablet computer 102d) to perform noise reduction processing on the voice data corresponding to the voice it has picked up, realizing cross-device noise reduction processing.
  • for example, if the mobile phone 101d directly picks up the second voice data corresponding to the voice command, the quality of the second voice data is poor due to the influence of the audio played externally by the tablet computer 102d, so that the second voice data may be recognized as "How will it be in Beijing tomorrow?", which differs from the user's actual voice command "How will the weather in Beijing be tomorrow?". That is, because the quality of the second voice data picked up by the mobile phone 101d is poor, the subsequent recognition result of the second voice data is inaccurate.
  • in contrast, since the mobile phone 101d can perform noise reduction processing on the picked-up second voice data using the noise reduction information shared by the tablet computer 102d, the influence of the audio played by the tablet computer 102d on the sound pickup effect of the mobile phone 101d is eliminated, the quality of the second voice data after noise reduction processing is higher, and the second voice data is subsequently accurately recognized as "How is the weather in Beijing tomorrow?".
  • based on the multi-device voice processing method provided by the embodiments of the present application, having other devices assist the sound pickup device in performing noise reduction processing during the sound pickup process can effectively aggregate the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants, and further improve the accuracy of speech recognition in multi-device scenarios.
  • in the above embodiments, the sound pickup device elected from the multiple devices may have one or more favorable factors, such as being closest to the user, being farthest from the external noise source, and having the capability of reducing internal noise.
  • the impact of the deployment location of electronic devices, internal noise interference or external noise interference on the voice pickup effect of the voice assistant and the accuracy of speech recognition in the multi-device scenario can be alleviated, and the user interaction experience and the environment of speech recognition in multi-device scenarios can be improved. robustness.
  • FIG. 12 shows a schematic structural diagram of the electronic device 100 .
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2 , mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone jack 170D, sensor module 180, buttons 190, motor 191, indicator 192, camera 193, display screen 194, and Subscriber identification module (subscriber identification module, SIM) card interface 195 and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or less components than shown, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the processor 110 may be used to detect whether the electronic device 100 has picked up voice data corresponding to a wake-up word or voice command spoken by the user, and to obtain the sound information of the voice data, device status information, microphone module information, and the like.
  • the processor 110 can also perform actions such as the above-mentioned answering device election, sound pickup device election, or internal noise device inquiry according to the information of each electronic device (such as the sound pickup election information or the response election information).
  • the NPU is a neural-network (NN) computing processor.
  • intelligent cognition applications of the electronic device 100, such as image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
  • the NPU can support the electronic device 100 to recognize the voice data obtained by picking up the voice through the voice assistant.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100 .
  • the electronic device 100 may also adopt an interface connection manner different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and convert it into electromagnetic waves to be radiated through the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
  • the wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2 .
  • the above-mentioned antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, and other modules can be used to support the electronic device 100 in sending the sound information of voice data, device status information, and the like to other electronic devices in a multi-device scenario.
  • the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • the display screen 194 may be used to support the electronic device 100 to display a response interface in response to a user's voice command, and the response interface may include response text and other information.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100 .
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
  • an external memory card may be used to support the electronic device 100 to store the above-mentioned pick-up election information, answering election information, noise reduction information, and the like.
  • the internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data (such as audio data, phone book, etc.) created during the use of the electronic device 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the electronic device 100 may implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.
  • the audio module 170 is used for converting digital audio information into an analog audio signal output, and also for converting an analog audio input into a digital audio signal, such as converting the user's voice received by the electronic device 100 into a digital audio signal (ie, the voice data corresponding to the user's voice), or converting the audio generated by the voice assistant through TTS into the response voice. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can play music through the speaker 170A, play a hands-free call, or play the response voice corresponding to the user's voice command based on the voice assistant, such as the response voice "I'm here" to the wake-up word, or the response voice "It will be sunny in Beijing tomorrow" to the voice command "How is the weather in Beijing tomorrow?".
  • the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be listened to by placing the receiver 170B close to the human ear.
  • the microphone (ie, microphone module) 170C, also known as a "mic" or "sound transmitter", is used to convert sound signals into electrical signals, such as converting the wake-up word or voice command spoken by the user into an electrical signal (ie, the corresponding voice data).
  • the user can make a sound with the mouth close to the microphone 170C, so as to input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementation methods.
  • Embodiments of the present application may be implemented as a computer program or program code executing on a programmable system including at least one processor, a storage system (including volatile and nonvolatile memory and/or storage elements) , at least one input device, and at least one output device.
  • Program code may be applied to input instructions to perform the functions described herein and to generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), microcontroller, application specific integrated circuit (ASIC), or microprocessor.
  • the program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • the program code may also be implemented in assembly or machine language, if desired.
  • the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • the disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (eg, computer-readable) storage media, which can be read and executed by one or more processors.
  • the instructions may be distributed over a network or over other computer-readable media.
  • a machine-readable medium can include any mechanism for storing or transmitting information in a form readable by a machine (eg, a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet by means of electrical, optical, acoustical, or other forms of propagated signals (eg, carrier waves, infrared signals, digital signals, etc.).
  • machine-readable media includes any type of machine-readable media suitable for storing or transmitting electronic instructions or information in a form readable by a machine (eg, a computer).
  • each unit/module mentioned in each device embodiment of this application is a logical unit/module.
  • a logical unit/module may be a physical unit/module, or may be a part of a physical unit/module,
  • or may be implemented by a combination of multiple physical units/modules.
  • the physical implementation of these logical units/modules is not what matters most; the combination of the functions implemented by these logical units/modules is the key to solving the technical problems raised in this application.
  • the above-mentioned device embodiments of the present application do not introduce units/modules that are not closely related to solving the technical problems raised in the present application, but this does not mean that other units/modules do not exist in the above-mentioned device embodiments.
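The embodiments above do not fix a particular algorithm for removing the internally played audio from the picked-up signal. As a purely illustrative sketch (not the patented method itself), the following Python snippet assumes that the noise reduction information shared by the playing device is simply the raw reference samples of the audio it is playing, and removes that audio from the microphone signal of the pickup device with a normalized least-mean-squares (NLMS) adaptive filter, a standard building block of acoustic echo cancellation; all function and variable names are hypothetical.

```python
import numpy as np

def nlms_cancel(mic: np.ndarray, ref: np.ndarray,
                taps: int = 256, mu: float = 0.1, eps: float = 1e-6) -> np.ndarray:
    """Remove the shared reference audio (internal noise) from the mic signal.

    mic: signal picked up by the sound pickup device (user voice + played audio).
    ref: reference samples shared by the device that is playing the audio.
    Returns the residual signal, i.e. the mic signal with the reference removed.
    """
    assert len(ref) >= len(mic), "reference must cover the picked-up segment"
    w = np.zeros(taps)                        # adaptive estimate of the acoustic path
    out = np.zeros(len(mic))
    ref_pad = np.concatenate([np.zeros(taps - 1), ref])
    for n in range(len(mic)):
        x = ref_pad[n:n + taps][::-1]         # newest reference sample first
        y = w @ x                             # predicted leakage of the played audio
        e = mic[n] - y                        # residual: mostly the user's voice
        w += (mu / (eps + x @ x)) * e * x     # NLMS weight update
        out[n] = e
    return out
```

If voice activation detection (VAD) information is shared alongside the audio data, it could, for example, be used to skip adaptation on frames in which the played audio is silent; the disclosure leaves such details open.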

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

A multi-device voice processing method, a medium, an electronic device, and a system, capable of alleviating the influence that the internal noise of an electronic device playing audio externally in a multi-device scenario has on the sound pickup effect of a voice assistant, helping to ensure the accuracy of voice recognition of the voice assistant, and improving the environmental robustness of voice recognition in the multi-device scenario. The method comprises: a first electronic device among multiple electronic devices performs sound pickup to obtain a voice to be recognized; the first electronic device receives, from a second electronic device among the multiple electronic devices that is playing audio externally, audio information related to the audio played externally by the second electronic device; and the first electronic device performs, according to the received audio information, noise reduction processing on the voice to be recognized that is obtained by sound pickup.

Description

Multi-device-based voice processing method, medium, electronic device and system
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 11, 2020, with application number 202010955837.7 and entitled "Multi-device-based voice processing method, medium, electronic device and system", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to speech processing technology in the field of artificial intelligence, and in particular, to a multi-device-based voice processing method, medium, electronic device, and system.
Background
A voice assistant is an application (APP) built on artificial intelligence (AI). Smart devices such as mobile phones receive and recognize voice commands spoken by the user through the voice assistant, providing the user with voice control functions such as interactive dialogue, information query, and device control. With the widespread adoption of smart devices equipped with voice assistants, there are usually multiple devices with a voice assistant installed in the user's environment (such as the user's home). In this multi-device scenario, if several of the devices share the same wake-up word, then after the user speaks the wake-up word, the voice assistants of all the devices with that wake-up word will be woken up, and all of them will recognize and respond to the voice command the user subsequently speaks.
In the prior art, in a multi-device scenario, multiple devices can cooperate to select, from the multiple devices sharing the same wake-up word, the device closest to the user to wake up its voice assistant, so that this device picks up, recognizes, and responds to the user's voice command. However, if there is strong external noise near the selected device, or if the device has poor sound pickup capability, the accuracy of the selected device's recognition result for the voice command in the above automatic speech recognition process is low, and consequently the operation indicated by the voice command cannot be performed accurately.
Summary of the Invention
Embodiments of the present application provide a multi-device-based voice processing method, medium, electronic device, and system. The sound pickup device elected from the multiple devices may have one or more favorable factors, such as being closest to the user, being farthest from an external noise source, and having internal noise reduction capability, so that the influence of the deployment location of the electronic devices, internal noise interference, or external noise interference on the sound pickup effect of the voice assistant and on the accuracy of speech recognition in multi-device scenarios can be alleviated, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
第一方面,本申请实施例提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第一电子设备拾音得到第一待识别语音;第一电子设备从多个电子设备中外放音频的第二电子设备接收与第二电子设备外放的音频相关的音频信息;第一电子设备根据接收的音频信息对拾音得到的第一待识别语音进行降噪处理得到第二待识别语音。可以理解,用于拾音的电子设备(即第一电子设备)为下文中的拾音设备,如从该多设备中选取出的拾音效果较好的电子设备。上述外放音频的电子设备(即第二电子设备)即为该多设备中的内部噪声设备,第二电子设备外放的音频的音频信息即为下文中描述的内部噪声设备的降噪信息。具体地,第一电子设备通过第二电子设备外放音频的音频信息对拾音得到的第一待识别语音进行降噪处理得到第二待识别语音,可以缓解多设备场景中正在外放音频的电子设备的内部噪声对语音助手的拾音效果的影响,保证语音助手基于多设备的拾音效果,进而有利于保证语音助手的语音识别准确率,并提升了多设备场景中语音识别的环境鲁棒性。In a first aspect, an embodiment of the present application provides a multi-device-based voice processing method, the method includes: a first electronic device in a plurality of electronic devices picks up a voice to obtain a first to-be-recognized voice; The second electronic device that broadcasts audio in the electronic device receives audio information related to the audio broadcast by the second electronic device; Second, the voice to be recognized. It can be understood that the electronic device used for sound pickup (ie, the first electronic device) is the sound pickup device hereinafter, such as an electronic device with better sound pickup effect selected from the multiple devices. The above-mentioned electronic device that broadcasts audio (ie, the second electronic device) is the internal noise device in the multi-device, and the audio information of the external audio from the second electronic device is the noise reduction information of the internal noise device described below. Specifically, the first electronic device performs noise reduction processing on the first to-be-recognized voice obtained by picking up the sound by using the audio information of the audio played by the second electronic device to obtain the second to-be-recognized voice, which can alleviate the problem of audio being played out in a multi-device scenario. The influence of the internal noise of electronic equipment on the voice pickup effect of the voice assistant ensures the voice pickup effect of the voice assistant based on multiple devices, which in turn helps to ensure the voice recognition accuracy of the voice assistant, and improves the environmental robustness of voice recognition in multi-device scenarios. Awesome.
在上述第一方面的一种可能的实现中,上述音频信息包括以下至少一项:外放音频的音频数据,该音频对应的话音激活检测VAD信息。可以理解,该音频的音频信息可以反映该音频本身,通过该音频信 息对外放的该音频产生的内部噪声进行降噪处理,可以消除该内部噪声对其他语音数据(如用户拾取的语音数据,如第二待识别语音对应的语音数据)的影响,以提升拾取的语音数据的质量。In a possible implementation of the above-mentioned first aspect, the above-mentioned audio information includes at least one of the following: audio data of the externally played audio, and VAD information corresponding to the voice activation detection of the audio. It can be understood that the audio information of the audio can reflect the audio itself, and by performing noise reduction processing on the internal noise generated by the external audio, the internal noise can be eliminated to other voice data (such as the voice data picked up by the user, such as The influence of the voice data corresponding to the voice to be recognized) to improve the quality of the picked-up voice data.
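The form of the voice activation detection (VAD) information mentioned above is left open by the embodiments. As one hedged illustration only, the playing device could compute frame-level activity flags for its own playback signal with a simple short-time-energy detector such as the sketch below; the frame size and threshold are arbitrary assumptions, and the audio is assumed to be float samples in [-1, 1].

```python
import numpy as np

def frame_vad(audio: np.ndarray, frame_len: int = 320,
              threshold_db: float = -40.0) -> list[bool]:
    """Frame-level activity flags for the audio a device is playing out loud.

    A frame is flagged active when its RMS level exceeds `threshold_db`
    relative to full scale; these flags could be shared as the VAD part
    of the audio information.
    """
    flags = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2))) + 1e-12
        flags.append(20.0 * np.log10(rms) > threshold_db)
    return flags
```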
在上述第一方面的一种可能的实现中,上述方法还包括:第一电子设备向多个电子设备中用于识别语音的第三电子设备发送第二待识别语音;或者,第一电子设备对第二待识别语音进行识别。其中,用于识别语音的电子设备(即第三电子设备)可以为下文中的应答设备。可以理解,在本申请实施例的多设备场景中,用于识别语音的电子设备与用于拾音的电子设备可以相同或不同,即可以由第三电子设备将第一电子设备(或第一电子设备的麦克风模组)作为外设拾取用户的语音指令,如此可以有效聚合多个配备麦克风模组和语音助手的电子设备的外设资源。In a possible implementation of the above-mentioned first aspect, the above-mentioned method further includes: the first electronic device sends the second to-be-recognized speech to a third electronic device used for recognizing speech among the plurality of electronic devices; or, the first electronic device Recognizing the second speech to be recognized. Wherein, the electronic device for recognizing voice (ie, the third electronic device) may be the answering device hereinafter. It can be understood that in the multi-device scenario of this embodiment of the present application, the electronic device used for recognizing voice and the electronic device used for picking up sounds may be the same or different, that is, the first electronic device (or the first electronic device (or the first electronic device) may be The microphone module of the electronic device) is used as a peripheral device to pick up the user's voice command, so that the peripheral resources of multiple electronic devices equipped with a microphone module and a voice assistant can be effectively aggregated.
在上述第一方面的一种可能的实现中,在多个电子设备中的第一电子设备拾音得到第一待识别语音之前,上述方法还包括:第一电子设备向第三电子设备发送第一电子设备的拾音选举信息,其中第一电子设备的拾音选举信息用于表示第一电子设备的拾音情况;第一电子设备为第三电子设备基于获取的多个电子设备的拾音选举信息从多个电子设备中选举出的用于拾音的电子设备。例如,在本申请实施例的多设备场景中,用户说出一个语音指令之后,用户无需专门操作某个电子设备拾取待识别语音指令(如下文中的第二语音数据对应的语音指令),而是由应答设备(即第三电子设备)自动将拾音设备(即第二电子设备)作为外设拾取用户的语音指令,进而通过应答设备对用户的语音指令的响应实现语音控制功能。In a possible implementation of the above-mentioned first aspect, before the first electronic device among the plurality of electronic devices picks up the voice to obtain the first to-be-recognized voice, the above-mentioned method further includes: the first electronic device sends the first electronic device to the third electronic device. Pickup election information of an electronic device, wherein the pickup election information of the first electronic device is used to indicate the sound pickup situation of the first electronic device; the first electronic device is the third electronic device based on the acquired sound pickup of multiple electronic devices The election information is an electronic device selected from a plurality of electronic devices for pickup. For example, in the multi-device scenario of the embodiment of the present application, after the user speaks a voice command, the user does not need to specifically operate an electronic device to pick up the voice command to be recognized (such as the voice command corresponding to the second voice data below), but instead The answering device (ie the third electronic device) automatically uses the pickup device (ie the second electronic device) as a peripheral to pick up the user's voice command, and then implements the voice control function through the response device's response to the user's voice command.
在上述第一方面的一种可能的实现中,上述方法还包括:第一电子设备接收第三电子设备发送的拾音指令(即下文中的拾音指示),其中,该拾音指令用于指示第一电子设备拾音并向第三电子设备发送降噪处理后的待识别语音。如此,在拾音指令的指示下,第一电子设备可以获知其需要向第三电子设备发送拾音得到的待识别语音(如上述第二待识别语音),而不会对待识别语音进行识别等后续处理。In a possible implementation of the above-mentioned first aspect, the above-mentioned method further includes: the first electronic device receives a sound pickup instruction (that is, the sound pickup instruction hereinafter) sent by the third electronic device, wherein the sound pickup instruction is used for Instruct the first electronic device to pick up sound and send the noise-reduced voice to be recognized to the third electronic device. In this way, under the instruction of the pickup instruction, the first electronic device can know that it needs to send the to-be-recognized voice (such as the second to-be-recognized voice) obtained by the pickup to the third electronic device, but will not recognize the to-be-recognized voice, etc. Subsequent processing.
在上述第一方面的一种可能的实现中,上述拾音选举信息包括以下至少一项:回声消除AEC能力信息,麦克风模组信息,设备状态信息,拾音得到的对应唤醒词的语音信息,拾音得到的对应语音指令的语音信息;其中,该语音指令为拾音得到唤醒词之后拾音得到的;该设备状态信息包括以下至少一项:网络连接状态信息、耳机连接状态信息、麦克风占用状态信息、情景模式信息。可以理解,拾音选举信息中的不同信息,表示影响电子设备拾音效果的不同因素,如此,本申请实施例可以综合考虑对电子设备拾音效果的不同因素来选举出拾音设备,如选举出拾音效果最好的电子设备用于拾音,即作为多设备中的拾音设备。In a possible implementation of the above-mentioned first aspect, the above-mentioned voice pickup election information includes at least one of the following: echo cancellation AEC capability information, microphone module information, device status information, and voice information corresponding to the wake-up word obtained by voice pickup, The voice information corresponding to the voice command obtained by picking up the voice; wherein, the voice command is obtained by picking up the voice after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupation Status information, profile information. It can be understood that the different information in the sound-picking election information represents different factors that affect the sound-picking effect of the electronic equipment. In this way, the embodiment of the present application can comprehensively consider the different factors of the sound-picking effect of the electronic equipment to elect the sound-picking equipment, such as election. The electronic device with the best sound pickup effect is used for sound pickup, that is, as a pickup device in a multi-device.
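As an illustration of how the pickup election information listed above might be packaged when it is sent to the electing device, the following sketch groups the fields into a small message. The field names, types, and JSON encoding are assumptions made for illustration only and are not part of the disclosed method.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class DeviceStatus:
    network_ok: bool           # network connection state
    headset_connected: bool    # wired / Bluetooth headset state
    mic_in_use: bool           # microphone occupancy
    profile_mode: str          # e.g. "normal", "airplane", "driving"

@dataclass
class PickupElectionInfo:
    device_id: str
    aec_effective: bool        # echo cancellation (AEC) capability / state
    mic_count: int             # microphone module information
    mic_noise_reduction: bool  # whether the mic module supports noise reduction
    status: DeviceStatus
    wakeword_snr_db: float     # quality of the picked-up wake-up word
    command_snr_db: Optional[float] = None  # quality of the later voice command, if any

def to_message(info: PickupElectionInfo) -> bytes:
    """Serialize the election information for sending to the electing device."""
    return json.dumps(asdict(info)).encode("utf-8")
```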
第二方面,本申请实施例提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第二电子设备外放音频;第二电子设备向多个电子设备中用于拾音的第一电子设备发送与该音频相关的音频信息,其中,该音频信息能够被第一电子设备用于对第一电子设备拾音得到的待识别音频进行降噪处理。具体地,由于正在外放音频的电子设备可以提供该音频的音频信息,使得用于拾音的第一电子设备根据通过该音频信息对拾音得到的第一待识别语音进行降噪处理,实现消除该音频产生的内部噪声对拾音的影响,以提升第一电子设备的拾音效果,即提高拾音得到的语音数据(即第二待识别语音的语音数据)的质量。从而,可以缓解多设备场景中正在外放音频的电子设备的内部噪声对语音助手的拾音效果的影响,保证语音助手基于多设备的拾音效果,进而有利于保证语音助手的语音识别准确率,并提升了多设备场景中语音识别的环境鲁棒性。In a second aspect, an embodiment of the present application provides a multi-device-based voice processing method, the method includes: a second electronic device in the plurality of electronic devices plays audio; the second electronic device is used in the plurality of electronic devices for The first electronic device that picks up the sound sends audio information related to the audio, where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-identified audio that is picked up by the first electronic device. Specifically, since the electronic device that is playing the audio can provide the audio information of the audio, the first electronic device for picking up the sound performs noise reduction processing on the first to-be-recognized voice obtained by the sound pickup according to the audio information, so as to realize The influence of the internal noise generated by the audio on the sound pickup is eliminated to improve the sound pickup effect of the first electronic device, that is, to improve the quality of the voice data obtained by the sound pickup (ie, the voice data of the second to-be-recognized voice). Therefore, the influence of the internal noise of the electronic device that is playing audio externally on the sound pickup effect of the voice assistant in the multi-device scenario can be alleviated, and the sound pickup effect of the voice assistant based on multiple devices can be ensured, thereby helping to ensure the voice recognition accuracy of the voice assistant. , and improve the environmental robustness of speech recognition in multi-device scenarios.
在上述第二方面的一种可能的实现中,上述音频信息包括以下至少一项:该音频的音频数据,该音频对应的话音激活检测VAD信息。In a possible implementation of the second aspect, the audio information includes at least one of the following: audio data of the audio, and voice activation detection VAD information corresponding to the audio.
在上述第二方面的一种可能的实现中,上述方法还包括:第二电子设备从多个电子设备中用于识 别语音的第三电子设备接收共享指令(即下文中的降噪指示);或者第二电子设备从第一电子设备接收共享指令;其中,共享指令用于指示第二电子设备向第一电子设备发送上述音频信息。可以理解,发送共享指令的电子设备(如第一电子设备或第三电子设备),可以监测第二电子设备是否正在外放音频,在第二电子设备外放音频时,才向第二电子设备发送共享指令。In a possible implementation of the above-mentioned second aspect, the above-mentioned method further includes: the second electronic device receives a shared instruction (ie, the noise reduction instruction hereinafter) from a third electronic device used for recognizing voice among the plurality of electronic devices; Or the second electronic device receives a sharing instruction from the first electronic device; wherein, the sharing instruction is used to instruct the second electronic device to send the audio information to the first electronic device. It can be understood that the electronic device (such as the first electronic device or the third electronic device) that sends the sharing instruction can monitor whether the second electronic device is playing audio, and only transmit the audio to the second electronic device when the second electronic device is playing the audio. Send a share command.
在上述第二方面的一种可能的实现中,上述在第二电子设备向多个电子设备中用于拾音的第一电子设备发送与外放的音频相关的音频信息之前,方法还包括:第二电子设备向第三电子设备发送第二电子设备的拾音选举信息,其中第二电子设备的拾音选举信息用于表示第二电子设备的拾音情况;第一电子设备为第三电子设备基于获取的多个电子设备的拾音选举信息从多个电子设备中选举出的用于拾音的电子设备。例如,第三电子设备作为下文中的应答设备可以选举出拾取语音指令的音频质量最好(即拾音最好的电子设备)的电子设备作为拾音设备(如第一电子设备),以支持应答设备通过语音助手完成与用户的语音交互流程,例如拾音设备可以为距离用户最近且SE处理能力较优的电子设备。如此,可以有效聚合多个配备麦克风模组和语音助手的电子设备的外设资源,缓解多设备场景中由于电子设备部署位置对语音助手识别准确率的影响,提升了用户交互体验和多设备场景中语音识别的环境鲁棒性。In a possible implementation of the above-mentioned second aspect, before the above-mentioned second electronic device sends the audio information related to the externally played audio to the first electronic device used for sound pickup among the plurality of electronic devices, the method further includes: The second electronic device sends the voice-picking election information of the second electronic device to the third electronic device, wherein the voice-picking election information of the second electronic device is used to indicate the voice-picking situation of the second electronic device; the first electronic device is the third electronic device The device elects an electronic device for sound pickup from the plurality of electronic devices based on the acquired sound pickup election information of the plurality of electronic devices. For example, the third electronic device as the answering device hereinafter may elect the electronic device with the best audio quality for picking up the voice command (that is, the electronic device with the best sound pickup) as the sound pickup device (such as the first electronic device) to support The answering device completes the voice interaction process with the user through the voice assistant. For example, the pickup device may be an electronic device that is closest to the user and has better SE processing capability. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, the impact of the deployment location of electronic devices on the recognition accuracy of voice assistants in multi-device scenarios is alleviated, and the user interaction experience and multi-device scenarios are improved. Environmental Robustness for Speech Recognition in China.
第三方面,本申请实施例提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第三电子设备监测到多个电子设备中存在正在外放音频的第二电子设备;在第二电子设备与第三电子设备不同的情况下,第三电子设备向第二电子设备发送共享指令,其中共享指令用于指示第二电子设备向多个设备中用于拾音的第一电子设备发送与第二电子设备外放的音频相关的音频信息;在第二电子设备与第三电子设备相同的情况下,第三电子设备向第一电子设备发送该音频信息;其中,该音频信息能够被第一电子设备用于对第一电子设备拾音得到的第一待识别语音进行降噪处理得到第二待识别语音。具体地,由于在第三电子设备的指示下正在外放音频的第二电子设备可以提供该音频的音频信息,使得用于拾音的第一电子设备根据该音频信息对拾音得到的第一待识别语音进行降噪处理,实现消除该音频产生的内部噪声对拾音的影响,以提升第一电子设备的拾音效果,即提高拾音得到的语音数据(即第二待识别语音的语音数据)的质量。从而,可以缓解多设备场景中正在外放音频的电子设备的内部噪声对语音助手的拾音效果的影响,保证语音助手基于多设备的拾音效果,进而有利于保证语音助手的语音识别准确率,并提升了多设备场景中语音识别的环境鲁棒性。In a third aspect, an embodiment of the present application provides a multi-device-based voice processing method, the method comprising: a third electronic device in the plurality of electronic devices detects that there is a second electronic device in the plurality of electronic devices that is playing audio device; in the case that the second electronic device is different from the third electronic device, the third electronic device sends a sharing instruction to the second electronic device, wherein the sharing instruction is used to instruct the second electronic device to send a voice pickup device to the plurality of devices. The first electronic device sends audio information related to the audio played by the second electronic device; if the second electronic device is the same as the third electronic device, the third electronic device sends the audio information to the first electronic device; wherein, The audio information can be used by the first electronic device to perform noise reduction processing on the first to-be-recognized voice picked up by the first electronic device to obtain the second to-be-recognized voice. Specifically, because the second electronic device that is playing audio externally under the instruction of the third electronic device can provide the audio information of the audio, the first electronic device used for picking up the sound can obtain the first sound picked up by the audio information according to the audio information. Noise reduction processing is performed on the voice to be recognized, so as to eliminate the influence of the internal noise generated by the audio on the pickup, so as to improve the pickup effect of the first electronic device, that is, to improve the voice data obtained by pickup (ie, the voice of the second to-be-recognized voice). data) quality. Therefore, the influence of the internal noise of the electronic device that is playing audio externally on the sound pickup effect of the voice assistant in the multi-device scenario can be alleviated, and the sound pickup effect of the voice assistant based on multiple devices can be ensured, thereby helping to ensure the voice recognition accuracy of the voice assistant. , and improve the environmental robustness of speech recognition in multi-device scenarios.
在上述第三方面的一种可能的实现中,上述音频信息包括以下至少一项:该音频的音频数据,该音频对应的话音激活检测VAD信息。In a possible implementation of the third aspect, the audio information includes at least one of the following: audio data of the audio, and voice activation detection VAD information corresponding to the audio.
在上述第三方面的一种可能的实现中,第一电子设备与第三电子设备不同,并且上述方法还包括:第三电子设备从第一电子设备获取由第一电子设备拾音得到的第二待识别语音;第一电子设备对第二待识别语音进行识别。进而,有利于提升语音控制过程中语音识别的准确性,并提升用户体验。如此,即使多设备场景中选举出的应答设备(如距离用户最近的第三电子设备)拾音效果较差,或存在正在外放音频的电子设备产生的噪声,多个设备也可以协同拾取并识别音频质量较好的语音数据,而无需用户移动位置或手动控制特定的电子设备拾音。In a possible implementation of the above third aspect, the first electronic device is different from the third electronic device, and the above method further includes: the third electronic device obtains, from the first electronic device, the first electronic device obtained by picking up the sound of the first electronic device. Second, the voice to be recognized; the first electronic device recognizes the second voice to be recognized. Further, it is beneficial to improve the accuracy of speech recognition during the speech control process, and to improve user experience. In this way, even if the selected answering device in the multi-device scenario (such as the third electronic device closest to the user) has poor sound pickup effect, or there is noise generated by the electronic device that is playing audio, multiple devices can cooperate to pick up and Recognize voice data with better audio quality without requiring the user to move location or manually control the pickup of specific electronic devices.
在上述第三方面的一种可能的实现中,在第三电子设备向第二电子设备发送共享指令之前,上述方法还包括:第三电子设备获取多个电子设备的拾音选举信息,其中该多个电子设备的拾音选举信息用于表示该多个电子设备的拾音情况;第三电子设备基于该多个设备的拾音选举信息,从该多个电子设备中选举出至少一个电子设备作为第一电子设备。如此,可以有效聚合多个配备麦克风模组和语音助手的电子设备的外设资源,缓解多设备场景中由于电子设备部署位置、内部噪声干扰、外部噪声干扰等多种 因素对语音助手识别准确率的影响,提升了用户交互体验和多设备场景中语音识别的环境鲁棒性。In a possible implementation of the above third aspect, before the third electronic device sends the sharing instruction to the second electronic device, the above method further includes: the third electronic device acquires voice selection information of multiple electronic devices, wherein the The sound pickup election information of the plurality of electronic devices is used to indicate the sound pickup situation of the plurality of electronic devices; the third electronic device elects at least one electronic device from the plurality of electronic devices based on the sound pickup election information of the plurality of devices as the first electronic device. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, so as to alleviate the recognition accuracy of voice assistants due to various factors such as the deployment location of electronic devices, internal noise interference, and external noise interference in multi-device scenarios. The impact of this improves the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
在上述第三方面的一种可能的实现中,上述方法还包括:第三电子设备向第一电子设备发送拾音指令,其中,该拾音指令用于指示第一电子设备拾音并向第三电子设备发送拾音得到的第二待识别语音。可以理解,在上述拾音指令的指示下,使得第一电子设备可以获知需要向第三电子设备发送拾音得到的待识别语音,而不会对待识别语音进行识别等后续处理。In a possible implementation of the above third aspect, the above method further includes: the third electronic device sends a sound pickup instruction to the first electronic device, wherein the sound pickup instruction is used to instruct the first electronic device to pick up sound and send it to the first electronic device. The third electronic device sends the second voice to be recognized obtained by picking up the voice. It can be understood that, under the instruction of the above voice pickup instruction, the first electronic device can know the to-be-recognized voice that needs to be sent to the third electronic device by picking up the voice, without performing subsequent processing such as recognizing the to-be-recognized voice.
在上述第三方面的一种可能的实现中,上述拾音选举信息包括以下至少一项:回声消除AEC能力信息,麦克风模组信息,设备状态信息,拾音得到的对应唤醒词的语音信息,拾音得到的对应语音指令的语音信息;其中,该语音指令为拾音得到唤醒词之后拾音得到的;该设备状态信息包括以下至少一项:网络连接状态信息、耳机连接状态信息、麦克风占用状态信息、情景模式信息。In a possible implementation of the above-mentioned third aspect, the above-mentioned voice pickup election information includes at least one of the following: echo cancellation AEC capability information, microphone module information, device status information, and voice information corresponding to the wake-up word obtained by voice pickup, The voice information corresponding to the voice command obtained by picking up the voice; wherein, the voice command is obtained by picking up the voice after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupation Status information, profile information.
在上述第三方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:在第三电子设备处于预设网络状态的情况下,则第三电子设备将第三电子设备确定为第一电子设备;在第三电子设备已连接耳机的情况下,则第三电子设备将第三电子设备确定为第一电子设备;第三电子设备将多个电子设备中处于预设情景模式的电子设备中的至少一个确定为第一电子设备。可以理解,如果电子设备处于不利于电子设备拾音的设备状态,如电子设备网络连接状态较差、已连接有线或无线耳机、麦克风已经被占用或处于飞行模式,说明该电子设备的拾音效果难以保证,或者该电子设备不能正常与其他设备协同拾音,如不能正常将拾音得到的语音数据发送给其他电子设备。如此,按照上述拾音设备的选择步骤可以选取出拾音效果较好的拾音设备(如上述第一电子设备)。In a possible implementation of the above-mentioned third aspect, the above-mentioned third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to the headset, the third electronic device determines the third electronic device as the first electronic device; The third electronic device determines the third electronic device as the first electronic device; the third electronic device determines at least one of the electronic devices in the preset scene mode among the plurality of electronic devices as the first electronic device. It can be understood that if the electronic device is in a device state that is not conducive to the sound pickup of the electronic device, such as the electronic device has a poor network connection status, a wired or wireless headset is connected, the microphone is already occupied, or is in flight mode, it means the sound pickup effect of the electronic device. It is difficult to guarantee, or the electronic device cannot normally cooperate with other devices to pick up sounds, for example, it cannot normally send the voice data obtained by picking up sounds to other electronic devices. In this way, a sound pickup device (such as the above-mentioned first electronic device) with better sound pickup effect can be selected according to the above-mentioned selection steps of the sound pickup device.
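Reusing the PickupElectionInfo sketch above, one possible way to apply the device-state checks described in the preceding paragraph is to filter out candidates whose reported state is unfavorable for cooperative sound pickup before running the election. This is only one illustrative policy following the explanatory remarks above (the claim language also covers cases where the electing device falls back to itself); the concrete rules are assumptions.

```python
UNFAVOURABLE_MODES = {"airplane", "subway", "driving", "travel"}  # assumed preset profile modes

def eligible_pickup_candidates(candidates: list) -> list:
    """Drop devices whose reported state is unfavourable for cooperative pickup."""
    eligible = []
    for c in candidates:                     # c: PickupElectionInfo from the sketch above
        s = c.status
        if not s.network_ok:                 # poor network: cannot reliably send picked-up audio
            continue
        if s.headset_connected:              # audio is routed to a headset
            continue
        if s.mic_in_use:                     # microphone already occupied
            continue
        if s.profile_mode in UNFAVOURABLE_MODES:
            continue
        eligible.append(c)
    return eligible
```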
在上述第三方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:第三电子设备将多个电子设备中AEC生效的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中降噪能力大于满足预定降噪条件的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与用户之间的距离小于第一预定距离的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与外部噪声源之间的距离大于第二预定距离的电子设备中的至少一个作为第一电子设备。例如,预定降噪条件表示电子设备SE处理效果较好,如AEC生效或具备内部降噪能力;第一预定距离(如0.5m)说明电子设备距离用户较近;第二预定距离(如3m)说明电子设备距离用户较远。可以理解,通常来说距离用户越近的电子设备的拾音效果越好,距离外部噪声较远的电子设备拾音效果较好;麦克风模组的降噪性能较好或AEC生效的电子设备,说明电子设备的SE处理效果越好,即该电子设备的拾音效果越好。因此,综合考虑这些因素可以从多个设备中选举出拾音效果较好的拾音设备(即上述第一电子设备)。In a possible implementation of the above-mentioned third aspect, the above-mentioned third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the plurality of electronic devices as the first electronic device; At least one of the electronic devices is used as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance from the multiple electronic devices to the user is less than the first predetermined distance as the first electronic device; the third electronic device At least one of the plurality of electronic devices whose distance from the external noise source is greater than the second predetermined distance is used as the first electronic device. For example, the predetermined noise reduction condition indicates that the electronic device SE has a better processing effect, such as AEC is effective or has internal noise reduction capability; the first predetermined distance (eg 0.5m) indicates that the electronic device is relatively close to the user; the second predetermined distance (eg 3m) Indicates that the electronic device is far away from the user. It can be understood that, generally speaking, the sound pickup effect of electronic equipment closer to the user is better, and the sound pickup effect of electronic equipment farther away from external noise is better; the noise reduction performance of the microphone module is better or the electronic equipment with AEC effective, It shows that the better the SE processing effect of the electronic device is, that is, the better the sound pickup effect of the electronic device is. Therefore, considering these factors comprehensively, a sound pickup device (ie, the above-mentioned first electronic device) with better sound pickup effect can be selected from multiple devices.
在上述第三方面的一种可能的实现中,预设网络状态包括下列至少一项:网络通信速率小于或等于预定速率的网络,网络电线频次大于或等于预定频次;预设情景模式包括下列至少一项:地铁模式、飞行模式、驾驶模式、旅行模式。其中,若网络通信速率小于或等于预定速率的网络,网络电线频次大于或等于预定频次,则说明电子设备的网络通信速率较差,预定速率和预定频次具体取值可以根据实际需求确定。可以理解,预设网络状态下的电子设备通常不适于参与拾音设备的选举或作为拾音设备(如用于拾音的第一电子设备)。In a possible implementation of the above third aspect, the preset network state includes at least one of the following: a network with a network communication rate less than or equal to a predetermined rate, and a network wire frequency greater than or equal to a predetermined frequency; the preset scene mode includes at least one of the following One: Subway Mode, Airplane Mode, Driving Mode, Travel Mode. Among them, if the network communication rate is less than or equal to the predetermined rate, and the frequency of the network wire is greater than or equal to the predetermined frequency, it means that the network communication rate of the electronic device is poor, and the specific values of the predetermined rate and the predetermined frequency can be determined according to actual needs. It can be understood that the electronic device in the preset network state is generally not suitable for participating in the election of a sound pickup device or as a sound pickup device (eg, the first electronic device for sound pickup).
在上述第三方面的一种可能的实现中,第三电子设备采用神经网络算法或决策树算法从多个电子设备中选举出第一电子设备。可以理解,多个设备的拾音选举信息可以作为神经网络算法或决策树算法的输入,并基于神经网络算法或决策树算法输出决策第一电子设备为拾音设备的结果。In a possible implementation of the above third aspect, the third electronic device uses a neural network algorithm or a decision tree algorithm to select the first electronic device from multiple electronic devices. It can be understood that the pickup election information of multiple devices can be used as the input of the neural network algorithm or the decision tree algorithm, and the result of deciding that the first electronic device is the pickup device is output based on the neural network algorithm or the decision tree algorithm.
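The embodiments name a neural network or decision tree algorithm for the election but give no further details. As an assumed stand-in that uses the same kinds of factors (AEC state, noise reduction capability, distance to the user, distance to external noise sources, measured wake-word quality), a simple hand-written score can rank the eligible candidates; the weights and field names below are illustrative only.

```python
def elect_pickup_device(candidates: list[dict]) -> str:
    """Return the device_id of the candidate with the best pickup score.

    Each candidate dict is assumed to carry: 'device_id', 'aec_effective',
    'noise_reduction', 'distance_to_user_m', 'distance_to_noise_m' and
    'wakeword_snr_db'.
    """
    def score(c: dict) -> float:
        s = c["wakeword_snr_db"]                       # measured pickup quality
        s += 5.0 if c["aec_effective"] else 0.0        # AEC in effect
        s += 3.0 if c["noise_reduction"] else 0.0      # internal noise reduction capability
        s -= 4.0 * c["distance_to_user_m"]             # closer to the user is better
        s += 1.0 * min(c["distance_to_noise_m"], 5.0)  # farther from external noise is better
        return s

    return max(candidates, key=score)["device_id"]
```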
第四方面,本申请提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第三电子设备获取多个电子设备的拾音选举信息,其中拾音选举信息用于表示多个电子设备的拾音情况;第三电子设备基于多个设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为用于拾音的第一电子设备,其中第一电子设备与第三电子设备相同或者不同;第三电子设备从第一电子设备获取由第一电子设备拾音得到的待识别语音;第三电子设备对获取的待识别语音进行识别。从而,即使多设备场景中选举出的第三电子设备(如距离用户最近的电子设备)拾音效果较差,多个设备也可以协同拾取并识别音频质量较好的语音数据,而无需用户移动位置或手动控制特定的电子设备拾音。进而,有利于提升语音控制过程中语音识别的准确性,并提升用户体验。并且,可以缓解多设备场景中由于电子设备部署位置、外部噪声干扰等多种因素对语音助手拾音效果,以及语音识别准确率的影响,提升了用户交互体验和多设备场景中语音识别的环境鲁棒性。In a fourth aspect, the application provides a multi-device-based voice processing method, the method comprising: a third electronic device in the plurality of electronic devices obtains voice-picking election information of the plurality of electronic devices, wherein the voice-picking election information is used for Indicates the sound pickup situation of the plurality of electronic devices; the third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device for sound pickup based on the sound pickup election information of the plurality of devices, wherein the first electronic device is used for sound pickup. The electronic device is the same as or different from the third electronic device; the third electronic device acquires the voice to be recognized obtained by the first electronic device from the first electronic device; and the third electronic device recognizes the acquired voice to be recognized. Therefore, even if the selected third electronic device (such as the electronic device closest to the user) in the multi-device scenario has poor sound pickup effect, multiple devices can collaboratively pick up and recognize voice data with better audio quality without the need for the user to move. position or manually control the pickup of specific electronics. Further, it is beneficial to improve the accuracy of speech recognition during the speech control process, and to improve user experience. In addition, it can alleviate the influence of various factors such as the deployment location of electronic devices, external noise interference and other factors on the voice pickup effect of the voice assistant and the accuracy of speech recognition in multi-device scenarios, improving user interaction experience and the environment for speech recognition in multi-device scenarios. robustness.
在上述第四方面的一种可能的实现中,上述拾音选举信息包括以下至少一项:回声消除AEC能力信息,麦克风模组信息,设备状态信息,拾音得到的对应唤醒词的语音信息,拾音得到的对应语音指令的语音信息;其中,该语音指令为拾音得到唤醒词之后拾音得到的;该设备状态信息包括以下至少一项:网络连接状态信息、耳机连接状态信息、麦克风占用状态信息、情景模式信息。In a possible implementation of the above-mentioned fourth aspect, the above-mentioned voice-collecting election information includes at least one of the following: echo cancellation AEC capability information, microphone module information, device status information, and voice information corresponding to the wake-up word obtained by voice-collecting, The voice information corresponding to the voice command obtained by picking up the voice; wherein, the voice command is obtained by picking up the voice after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupation Status information, profile information.
在上述第四方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:在第三电子设备处于预设网络状态的情况下,则第三电子设备将第三电子设备确定为第一电子设备;在第三电子设备已连接耳机的情况下,则第三电子设备将第三电子设备确定为第一电子设备;第三电子设备将多个电子设备中处于预设情景模式的电子设备中的至少一个确定为第一电子设备。In a possible implementation of the fourth aspect, the third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to the headset, the third electronic device determines the third electronic device as the first electronic device; The third electronic device determines the third electronic device as the first electronic device; the third electronic device determines at least one of the electronic devices in the preset scene mode among the plurality of electronic devices as the first electronic device.
在上述第四方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:第三电子设备将上述多个电子设备中AEC生效的电子设备中的至少一个作为第一电子设备;第三电子设备将上述多个电子设备中降噪能力大于满足预定降噪条件的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与用户之间的距离小于第一预定距离的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与外部噪声源之间的距离大于第二预定距离的电子设备中的至少一个作为第一电子设备。In a possible implementation of the fourth aspect, the third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of the following: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the above-mentioned multiple electronic devices as the first electronic device; At least one of the electronic devices of the condition is used as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance from the plurality of electronic devices to the user is less than the first predetermined distance as the first electronic device; the third electronic device The electronic device uses at least one of the plurality of electronic devices whose distance from the external noise source is greater than the second predetermined distance as the first electronic device.
在上述第四方面的一种可能的实现中,上述预设网络状态包括下列至少一项:网络通信速率小于或等于预定速率的网络,网络电线频次大于或等于预定频次;预设情景模式包括下列至少一项:地铁模式、飞行模式、驾驶模式、旅行模式。In a possible implementation of the fourth aspect, the preset network state includes at least one of the following: a network with a network communication rate less than or equal to a predetermined rate, and a network wire frequency greater than or equal to a predetermined frequency; the preset scene mode includes the following At least one of: Subway Mode, Airplane Mode, Driving Mode, Travel Mode.
在上述第四方面的一种可能的实现中,第三电子设备采用神经网络算法或决策树算法从多个电子设备中选举出第一电子设备。In a possible implementation of the above fourth aspect, the third electronic device uses a neural network algorithm or a decision tree algorithm to select the first electronic device from multiple electronic devices.
在上述第四方面的一种可能的实现中,上述方法还包括:第三电子设备监测到多个电子设备中存在正在外放音频的第二电子设备;第三电子设备向第二电子设备发送共享指令,其中共享指令用于指示第二电子设备向第一电子设备发送第二电子设备外放的音频相关的音频信息,其中该音频信息能够被第一电子设备用于对第一电子设备拾音得到的待识别音频进行降噪处理。In a possible implementation of the above fourth aspect, the above method further includes: the third electronic device detects that there is a second electronic device that is playing audio externally among the plurality of electronic devices; the third electronic device sends a message to the second electronic device. A sharing instruction, wherein the sharing instruction is used to instruct the second electronic device to send to the first electronic device audio information related to the audio played by the second electronic device, wherein the audio information can be used by the first electronic device to pick up the first electronic device. The to-be-identified audio obtained from the sound is subjected to noise reduction processing.
在上述第四方面的一种可能的实现中,第三电子设备与第一电子设备不同,并且方法还包括:第三电子设备外放音频;第三电子设备向第一电子设备发送第三电子设备外放音频相关的音频信息,其中该音频信息能够被第一电子设备用于对第一电子设备拾音得到的待识别音频进行降噪处理。In a possible implementation of the above-mentioned fourth aspect, the third electronic device is different from the first electronic device, and the method further includes: the third electronic device plays external audio; the third electronic device sends the third electronic device to the first electronic device The device broadcasts audio-related audio information, wherein the audio information can be used by the first electronic device to perform noise reduction processing on to-be-identified audio obtained by the first electronic device.
在上述第四方面的一种可能的实现中,上述音频信息包括以下至少一项:外放音频的音频数据, 该音频对应的话音激活检测VAD信息。In a possible implementation of the fourth aspect, the audio information includes at least one of the following: audio data of the external audio, and voice activation detection VAD information corresponding to the audio.
In a fifth aspect, the present application provides an apparatus. The apparatus is included in an electronic device and has the functionality to implement the behavior of the electronic device in the above aspects and their possible implementations. The functionality may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the above functionality, for example a sound pickup unit or module (such as a microphone or microphone array), a receiving unit or module (such as a transceiver), and a noise reduction module or unit (such as a processor providing the function of that module or unit). For example, the sound pickup unit or module is configured to support a first electronic device among a plurality of electronic devices in picking up sound to obtain first to-be-recognized speech; the receiving unit or module (such as a transceiver) is configured to support the first electronic device in receiving, from a second electronic device that is playing audio among the plurality of electronic devices, audio information related to the audio played by the second electronic device; and the noise reduction module or unit is configured to support the first electronic device in performing noise reduction on the picked-up first to-be-recognized speech according to the audio information received by the receiving unit or module, to obtain second to-be-recognized speech.
In a sixth aspect, the present application provides a readable medium storing instructions that, when executed on an electronic device, cause the electronic device to perform the multi-device-based voice processing method of any of the first to fourth aspects.
In a seventh aspect, the present application provides an electronic device including one or more processors and one or more memories. The one or more memories store one or more programs that, when executed by the one or more processors, cause the electronic device to perform the multi-device-based voice processing method of any of the first to fourth aspects. In a possible implementation, the electronic device may further include a transceiver (which may be a separate or an integrated receiver and transmitter) for receiving and sending signals or data.
In an eighth aspect, the present application provides an electronic device including a processor, a memory, a communication interface, and a communication bus. The memory is configured to store at least one instruction; the at least one processor, the memory, and the communication interface are connected through the communication bus; and the at least one processor executes the at least one instruction stored in the memory to cause the electronic device to perform the multi-device-based voice processing method of any of the first to fourth aspects.
Description of the drawings
FIG. 1 is a schematic diagram of a multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a voice assistant interaction session according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a multi-device-based voice processing method according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an electronic device according to some embodiments of the present application.
Detailed description
Illustrative embodiments of the present application include, but are not limited to, multi-device-based voice processing methods, media, and electronic devices. The multi-device scenarios to which the multi-device-based voice processing provided by the embodiments of the present application applies are described in detail below with reference to the accompanying drawings.
FIG. 1 shows a multi-device scenario to which the multi-device-based voice processing provided by an embodiment of the present application applies. As shown in FIG. 1, for ease of description, the multi-device scenario 10 shows only three electronic devices, namely electronic device 101, electronic device 102, and electronic device 103. It can be understood, however, that a multi-device scenario to which the technical solution of the present application applies may include any number of electronic devices, not limited to three.
Specifically, still referring to FIG. 1, after the user speaks the wake-up word, an answering device may be elected from among the plurality of electronic devices; for example, electronic device 101 is elected as the answering device. The answering device then elects, from among the plurality of devices, the pickup device with the best sound pickup effect (for example, the electronic device with the best speech enhancement effect); for example, electronic device 101 elects electronic device 103 as the pickup device. After the pickup device (such as electronic device 103) picks up the voice data corresponding to the user's voice command, the answering device (such as electronic device 101) receives, recognizes, and responds to that voice data, so that the voice data processed by the answering device is of better quality. In addition, if an internal-noise device that is playing audio exists near the pickup device in this scenario, the voice data picked up by the pickup device can be denoised according to the noise reduction information of that internal-noise device, further improving the quality of the voice data processed by the answering device. Thus, even if the answering device elected in the multi-device scenario (such as the electronic device closest to the user) has a poor pickup effect, or there is noise produced by an electronic device that is playing audio, the multiple devices can cooperatively pick up and recognize voice data of good audio quality without requiring the user to move or to manually control a specific electronic device to pick up sound. This helps improve the accuracy of speech recognition during voice control and improves the user experience.
In some embodiments, the electronic devices 101-103 in the multi-device scenario 10 are interconnected through a wireless network, for example Wi-Fi (Wireless Fidelity), Bluetooth (BT), or Near Field Communication (NFC), but are not limited thereto. As an example, to enable interconnection through a wireless network, the electronic devices 101-103 satisfy at least one of the following conditions (a minimal check of which is sketched after the list):
1) They are connected to the same wireless access point (such as the same Wi-Fi access point);
2) They are logged in to the same account;
3) They are set in the same device group; for example, each device in the group holds the identification information of every device in the group, so that the devices in the group can communicate with one another according to the respective identification information.
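The trivial sketch below only illustrates that any one of the three conditions suffices; the record fields and their names are hypothetical and not defined by the application.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class DeviceRecord:
    device_id: str
    wifi_ap_bssid: str = ""          # access point currently joined, if any
    account_id: str = ""             # logged-in account, if any
    group_members: Set[str] = field(default_factory=set)  # IDs configured in the same device group

def can_interconnect(a: DeviceRecord, b: DeviceRecord) -> bool:
    """True if the two devices satisfy at least one of conditions 1)-3) above."""
    same_ap = bool(a.wifi_ap_bssid) and a.wifi_ap_bssid == b.wifi_ap_bssid
    same_account = bool(a.account_id) and a.account_id == b.account_id
    same_group = b.device_id in a.group_members and a.device_id in b.group_members
    return same_ap or same_account or same_group
```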
It can be understood that the different electronic devices may transmit information over the interconnecting wireless network in a broadcast manner or in a point-to-point manner, but are not limited thereto.
According to some embodiments of the present application, the types of wireless network between different electronic devices in the multi-device scenario may be the same or different. For example, electronic device 101 and electronic device 102 are connected through a Wi-Fi network, while electronic devices 101 and 103 are connected through Bluetooth.
In the embodiments of the present application, the types of the electronic devices in the multi-device scenario may be the same or different. For example, electronic devices suitable for the present application may include, but are not limited to, mobile phones, tablet computers, desktop, laptop, and handheld computers, notebook computers, ultra-mobile personal computers (UMPC), netbooks, cellular phones, personal digital assistants (PDA), augmented reality (AR)/virtual reality (VR) devices, media players, smart TVs, smart speakers, smart watches, smart earphones, and the like. As an example, the electronic devices 101-103 shown in FIG. 1 are of different types, illustrated respectively as a mobile phone, a tablet computer, and a smart TV. In addition, the embodiments of the present application place no particular limitation on the specific form of the electronic device. For the specific structure of the electronic device, reference may be made to the description corresponding to FIG. 12 below, which is not repeated here.
It can be understood that, in some embodiments of the present application, the electronic devices in the multi-device scenario all have a voice control function; for example, each has a voice assistant installed with the same wake-up word, such as "Xiaoyi Xiaoyi". Moreover, the electronic devices in the multi-device scenario are all within the effective working range of the voice assistant; for example, the distance from the user (that is, the pickup distance) is less than or equal to a preset distance (such as 5 m), the screen is in use (for example, placed face up, or the screen cover is not closed), Bluetooth is not turned off, the Bluetooth communication range is not exceeded, and so on, but this is not limited thereto.
It can be understood that a voice assistant is an application (app) built on artificial intelligence that, with the help of speech and semantic recognition algorithms, helps the user complete operations such as information queries, device control, and text input through instant question-and-answer voice interaction with the user. The voice assistant may be a system application in the electronic device, or a third-party application.
As shown in FIG. 2, a voice assistant usually adopts staged cascade processing, implementing the above functions through, in sequence, voice wake-up, speech enhancement (SE, also called speech front-end processing), automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), text-to-speech (TTS), and response output. For example, after the user speaks the wake-up word "Xiaoyi Xiaoyi" to wake up the voice assistant and then speaks the voice command "What will the weather be like in Beijing tomorrow?" or "Play music", the voice command passes through SE, ASR, NLU, DM, NLG, TTS, and so on, triggering the electronic device to output a response to the voice command.
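To make the staged cascade concrete, the following minimal Python sketch chains placeholder stages in the order described above. Every stage body is an illustrative stand-in rather than the application's actual algorithm; the sketch only shows how each stage's output feeds the next.

```python
# Minimal sketch of the staged cascade described above (SE -> ASR -> NLU -> DM -> NLG -> TTS).
# All stages below are placeholders; a real system would call actual speech/NLP models.

def speech_enhancement(raw_audio: bytes) -> bytes:
    return raw_audio  # denoising, AEC, beamforming, etc. would happen here

def asr(audio: bytes) -> str:
    return "what will the weather be like in beijing tomorrow"  # placeholder transcript

def nlu(text: str) -> dict:
    return {"intent": "weather_query", "slots": {"city": "Beijing", "date": "tomorrow"}}

def dialog_management(semantics: dict) -> dict:
    return {"action": "answer_weather", "city": semantics["slots"]["city"]}

def nlg(action: dict) -> str:
    return f"Tomorrow will be sunny in {action['city']}."

def tts(text: str) -> bytes:
    return text.encode()  # placeholder "synthesized audio"

def handle_voice_command(raw_audio: bytes) -> bytes:
    """Run one utterance through the cascade and return audio for the response output stage."""
    enhanced = speech_enhancement(raw_audio)
    transcript = asr(enhanced)
    semantics = nlu(transcript)
    action = dialog_management(semantics)
    reply_text = nlg(action)
    return tts(reply_text)
```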
It can be understood that the voice data picked up by an electronic device in the present application is voice data collected directly through a microphone, or voice data that has undergone SE processing after collection, and is used as input to ASR. The text processing result that ASR outputs for the voice data is the basis on which the voice assistant accurately completes subsequent operations such as recognizing and responding to the voice data. Therefore, the quality of the voice data obtained by the voice assistant's sound pickup and fed into ASR affects the accuracy with which the voice assistant recognizes and responds to that voice data.
To address the problem that the sound pickup effect of an electronic device is easily affected by various factors, and thereby obtain better-quality picked-up voice data, the embodiments of the present application take multiple factors into account when performing multi-device-based voice processing in a multi-device scenario. In general, the factors affecting the sound pickup effect of an electronic device include environmental factors 1)-3) and device factors 4)-6), as follows (a simplified scoring sketch follows the list):
1) The distance or orientation of the electronic device relative to the user, that is, the deployment position of the electronic device. In general, the closer the electronic device is to the user, the better its pickup effect.
2) Whether there is external noise near the electronic device, such as an air-conditioning fan or unrelated human voices. It can be understood that the noise around an electronic device is any sound other than the voice command spoken by the user. In general, an electronic device farther away from external noise has a better pickup effect.
3) Whether there is internal noise in the electronic device, such as internal noise represented by audio that the electronic device plays through its speaker. Generally speaking, the internal noise of one electronic device may become external noise for other electronic devices and affect their pickup effect.
4) Information about the microphone module of the electronic device, such as whether the microphone module is a single microphone or a microphone array, whether it is a near-field or far-field microphone array, and the cutoff frequency of the microphone module. Usually a microphone array picks up sound better than a single microphone, a far-field microphone array picks up sound better than a near-field microphone array when the user is far from the device, and the higher the cutoff frequency of the microphone module, the better the pickup effect.
5) The SE capability of the electronic device, for example the noise reduction performance of its microphone module, and the AEC capability of the electronic device, for example whether AEC is in effect on the device. In general, a device whose microphone module has better noise reduction performance, or on which AEC is in effect, has a better SE processing effect, that is, a better pickup effect. For example, a microphone array has better noise reduction performance than a single microphone.
6) The device state of the electronic device, such as one or more of the network connection state, earphone connection state, microphone occupancy state, and profile (scenario mode) information. For example, if the electronic device is in a device state unfavorable to sound pickup, for example its network connection is poor, wired or wireless earphones are connected, the microphone is already occupied, or it is in airplane mode, its pickup effect is hard to guarantee, or it cannot normally cooperate with other devices to pick up sound, for example it cannot normally send picked-up voice data to other electronic devices.
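As a rough, non-authoritative illustration of how such factors might be combined, the sketch below scores each device with hand-picked weights. The field names, weights, and thresholds are assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass

@dataclass
class DeviceFactors:
    snr_db: float             # factors 1)-2): higher SNR suggests closer to the user, less external noise
    playing_audio: bool       # factor 3): device is producing internal noise through its speaker
    far_field_array: bool     # factor 4): microphone module information
    aec_active: bool          # factor 5): SE/AEC capability in effect
    mic_available: bool       # factor 6): device state (microphone free, network usable, ...)

def pickup_score(f: DeviceFactors) -> float:
    """Heuristic pickup-suitability score; weights are illustrative only."""
    if not f.mic_available:
        return float("-inf")          # unfavorable device state rules the device out
    score = f.snr_db                  # environmental factors dominate
    if f.far_field_array:
        score += 5.0
    if f.aec_active:
        score += 3.0
    elif f.playing_audio:
        score -= 8.0                  # internal noise without echo cancellation hurts pickup
    return score
```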
With respect to the different influencing factors above, FIG. 3 to FIG. 11 present multiple embodiments of cooperative voice processing among multiple electronic devices.
Embodiment 1
FIG. 3 shows a scenario of cooperative voice processing among multiple electronic devices at different deployment positions. As shown in FIG. 3, in this multi-device scenario (denoted multi-device scenario 11), the mobile phone 101a, the tablet computer 102a, and the smart TV 103a are interconnected through a wireless network and deployed at different distances from the user, for example at 0.3 meters (m), 1.5 m, and 3.0 m from the user, respectively. In this case, the mobile phone 101a is held by the user, the tablet computer 102a is placed on a desktop, and the smart TV 103a is wall-mounted.
In this multi-device scenario 11, it is assumed that the multiple electronic devices are in a low-noise environment with ambient noise of 20 decibels (dB) or less, and that there is no internal noise produced by an electronic device playing audio in the scenario. Therefore, the influence of external and internal noise on the pickup effect of the electronic devices can be ignored, and the main consideration is the influence of the deployment positions of the electronic devices, such as which electronic device is closest to the user, on the multi-device-based voice processing.
FIG. 4 is a flowchart of a specific cooperative voice processing method in the scenario shown in FIG. 3. As shown in FIG. 4, the process by which the mobile phone 101a, the tablet computer 102a, and the smart TV 103a cooperatively process voice includes:
Step 401: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a each pick up first voice data corresponding to the wake-up word spoken by the user.
For example, the wake-up word pre-registered in the mobile phone 101a, the tablet computer 102a, and the smart TV 103a is "Xiaoyi Xiaoyi". After the user speaks the wake-up word "Xiaoyi Xiaoyi", the mobile phone 101a, the tablet computer 102a, and the smart TV 103a can all detect the speech corresponding to "Xiaoyi Xiaoyi" and then determine whether the corresponding voice assistant needs to be woken up.
It can be understood that if the user speaks within the pickup distance of an electronic device, the electronic device can monitor the corresponding voice data through its microphone and cache it. Specifically, electronic devices such as the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, when no other software or hardware is using the microphone to pick up voice data, can monitor in real time through the microphone whether the user is providing voice input and cache the picked-up voice data, such as the above first voice data.
Step 402: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a each verify the picked-up first voice data to determine whether the corresponding first voice data is the pre-registered wake-up word.
If the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all verify the first voice data successfully, the picked-up first voice data is the wake-up word, and the following step 403 can be performed. If the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all fail to verify the first voice data, the picked-up first voice data is not the wake-up word, and the following step 409 is performed.
In some embodiments, the electronic devices that successfully verify the first voice data corresponding to the wake-up word may be recorded in a list. For example, if the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all verify the first voice data successfully, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a are recorded in a list (referred to, for example, as the candidate answering device list). The devices in this candidate answering device list then participate in the multi-device answering election described below, to elect the electronic device that wakes up its voice assistant and recognizes the user's speech (that is, the answering device below). It can be understood that, in the embodiments of the present application, the multi-device answering election is conducted among the multiple devices that have successfully detected the wake-up word, that is, among the electronic devices that have successfully verified the first voice data.
Step 403: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a elect the smart TV 103a as the answering device.
In some embodiments, the answering device is generally the electronic device the user is accustomed or inclined to use, or the electronic device with the higher probability of successfully recognizing and responding to the user's voice data. Specifically, in a multi-device scenario, the answering device is used to recognize and respond to the user's voice data, for example by performing processing steps such as ASR and NLU on the voice data. There is usually only one answering device in a multi-device scenario, such as one electronic device in the candidate answering device list. In addition, after the electronic device (such as the smart TV 103a) wakes up its voice assistant as the answering device, it may play a wake-up response tone, such as "I'm here". The electronic devices in the multi-device scenario other than the answering device, such as the mobile phone 101a and the tablet computer 102a, do not respond according to the candidate pickup indication, that is, they do not output a wake-up response tone.
The answering device may be selected using various existing techniques, which are also described in detail below.
In some embodiments, the answering device (such as the smart TV 103a) in the multi-device scenario may conduct a cooperative pickup election to elect one pickup device, specifically by performing the following step 404.
Step 404: The smart TV 103a obtains the pickup election information corresponding to the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, respectively, and elects the mobile phone 101a as the pickup device according to the pickup election information.
The pickup election information may be parameters used to determine how good the pickup effect of each electronic device is. For example, in some embodiments, the pickup election information may include at least one of: sound information of the detected user speech (such as the sound information of the above first voice data), microphone module information of each electronic device, device state information of each electronic device, and AEC capability information of each electronic device. In addition, it can be understood that the information used for the pickup device election may also include other information; any information capable of evaluating the pickup function of an electronic device is applicable, and no limitation is imposed here.
The sound information may include the signal-to-noise ratio (SNR), sound intensity (or energy value), reverberation parameters (such as reverberation delay), and the like. The higher the SNR, the higher the sound intensity, and the lower the reverberation delay of the user speech picked up by an electronic device, the better the audio quality of that user speech, that is, the better the pickup effect of the electronic device. Therefore, the sound information of the user speech can be used to elect the pickup device.
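As a rough illustration of how such sound information could be derived on each device, the sketch below estimates frame energy and a crude SNR from a mono waveform. The framing parameters and the way noise frames are chosen (quietest 10% of frames) are assumptions made for illustration, not the application's method.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Split a mono waveform into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def sound_info(x: np.ndarray) -> dict:
    """Estimate average energy and a crude SNR: quietest 10% of frames are treated as noise."""
    frames = frame_signal(x.astype(np.float64))
    energies = (frames ** 2).mean(axis=1) + 1e-12
    noise_floor = np.sort(energies)[: max(1, len(energies) // 10)].mean()
    speech_level = energies.max()
    return {
        "energy": float(energies.mean()),
        "snr_db": float(10.0 * np.log10(speech_level / noise_floor)),
    }
```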
In addition, the microphone module information indicates whether the microphone module of the electronic device is a single microphone or a microphone array, whether it is a near-field or far-field microphone array, and what the cutoff frequency of the microphone module is. Usually, when the user is far from the device, a far-field microphone offers better noise reduction than a near-field microphone, so its pickup effect is better. Noise reduction capability increases in order from a single microphone to a linear-array microphone to a circular-array microphone, and the pickup effect of the corresponding electronic device increases accordingly. Moreover, the higher the cutoff frequency of the microphone module, the better its noise reduction capability and the better the pickup effect of the corresponding electronic device. Therefore, the microphone module information can also be used to elect the pickup device.
The device state information refers to device states that can affect the pickup effect of cooperative pickup by multiple electronic devices, such as the network connection state, earphone connection state, microphone occupancy state, and profile (scenario mode) information. The scenario modes include driving mode, ride mode (such as bus, high-speed rail, or airplane mode), walking mode, sports mode, home mode, and the like. These scenario modes can be determined automatically by the electronic device by reading and analyzing information such as its sensor information, short messages or emails, settings information, or historical operation records, where the sensor information comes from, for example, the Global Positioning System (GPS), an inertial sensor, a camera, or a microphone. It can be understood that if the earphone connection state is occupied, the electronic device is being used by the user, and pickup by the earphone microphone, which is closer to the user, is supported; if the microphone occupancy state indicates that the microphone module is occupied, the electronic device may be unable to pick up sound through the microphone module; and if the network connection state indicates that the electronic device's wireless network is poor, the success rate of transmitting information over the wireless network, such as sending pickup election information to the answering device, is affected. If the scenario mode is one of the above modes such as driving mode or ride mode, the stability and/or rate of the electronic device's wireless network connection may be low, which in turn affects the success rate of the device participating in the pickup election process or the cooperative pickup process. Therefore, the above device state information can also be used to elect the pickup device.
The AEC capability information indicates whether the electronic device has AEC capability and whether AEC is in effect on the device, where the AEC capability is specifically that of the microphone module in the electronic device. It can be understood that, compared with an electronic device on which AEC is not in effect or which has no AEC capability, an electronic device on which AEC is in effect has better SE processing capability and better noise reduction performance, and hence a better pickup effect. Therefore, the AEC capability information can also be used to elect the pickup device. In addition, an electronic device on which AEC is in effect is usually one that is playing audio.
It can be understood that AEC is a speech enhancement technique that cancels, by means of acoustic wave interference, the spurious sound produced by the airborne feedback path between the loudspeaker and the microphone. It effectively alleviates noise interference caused by audio played through the loudspeaker or by spatial reflection of sound waves, thereby improving the quality of the voice data picked up by the electronic device. In addition, SE preprocesses, through hardware or software means, the user voice data collected by the electronic device's microphone using audio signal processing algorithms such as dereverberation, AEC, blind source separation, and beamforming, to improve the quality of the resulting voice data.
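For illustration only, the sketch below shows one textbook way of doing acoustic echo cancellation: a normalized LMS adaptive filter that estimates the loudspeaker-to-microphone path from the playback reference and subtracts the predicted echo. The filter length and step size are arbitrary assumptions, and this is not the application's specific AEC implementation.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 256, mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Subtract an adaptively estimated echo of `ref` (loudspeaker signal) from `mic`."""
    w = np.zeros(taps)                      # estimated echo-path impulse response
    buf = np.zeros(taps)                    # most recent reference samples, newest first
    out = np.zeros_like(mic, dtype=np.float64)
    for n in range(len(mic)):
        buf[1:] = buf[:-1]
        buf[0] = ref[n] if n < len(ref) else 0.0
        echo_est = w @ buf
        e = mic[n] - echo_est               # error = microphone minus predicted echo
        out[n] = e
        w += (mu / (buf @ buf + eps)) * e * buf   # normalized LMS update
    return out
```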
The smart TV 103a can elect the pickup device based on the pickup election information of each electronic device; the specific election scheme is described in detail below. For ease of description, it is assumed below that the smart TV 103a elects the mobile phone 101a as the pickup device.
It can be understood that, in the embodiments of the present application, remote peripheral virtualization may be used to treat the pickup device, or the microphone of the pickup device, as a virtual peripheral node of the answering device, which is invoked by the voice assistant running on the answering device to complete the subsequent cross-device pickup process.
In addition, in some embodiments, after the answering device determines that an electronic device is the pickup device, it may send a pickup indication to that electronic device to instruct it to pick up the user's voice data. Similarly, the answering device may send a stop-pickup indication to the electronic devices in the multi-device scenario other than the pickup device, to instruct them to stop picking up the user's voice data. Alternatively, if an electronic device other than the pickup device receives no indication within a period of time (for example, 5 seconds) after sending its pickup election information to the answering device, that electronic device determines that it is not the pickup device.
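A minimal sketch of this timeout rule is shown below. The asynchronous queue and the message names are hypothetical; the 5-second window follows the example above.

```python
import asyncio

PICKUP_TIMEOUT_S = 5.0  # example window from the description above

async def await_pickup_role(indication_queue: asyncio.Queue) -> bool:
    """Return True if this device was told to act as the pickup device.

    `indication_queue` is a hypothetical queue onto which indications received from
    the answering device ("PICKUP" or "STOP_PICKUP") are pushed.
    """
    try:
        msg = await asyncio.wait_for(indication_queue.get(), timeout=PICKUP_TIMEOUT_S)
    except asyncio.TimeoutError:
        return False          # no indication within the window: not the pickup device
    return msg == "PICKUP"
```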
Step 405: The mobile phone 101a picks up second voice data corresponding to the voice command spoken by the user.
It can be understood that, in subsequent use, the mobile phone 101a acts as the pickup device to pick up the various voice commands spoken by the user. For example, when the user speaks the voice command "What will the weather be like in Beijing tomorrow?", the mobile phone 101a obtains the second voice data by collecting the voice command directly through its microphone module, or the microphone module in the mobile phone 101a collects the voice command and the second voice data is obtained after SE processing.
For ease of description, a "voice command" appearing on its own in the embodiments of the present application may be a voice command, corresponding to a certain event or operation, that is received after the electronic device wakes up its voice assistant. For example, the user's voice command is the above "What will the weather be like tomorrow?" or "Play music". In addition, the terms "voice", "voice command", and "voice data" are sometimes used interchangeably herein; it should be noted that, when their difference is not emphasized, the intended meaning is the same.
Step 406: The mobile phone 101a sends the second voice data to the smart TV 103a.
It can be understood that the mobile phone 101a, as the pickup device, directly forwards the voice data of the voice command to the answering device after picking up the voice command issued by the user, without itself recognizing or responding to the voice command in any way.
In addition, it can be understood that, in other embodiments, if the answering device and the pickup device are the same device, this step is unnecessary, and the answering device (or pickup device) performs speech recognition on the voice data directly after picking up the user's voice command.
Step 407: The smart TV 103a recognizes the second voice data.
Specifically, after the smart TV 103a, as the answering device, receives the voice data picked up by the mobile phone 101a, it can recognize the noise-reduced second voice data through the cascaded processing flow of ASR, NLU, DM, NLG, TTS, and so on.
For example, for the voice command mentioned above, "What will the weather be like in Beijing tomorrow?", ASR can convert the SE-processed second voice data into the corresponding text and normalize, correct, and formalize the colloquial text, for example obtaining the text "What will the weather be like in Beijing tomorrow?".
Step 408: According to the recognition result, the smart TV 103a responds to the user's voice command or controls another electronic device to respond to the user's voice command.
It can be understood that, in the embodiments of the present application, for a recognized user voice command that the answering device can execute, or that only the answering device can execute, the answering device makes a response corresponding to the voice command. For example, for the voice command mentioned above, "What will the weather be like in Beijing tomorrow?", the smart TV 103a answers "Tomorrow will be sunny in Beijing"; for the voice command "Please turn off the TV", the smart TV 103a performs the shutdown function.
It can be understood that the above speech "Tomorrow will be sunny in Beijing" is the response speech output by the answering device through TTS. In addition, the answering device may also control software and hardware such as the system software, the display, and the vibration motor to perform response operations, for example displaying the response text generated by NLG on the display.
For voice commands directed at other electronic devices, the answering device may send them to the corresponding electronic device after recognizing the voice command. For example, for the voice command "Open the curtains", after the smart TV 103a recognizes that the response operation is to open the curtains, it may send an operation instruction to open the curtains to the smart curtain, so that the smart curtain completes the action of opening the curtains through its hardware.
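As a rough sketch of this dispatch step, the snippet below maps a recognized intent to a command sent to a target device. The intent names, device addresses, payload format, and UDP transport are illustrative assumptions rather than anything specified by the application.

```python
import json
import socket

# Illustrative routing table: recognized intent -> (device address, command payload)
INTENT_ROUTES = {
    "open_curtains": ("192.168.1.50", {"device": "smart_curtain", "action": "open"}),
    "turn_off_tv":   ("192.168.1.60", {"device": "smart_tv", "action": "power_off"}),
}

def dispatch_intent(intent: str, port: int = 8888) -> bool:
    """Send the command for a recognized intent to the corresponding IoT device over UDP."""
    route = INTENT_ROUTES.get(intent)
    if route is None:
        return False                      # no route: the answering device handles it locally
    address, payload = route
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(payload).encode(), (address, port))
    return True
```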
It can be understood that the above other electronic devices may be Internet of Things (IoT) devices, such as smart home devices like smart refrigerators, smart water heaters, and smart curtains. In some embodiments, such other electronic devices do not have a voice control function, for example no voice assistant is installed, and they perform the operation corresponding to the user's voice command when triggered by the answering device.
In addition, in the multi-device scenario, after the user speaks the voice command corresponding to the second voice data, the user may continue to speak subsequent voice command data streams, such as the voice command "What should I wear tomorrow?". For the cooperative voice processing of these data streams in the multi-device scenario, reference may be made to the related description of the second voice data above, which is not repeated here.
Step 409: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a do not respond to the first voice data and delete the cached first voice data.
For example, when the mobile phone 101a, the tablet computer 102a, and the smart TV 103a perform step 409, they will not output the wake-up response speech "I'm here" to the user. Of course, if the user goes on to speak a voice command, such as "What will the weather be like in Beijing tomorrow?", these devices will not respond to the voice data corresponding to that voice command either.
It can be understood that if some of the mobile phone 101a, the tablet computer 102a, and the smart TV 103a verify the first voice data successfully while the others fail, only the former continue with the subsequent multi-device cooperative pickup flow. For example, if the mobile phone 101a and the tablet computer 102a verify the first voice data successfully while the smart TV 103a fails, the executing entities of the above step 403 are replaced by the mobile phone 101a and the tablet computer 102a, and the executing entity of step 409 is replaced by the smart TV 103a.
As described above, in the multi-device scenario of the embodiments of the present application, after the user speaks a voice command, the user does not need to specifically operate a particular electronic device to pick up that voice command (such as the voice command corresponding to the second voice data); instead, the answering device automatically uses the pickup device as a peripheral to pick up the user's voice command, and the voice control function is then achieved through the answering device's response to that voice command.
The multi-device-based voice processing method provided by the embodiments of the present application can, through the interaction and cooperation of multiple electronic devices, elect the electronic device that picks up the voice command with the best audio quality as the pickup device, so as to support the answering device in completing the voice interaction flow with the user through its voice assistant; for example, the pickup device may be the electronic device closest to the user with better SE processing capability. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, the impact of device deployment positions on voice assistant recognition accuracy in multi-device scenarios is alleviated, and the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios are improved.
The election of the answering device and the election of the pickup device in the embodiments of the present application are described in detail below.
Election of the answering device
In some embodiments, for the above step 403, the electronic devices in the multi-device scenario may perform the multi-device answering election according to at least one of the following answering election strategies to elect the answering device:
Answering strategy 1) Elect the electronic device closest to the user as the answering device.
For example, for the scenario shown in FIG. 3, the mobile phone 101a may be elected as the answering device. The distance between an electronic device and the user can be characterized by the sound information of the voice data corresponding to the wake-up word picked up by that electronic device. For example, the higher the SNR, the higher the sound intensity, and the lower the reverberation delay of the first voice data, the closer the electronic device is to the user.
Answering strategy 2) Elect the electronic device being actively used by the user as the answering device.
It can be understood that if an electronic device is being actively used by the user, for example the user recently lifted its screen, the user is probably using that electronic device and is more inclined to have it recognize and respond to the user's voice data.
In some embodiments, whether an electronic device is being actively used by the user can be characterized by device usage record information. The device usage record information includes at least one of: screen-on duration, screen-on frequency, frequency of voice assistant use, and the like. It can be understood that the longer the screen-on duration, the higher the screen-on frequency, and the higher the frequency of voice assistant use, the more actively the electronic device is being used by the user. For example, according to the device usage record information of the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, the smart TV 103a, which is being actively used by the user, can be elected as the answering device.
Answering strategy 3) Elect an electronic device equipped with a far-field microphone array as the answering device.
It can be understood that electronic devices equipped with far-field microphone arrays are mostly public devices, that is, electronic devices the user is more inclined to have recognize and respond to the user's voice data. Public devices are usually used by the user at relatively long distances (such as 1-3 m) and from various directions, and support shared use by multiple people, such as smart TVs or smart speakers. Compared with small electronic devices such as mobile phones and tablet computers, electronic devices equipped with far-field microphone arrays usually have better loudspeaker performance and larger screens, so the response speech they output or the response information they display for the user's voice command is more effective. Therefore, an electronic device equipped with a far-field microphone array is suitable as the answering device.
In some embodiments, whether an electronic device is equipped with a far-field microphone array is characterized by the microphone module information. For example, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a elect, according to the microphone module information, the smart TV 103a, which is equipped with a far-field microphone array, as the answering device.
Answering strategy 4) Elect a public device as the answering device.
In some embodiments, whether an electronic device is a public device may also be characterized by public device indication information. As an example, the public device indication information of the smart TV 103a indicates that the smart TV 103a is a public device, and the multi-device scenario 11 elects the smart TV 103a as the answering device. Similarly, for other descriptions of answering strategy 4), reference may be made to the related description of answering strategy 3), which is not repeated here.
If two or more electronic devices in the multi-device scenario satisfy the same answering election strategy, any one of those electronic devices may be selected as the answering device.
It can be understood that, for a description of an answering device satisfying the above answering strategies 1) to 4) simultaneously, reference may be made to the related descriptions of the answering device satisfying each of the answering conditions 1) to 4), which are not repeated here. In some embodiments, different priorities may be set in advance for the different answering strategies: if, in the multi-device scenario, one electronic device satisfies the answering condition of the highest priority and another electronic device satisfies an answering condition of lower priority, the former is taken as the answering device.
In other embodiments, in addition to the answering election strategies listed above, any electronic device that has verified the first voice data successfully, that is, any electronic device in the above candidate answering device list, may also be selected as the answering device.
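To illustrate the priority idea, the sketch below walks the strategies in an assumed priority order and returns the first matching candidate, falling back to the candidate with the highest wake-word SNR as a stand-in for strategy 1). The priority order, field names, and the SNR proxy for distance are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    wake_snr_db: float       # sound information of the wake-word (first) voice data
    actively_used: bool      # from device usage record information
    far_field_array: bool    # from microphone module information
    public_device: bool      # from public device indication information

def elect_answering_device(candidates: List[Candidate]) -> Candidate:
    """Pick the answering device using an assumed priority over strategies 2)-4),
    with strategy 1) (closest to the user, approximated by wake-word SNR) as the fallback."""
    checks = [
        lambda c: c.actively_used,    # strategy 2): device actively used by the user
        lambda c: c.far_field_array,  # strategy 3): equipped with a far-field microphone array
        lambda c: c.public_device,    # strategy 4): public device
    ]
    for check in checks:              # assumed priority order, highest first
        matches = [c for c in candidates if check(c)]
        if matches:
            return matches[0]         # any device satisfying the same strategy may be chosen
    return max(candidates, key=lambda c: c.wake_snr_db)  # strategy 1) as the fallback
```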
In some embodiments, any one electronic device in the multi-device scenario may act as the master device and perform the step of electing the answering device. For example, the mobile phone 101a, as the master device, elects the smart TV 103a as the answering device and sends an answering indication to the smart TV 103a to instruct it to subsequently recognize and respond to the voice data corresponding to the user's voice commands. In addition, the master device may send candidate pickup indications to the electronic devices in the multi-device scenario other than the answering device, to instruct them not to recognize the user's voice commands. Alternatively, if an electronic device other than the answering device receives no indication within a preset time (for example, 10 seconds) after successfully verifying the first voice data, that electronic device determines that it is not the answering device.
In other embodiments, each electronic device in the multi-device scenario may perform the operation of electing the answering device. For example, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a each perform the multi-device answering election and each elect the smart TV 103a as the answering device. The smart TV 103a can then determine that it is the answering device and wake up its voice assistant to recognize and respond to the voice data corresponding to the user's voice commands. Similarly, the mobile phone 101a and the tablet computer 102a each determine that they are not the answering device and neither recognize nor respond to the user's voice commands.
In some embodiments, the electronic device performing the multi-device answering election obtains the answering election information of each electronic device in the multi-device scenario and elects the answering device according to that answering election information.
For example, the answering election information of an electronic device includes at least one of the following: the sound information of the first voice data, device usage record information, microphone module information, and public device indication information, but is not limited thereto.
In addition, after the answering device obtains the answering election information of each electronic device, it may cache that information.
Election of the pickup device
Specifically, in the above step 404, the smart TV 103a may receive the corresponding pickup election information sent by the mobile phone 101a and by the tablet computer 102a, and read its own pickup election information.
It should be noted that the embodiments of the present application place no limitation on the order in which the pickup election information corresponding to the mobile phone 101a and the tablet computer 102a is sent, or on the order in which the different pieces of information within each pickup election information are sent; any feasible sending order is acceptable.
In addition, regarding the pickup election information of each electronic device, if in the above step 403 the answering device has already computed and cached some information of each electronic device, such as the sound information of the first voice data, then in step 404 the cached information can be read without recomputing it.
Specifically, the embodiments of the present application may comprehensively consider the different pieces of information in the pickup election information corresponding to the electronic devices, that is, the different factors affecting the pickup effect of the electronic devices, and set a pickup election strategy so that the electronic device with the better pickup effect in the multi-device scenario serves as the pickup device.
It can be understood that, in the embodiments of the present application, the multi-device sound pickup election is carried out among the devices that successfully detected the wake-up word, that is, among the electronic devices that successfully verified the first voice data. Specifically, the devices in the candidate answering device list described above may participate in the multi-device sound pickup election to elect the sound pickup device, in which case the candidate answering device list may be referred to as the candidate sound pickup device list. During the multi-device sound pickup election, every electronic device in the candidate sound pickup device list may serve as a candidate sound pickup device; for example, the mobile phone 101a, the tablet computer 102a and the smart TV 103a may all serve as candidate sound pickup devices, that is, the electronic devices among which the sound pickup election is performed according to the sound pickup election information.
In some embodiments, an end-to-end method such as an artificial neural network or an expert system may implement the sound pickup election strategy and select the electronic device with the better sound pickup effect from the candidate sound pickup device list as the sound pickup device. Specifically, the sound pickup election information of each electronic device in the candidate sound pickup device list is used as the input of the artificial neural network or expert system, and the output of the artificial neural network or expert system is the sound pickup device. For example, if the sound pickup election information of the mobile phone 101a, the tablet computer 102a and the smart TV 103a is used as the input of the artificial neural network or expert system, and its output is the mobile phone 101a, then the mobile phone 101a is elected as the sound pickup device.
上述人工神经网络可以为深度神经网络(Deep Neural Network,DNN)、卷积神经网络(Convolutional Neural Network,CNN)、长短期记忆网络(Long Short Term Memory,LSTM)或循环神经网络(Recurrent Neural Network,RNN)等,本申请实施例对此不做具体限定。The above artificial neural network can be a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a long short term memory network (Long Short Term Memory, LSTM) or a recurrent neural network (Recurrent Neural Network, RNN), etc., which are not specifically limited in the embodiments of the present application.
In addition, in other embodiments, the sound pickup election strategy that selects the electronic device with the better sound pickup effect from the candidate sound pickup device list as the sound pickup device may be implemented through staged cascade processing. Specifically, feature extraction or numerical quantization may first be performed on each parameter vector in the sound pickup election information of each electronic device in the candidate sound pickup device list (that is, on each piece of sound pickup election information), and then an algorithm such as a decision tree or logistic regression may be used to output the selection result for the sound pickup device. For example, through staged cascade processing, the parameter vectors in the sound pickup election information of the mobile phone 101a, the tablet computer 102a and the smart TV 103a may be feature-extracted or numerically quantized, and a decision tree, logistic regression or similar algorithm may then output the mobile phone 101a as the selection result, that is, elect the mobile phone 101a as the sound pickup device.
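Purely as a sketch of the staged cascade described above, and not as the method defined here, the following Python example quantizes a few hypothetical election parameters into a feature vector and ranks the candidates with a hand-written logistic-regression-style score; the feature names, normalization constants, weights and example values are all assumptions.

```python
import math
from typing import Dict, List

def quantize_features(info: Dict[str, float]) -> List[float]:
    """Turn raw election parameters into a numeric feature vector (illustrative)."""
    return [
        info.get("snr_db", 0.0) / 30.0,                       # normalized signal-to-noise ratio
        info.get("intensity_db", 0.0) / 90.0,                 # normalized sound intensity
        1.0 - min(info.get("reverb_ms", 0.0) / 500.0, 1.0),   # lower reverberation is better
        1.0 if info.get("aec_active", 0.0) else 0.0,          # AEC in effect
    ]

def logistic_score(features: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Logistic-regression-style score in (0, 1); higher means a better pickup candidate."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def elect_pickup_device(candidates: Dict[str, Dict[str, float]]) -> str:
    """Return the candidate with the highest score (hypothetical weights)."""
    weights = [2.0, 1.0, 1.5, 1.0]
    return max(candidates, key=lambda d: logistic_score(quantize_features(candidates[d]), weights))

# Example inputs loosely modelled on the mobile phone 101a, tablet 102a and smart TV 103a.
candidates = {
    "phone_101a":  {"snr_db": 24.0, "intensity_db": 70.0, "reverb_ms": 80.0,  "aec_active": 0.0},
    "tablet_102a": {"snr_db": 15.0, "intensity_db": 62.0, "reverb_ms": 150.0, "aec_active": 0.0},
    "tv_103a":     {"snr_db": 10.0, "intensity_db": 55.0, "reverb_ms": 300.0, "aec_active": 1.0},
}
print(elect_pickup_device(candidates))  # prints "phone_101a" with these illustrative numbers
```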
Specifically, in some embodiments, the answering device may carry out the multi-device sound pickup election through at least one of a first type of cooperative sound pickup strategy and a second type of cooperative sound pickup strategy. For example, the process may consist of two stages. In the first stage, the answering device uses the first type of cooperative sound pickup strategy to remove from the candidate sound pickup device list those disadvantaged devices that are clearly unsuitable for the subsequent cooperative sound pickup, or to directly decide that the answering device itself is the most suitable sound pickup device. In the second stage, the answering device uses the second type of cooperative sound pickup strategy to elect, according to the sound pickup election information of each electronic device in the candidate sound pickup device list, the electronic device with the better sound pickup effect as the sound pickup device. It can be understood that if the first stage does not decide on a sound pickup device, the second stage is executed to elect one.
在一些实施例中,上述第一类协同拾音策略可以包括以下策略a1)至a6)中的至少一项。In some embodiments, the above-mentioned first type of cooperative voice pickup strategy may include at least one of the following strategies a1) to a6).
a1)将已连接耳机且不是应答设备的电子设备,确定为非候选拾音设备。a1) Determine the electronic device that is connected to the headset and is not an answering device as a non-candidate pickup device.
The state in which an electronic device is connected to an earphone is indicated by the earphone connection state information. Specifically, if an electronic device is connected to a wired or wireless earphone and is not the answering device, then, since such a device only supports close-range pickup through the earphone microphone, there is a high probability that it is far from the user or not currently being used by the user. Selecting such a device as the sound pickup device could cause speech recognition to fail, so the device is marked as a non-candidate sound pickup device unsuitable for participating in the multi-device sound pickup election and is removed from the candidate sound pickup device list. It can be understood that a non-candidate sound pickup device will not participate in the multi-device sound pickup election, that is, it will not be elected as the sound pickup device.
a2)将处于预设网络状态(即网络状态较差)且不是应答设备的电子设备,确定为非候选拾音设备。a2) Determine the electronic device that is in the preset network state (that is, the network state is poor) and is not an answering device as a non-candidate pickup device.
An electronic device being in the preset network state is indicated by the network connection state information. Specifically, if an electronic device has a poor network state (for example, a low network communication rate, a weak wireless network signal, or frequent recent network disconnections) and is not the answering device, then, in order to avoid data loss or delay when the answering device calls that device, which would affect the subsequent cooperative sound pickup and voice interaction flow, the device is marked as a non-candidate device unsuitable for participating in the multi-device sound pickup election and is removed from the candidate sound pickup device list.
a3)将麦克风模组处于被占用状态且不是应答设备的电子设备,确定为非候选拾音设备。a3) Determine the electronic device whose microphone module is in an occupied state and is not an answering device as a non-candidate pickup device.
其中,麦克风模组处于被占用状态由麦克风占用信息指示。如果一个电子设备的麦克风模组被除语音助手之外的其它应用(如录音机)占用,且不是应答设备,则将其作为非候选拾音设备,并从候选拾音设备列表中移除。具体的,电子设备的麦克模组被其它应用占用,说明该电子设备可能无法使用麦克风模组进行拾音,那么将电子设备标记为不适于参与协同拾音的设备。The fact that the microphone module is in an occupied state is indicated by the microphone occupancy information. If the microphone module of an electronic device is occupied by an application other than a voice assistant (such as a voice recorder), and it is not an answering device, it is regarded as a non-candidate pickup device and removed from the list of candidate pickup devices. Specifically, if the microphone module of the electronic device is occupied by other applications, indicating that the electronic device may not be able to use the microphone module for sound pickup, the electronic device is marked as a device that is not suitable for participating in collaborative sound pickup.
a4)将处于预设网络状态的应答设备,确定为拾音设备。a4) Determine the answering device in the preset network state as the sound pickup device.
If the answering device's own network connection state is poor, then, to avoid the answering device failing when it calls another candidate sound pickup device, it is directly decided that the answering device is the most suitable sound pickup device, and the answering device uses its local microphone module for the subsequent sound pickup.
a5)将已连接耳机的应答设备,确定为拾音设备。a5) Determine the answering device connected to the headset as the pickup device.
其中,如果应答设备已连接有线或无线耳机,那么应答设备有较高概率是最靠近用户或用户正在使用的设备,因而直接决策选择应答设备为拾音设备。Among them, if the answering device has been connected to a wired or wireless headset, then the answering device has a high probability of being the device closest to the user or the device being used by the user, so it is directly decided to select the answering device as a sound pickup device.
a6)将处于预设情景模式的电子设备,确定为拾音设备。a6) Determine the electronic device in the preset scene mode as a sound pickup device.
其中,如果应答设备处于预设情景模式(如地铁模式、飞行模式、驾驶模式、旅行模式)下,可直接决策选择与该情景模式对应的电子设备作为拾音设备,以保证系统性能。例如,在驾驶模式下,为避免行车噪声干扰,可固定选择麦克风降噪能力较好的电子设备为拾音设备。又如,在旅行模式下,为避免设备通信功耗上升和续航时间下降,可固定选择应答设备为拾音设备。Among them, if the answering device is in a preset scene mode (such as subway mode, flight mode, driving mode, travel mode), it can directly decide to select the electronic device corresponding to the scene mode as the sound pickup device to ensure system performance. For example, in the driving mode, in order to avoid the interference of driving noise, the electronic device with better noise reduction capability of the microphone can be fixedly selected as the sound pickup device. For another example, in the travel mode, in order to avoid the increase in the communication power consumption of the device and the decrease in the battery life, the answering device can be fixedly selected as the sound pickup device.
另外,上述第二类拾音选举策略可以包括策略b1)至b4)中的至少一项。In addition, the above-mentioned second type of voice selection strategy may include at least one of strategies b1) to b4).
b1)将AEC生效的电子设备作为拾音设备。b1) Use the electronic device with AEC in effect as a sound pickup device.
That is, an electronic device in the candidate sound pickup device list whose AEC capability information indicates that AEC is in effect is used as the sound pickup device, since an electronic device with AEC in effect generally has a better sound pickup effect.
It can be understood that an electronic device with AEC in effect is usually one that is currently playing audio through its speaker. If an electronic device is playing audio but has no AEC capability, or its AEC is not in effect, the playback will seriously interfere with the device itself, for example by severely degrading its own sound pickup effect. Conversely, if the device playing audio has internal noise reduction capability and AEC in effect, the influence of the internal noise generated by its own playback on its sound pickup effect can be eliminated.
b2)将降噪能力较好的电子设备作为拾音设备。b2) Use electronic devices with better noise reduction capabilities as sound pickup devices.
That is, an electronic device in the candidate sound pickup device list whose microphone model parameters indicate that its microphone module has better noise reduction capability is used as the sound pickup device. For example, when the user is far from the devices or the first voice data is weak, an electronic device equipped with a far-field microphone array may be used as the sound pickup device. Specifically, by determining whether an electronic device's microphone module is a near-field microphone or a far-field microphone, a far-field-microphone device with better noise reduction capability can be elected as the sound pickup device.
b3)将距离用户最近的电子设备作为拾音设备。b3) Use the electronic device closest to the user as a sound pickup device.
That is, the electronic device in the candidate sound pickup device list that is closest to the user is used as the sound pickup device. Generally, the device in the candidate sound pickup device list whose picked-up voice data corresponding to the user's voice (such as the first voice data) has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay is the one closest to the user and has the best sound pickup effect.
b4)将距离外部噪声源最远的电子设备作为拾音设备。b4) Use the electronic device farthest from the external noise source as the sound pickup device.
That is, the electronic device in the candidate sound pickup device list that is farthest from the external noise source is used as the sound pickup device. Generally, the device in the candidate sound pickup device list whose picked-up voice data corresponding to the user's voice (such as the first voice data) has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay is the one farthest from the external noise source and has the best sound pickup effect.
It can be understood that the sound pickup election strategies above (the first type and the second type) include, but are not limited to, the examples given. For cases in which a device simultaneously satisfies several of strategies a1) to a6) and b1) to b4), reference may be made to the descriptions above of a sound pickup device satisfying each individual strategy, and the details are not repeated here.
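To make the two-stage flow concrete, the following Python sketch first applies first-type rules (direct decisions for the answering device and removal of unsuitable candidates) and then ranks the remaining candidates with second-type rules; the attribute names and the lexicographic ranking order are illustrative assumptions, not requirements of the strategies above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    device_id: str
    is_answering_device: bool = False
    headset_connected: bool = False   # relevant to a1) and a5)
    poor_network: bool = False        # relevant to a2) and a4)
    mic_occupied: bool = False        # relevant to a3)
    aec_active: bool = False          # relevant to b1)
    far_field_mic: bool = False       # relevant to b2)
    snr_db: float = 0.0               # stands in for b3)/b4) sound information

def elect_pickup(candidates: List[Candidate]) -> Candidate:
    # Assumes exactly one answering device is present in the list.
    answering = next(c for c in candidates if c.is_answering_device)

    # Stage 1, first-type strategies: direct decisions for the answering device (a4, a5).
    if answering.poor_network or answering.headset_connected:
        return answering

    # Stage 1, first-type strategies: drop unsuitable non-answering candidates (a1, a2, a3).
    remaining = [
        c for c in candidates
        if c.is_answering_device
        or not (c.headset_connected or c.poor_network or c.mic_occupied)
    ]

    # Stage 2, second-type strategies: rank what is left. The lexicographic order
    # (AEC in effect, then far-field microphone, then SNR) is an illustrative choice.
    return max(remaining, key=lambda c: (c.aec_active, c.far_field_mic, c.snr_db))
```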
In some embodiments, different priorities may be set in advance for different sound pickup election strategies, and the strategy with the higher priority is applied first when electing the sound pickup device. A priority may be assigned to a single strategy or to a combination of strategies. For example, if the combination of strategy b1) and strategy b3) has a higher priority than strategy b3) alone, and one electronic device in the candidate sound pickup device list satisfies both b1) and b3) while another satisfies only b3), the former may be elected as the sound pickup device.
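One hypothetical way to encode such priorities is an ordered rule table in which a strategy combination (here b1) together with b3)) outranks a single strategy; the rule names, attributes and ordering below are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A device is described by a small attribute dictionary; each rule returns True
# if the device satisfies the corresponding strategy (names are illustrative).
Device = Dict[str, bool]
Rule = Callable[[Device], bool]

def rule_b1(d: Device) -> bool:   # AEC in effect
    return d.get("aec_active", False)

def rule_b3(d: Device) -> bool:   # closest to the user
    return d.get("closest_to_user", False)

# Rules listed from highest to lowest priority; the combination b1)+b3) outranks b3) alone.
PRIORITISED_RULES: List[Tuple[str, Rule]] = [
    ("b1+b3", lambda d: rule_b1(d) and rule_b3(d)),
    ("b3", rule_b3),
    ("b1", rule_b1),
]

def elect_by_priority(devices: Dict[str, Device]) -> str:
    for _name, rule in PRIORITISED_RULES:
        matches = [dev_id for dev_id, attrs in devices.items() if rule(attrs)]
        if matches:
            return matches[0]      # ties could be broken by SNR or other sound information
    return next(iter(devices))     # fall back if no rule matches
```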
For example, in multi-device scenario 11, with the smart TV 103a as the answering device and in a low-noise environment without external noise interference, the mobile phone 101a, which has better SE processing capability, the highest voice intensity or signal-to-noise ratio, and the lowest reverberation delay, can be elected from the mobile phone 101a, the tablet computer 102a and the smart TV 103a as the sound pickup device according to strategies b2) and b3) above. In this case the mobile phone 101a is the electronic device closest to the user, at a distance of 0.3 m. In this way, the influence of the deployment positions of the electronic devices on their sound pickup effect in the multi-device scenario can be avoided.
Embodiment 2
Because a user tends to subconsciously raise their voice when uttering the wake-up word, and because of factors such as movement of the user or of the electronic devices, sound information such as the sound intensity and signal-to-noise ratio of the wake-up word's voice data can hardly represent accurately the audio quality with which an electronic device will subsequently pick up the user's voice commands. Therefore, the sound information of the voice data corresponding to the voice command spoken after the wake-up word can be used as sound pickup election information to elect the sound pickup device.
FIG. 5 shows a flowchart of another multi-device-based voice processing method. This flow differs from the flow shown in FIG. 4 in that it adds a step of electing the sound pickup device according to the sound information of the voice data of the voice command spoken by the user. Specifically, as shown in FIG. 5, the flow includes:
步骤501-步骤503,与上述步骤401至403相同,此处不再赘述。 Steps 501 to 503 are the same as the above-mentioned steps 401 to 403, and are not repeated here.
Step 504: The smart TV 103a obtains the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a, respectively.
Step 505: The mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up the voice command spoken by the user, and the voice data corresponding to the voice within a first duration of the voice command is selected as third voice data.
For example, the first duration is X seconds (for example, 3 s). In some embodiments, the voice corresponding to the third voice data is any segment of the first duration within the voice command spoken by the user. For example, the voice within the first duration is the voice within the first X seconds of the voice command spoken by the user.
It can be understood that, when the voice within the first duration is the voice within the first X seconds spoken by the user, in step 505 the mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up only the third voice data and do not pick up the voice data corresponding to the voice command spoken by the user after the first X seconds. In this case, the third voice data may be a segment of voice command spoken by the user before the voice command corresponding to the second voice data (for example, "What's the weather like in Beijing tomorrow?"). In this way, the answering device obtains the third voice data quickly, while avoiding the waste of device resources that would result from each electronic device in the multi-device scenario spending a long time picking up the user's voice commands.
In addition, in other embodiments, in step 505 the mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up the voice data corresponding to a complete voice command spoken by the user, such as the second voice data corresponding to "What's the weather like in Beijing tomorrow?", and then select the first X seconds of that second voice data as the third voice data. In this case, the third voice data may be the opening segment of the voice command corresponding to the second voice data (such as "What's the weather like in Beijing tomorrow?"); for example, the third voice data corresponds to "Tomorrow".
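As a small illustration of selecting "the voice within the first duration", the sketch below trims a buffer of audio samples to its first X seconds; the 16 kHz sampling rate and 3-second duration are assumed values.

```python
SAMPLE_RATE_HZ = 16000   # assumed sampling rate

def first_x_seconds(samples: list, x_seconds: float = 3.0) -> list:
    """Keep only the samples covering the first x_seconds of the picked-up command."""
    return samples[: int(SAMPLE_RATE_HZ * x_seconds)]

# Example: a 5-second command is trimmed to its first 3 seconds as "third voice data".
command_samples = [0.0] * (SAMPLE_RATE_HZ * 5)
third_voice_data = first_x_seconds(command_samples)
assert len(third_voice_data) == SAMPLE_RATE_HZ * 3
```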
步骤506:智能电视103a分别获取手机101a、平板电脑102a和智能电视103a拾音得到的第三语音数据的声音信息。Step 506: The smart TV 103a obtains the sound information of the third voice data obtained by the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
For example, the sound information of the third voice data includes at least one of the following: signal-to-noise ratio, sound intensity (or energy value), reverberation parameters, and so on. Generally speaking, the higher the signal-to-noise ratio and sound intensity and the lower the reverberation delay of the third voice data detected by an electronic device, the better the quality of that third voice data and the closer it is to the voice command actually spoken by the user, which in turn indicates that the electronic device is closer to the user. The sound information of the third voice data can therefore be used as sound pickup election information for electing the sound pickup device.
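The sketch below shows one simple way such sound information could be estimated from raw samples: an RMS level in dB and a crude signal-to-noise ratio that compares the command segment with a noise-only segment; the measurement conventions are assumptions and are not prescribed above.

```python
import math
from typing import Sequence

def rms_db(samples: Sequence[float]) -> float:
    """RMS level of a block of samples in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")

def snr_db(speech_samples: Sequence[float], noise_samples: Sequence[float]) -> float:
    """Crude SNR estimate: speech level minus the level of a noise-only segment."""
    return rms_db(speech_samples) - rms_db(noise_samples)
```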
在一些实施例中,手机101a和平板电脑102a可以分别计算得到第三语音数据的声音信息,再将第三语音数据的声音信息发送给智能电视103a。或者,手机101a和平板电脑102a可以分别将检测到的第三语音数据发送给智能电视103a,再由智能电视103a计算手机101a和平板电脑102a分别对应的第三语音数据的声音信息。In some embodiments, the mobile phone 101a and the tablet computer 102a can separately obtain the sound information of the third voice data, and then send the sound information of the third voice data to the smart TV 103a. Alternatively, the mobile phone 101a and the tablet computer 102a can respectively send the detected third voice data to the smart TV 103a, and then the smart TV 103a calculates the sound information of the third voice data corresponding to the mobile phone 101a and the tablet computer 102a respectively.
步骤507:智能电视103a将第三语音数据的声音信息加入拾音选举信息,并根据手机101a、平板电脑102a和智能电视103a分别对应的拾音选举信息选举出手机101a为拾音设备。Step 507: The smart TV 103a adds the voice information of the third voice data to the voice-picking election information, and elects the mobile phone 101a as the voice-picking device according to the voice-picking election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
Steps 504 to 507 are similar to step 404 above, and the similarities are not repeated. The difference is that, in step 507 of this embodiment, the smart TV 103a additionally obtains the voice data corresponding to the first X seconds of the voice command spoken by the user (that is, the third voice data), so that the smart TV 103a can decide, based on the sound information of the third voice data detected by each electronic device, that the sound pickup device is the mobile phone 101a.
Specifically, in step 507 the smart TV 103a can use the sound information of the third voice data corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a, respectively, to judge whether the smart TV 103a satisfies the sound pickup election strategies b3) and/or b4) above. Specifically, if it is determined from the sound information of the third voice data that the smart TV 103a is the electronic device closest to the user or farthest from the noise, the smart TV 103a is elected as the sound pickup device.
It can be understood that the device in the candidate sound pickup device list whose detected third voice data has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay is generally the electronic device closest to the user. That device detects the third voice data with the best quality and has the best sound pickup effect.
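A minimal illustration of ranking candidates by these indicators is a lexicographic comparison of (sound intensity, signal-to-noise ratio, negative reverberation delay); the example values are invented for illustration.

```python
# sound_info maps device_id -> (intensity_db, snr_db, reverb_delay_ms) for its third voice data.
def closest_device(sound_info: dict) -> str:
    return max(
        sound_info,
        key=lambda dev: (sound_info[dev][0], sound_info[dev][1], -sound_info[dev][2]),
    )

example = {
    "phone_101a":  (72.0, 25.0, 60.0),
    "tablet_102a": (65.0, 18.0, 120.0),
    "tv_103a":     (58.0, 12.0, 260.0),
}
print(closest_device(example))  # -> "phone_101a" with these illustrative numbers
```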
步骤508-步骤512,与上述步骤405-409类似,此处不再赘述。Steps 508-512 are similar to the above-mentioned steps 405-409, and are not repeated here.
In the embodiments of the present application, in a multi-device scenario the sound pickup device can be elected not only according to information such as the sound information of the first voice data corresponding to the user's wake-up word, but also according to the sound information of the third voice data corresponding to the voice within the first duration (for example, the first X seconds) of the voice command spoken by the user. In this way, taking into account factors such as movement of the user or of the electronic devices, adding the sound information of the user's voice command to the sound pickup election information can further improve the accuracy of electing the sound pickup device, and thereby improve the accuracy of speech recognition in the multi-device scenario.
Embodiment 3
In some multi-device scenarios where external noise is present, especially when some electronic devices are at the same distance from the user or are even of the same type (for example, both are mobile phones), the distance between each electronic device and the external noise, that is, the influence of the external noise on each device's sound pickup effect, can be the main consideration in the sound pickup election. It can be understood that if different electronic devices are of the same type and at the same distance from the user, their sound pickup effects are the same.
Specifically, FIG. 6 shows a multi-device scenario of multi-device-based voice processing under external noise interference. In this scenario (denoted multi-device scenario 12), a mobile phone 101b, a mobile phone 102b and a smart TV 103b are interconnected through a wireless network and deployed 1.5 m, 1.5 m and 3.0 m away from the user, respectively. The mobile phone 101b and the mobile phone 102b may be lying idle on a desktop, and the smart TV 103b may be mounted on a wall. In multi-device scenario 12, an external noise source 104 is present near the mobile phone 102b; for example, the external noise source may be a running air conditioner or another device playing audio. Therefore, in this scenario, the influence of the external noise source 104 on the sound pickup effect of each electronic device is the main consideration in the multi-device-based voice processing flow.
图7是基于图6具体的协同处理语音的方法的流程。如图7所示,手机101b、手机102b、智能电视103b协同处理语音的方法的过程包括:FIG. 7 is a flowchart of a specific method for collaboratively processing speech based on FIG. 6 . As shown in FIG. 7 , the process of the method for the mobile phone 101b, the mobile phone 102b, and the smart TV 103b to cooperatively process voice includes:
步骤701-步骤709,与上述步骤401-步骤409类似,相同之处不作赘述。 Steps 701 to 709 are similar to the above-mentioned steps 401 to 409, and the similarities are not repeated.
区别仅在于执行主体有变化,多设备场景12中通过无线网络互连的电子设备由手机101a、平板电脑102a和智能电视103a变为手机101b、手机102b和智能电视103b。具体的,步骤703中选举出的应答设备为智能电视103b,并且步骤704中选举出的拾音设备为手机101b。The only difference is that the execution subject changes. In the multi-device scenario 12, the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a and smart TV 103a to mobile phone 101b, mobile phone 102b and smart TV 103b. Specifically, the answering device selected in step 703 is the smart TV 103b, and the sound pickup device selected in step 704 is the mobile phone 101b.
In multi-device scenario 12, the mobile phone 101b and the mobile phone 102b are both 1.5 m from the user; compared with the smart TV 103b, which is 3 m from the user, they are the electronic devices closest to the user. However, with the external noise source 104 near the mobile phone 102b, the only factor distinguishing the sound pickup effects of the mobile phone 101b and the mobile phone 102b is their distance from the external noise source 104. Clearly, the mobile phone 101b is farther from the external noise source 104 than the mobile phone 102b is. Therefore, unlike in multi-device scenario 11, in step 704 of multi-device scenario 12 the smart TV 103b, acting as the answering device, can, when there is interference from an external noise source, elect according to the sound pickup election strategy (such as strategy b4)) the mobile phone 101b, which is far from the external noise source and has the highest voice intensity or signal-to-noise ratio and the lowest reverberation delay, as the sound pickup device. In this way, the influence of external noise on the sound pickup effect of the electronic devices in the multi-device scenario can be avoided.
Similarly, referring to steps 505 to 507 shown in FIG. 5, in multi-device scenario 12 the smart TV 103b can also obtain the sound information of the third voice data corresponding to the voice within the first duration of the voice command spoken by the user, as picked up by each electronic device, add that sound information to the sound pickup election information in step 705, and elect the mobile phone 101b, which is farther from the external noise source 104, as the sound pickup device; the details are not repeated here.
Thus, with the multi-device-based voice processing method provided in the embodiments of the present application, the sound pickup device may possess one or more favorable factors, such as being closest to the user, having internal noise reduction capability (such as SE processing capability), and being far from external noise sources. In this way, the influence of external noise interference on the recognition accuracy of the voice assistant in the multi-device scenario can be alleviated, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
Embodiment 4
多设备场景中存在内部噪声,如正在外放音频的电子设备产生的噪声,该噪声为60-80dB的噪声,将对周边其它设备拾取语音指令产生强烈干扰。此时,可以主要考虑该内部噪声对多设备协同拾音的拾音效果的影响,如通过将外放音频的电子设备作为拾音设备,实现多设备拾音选举。In a multi-device scenario, there is internal noise, such as the noise generated by an electronic device that is playing audio. The noise is 60-80dB, which will strongly interfere with other surrounding devices picking up voice commands. At this time, the influence of the internal noise on the sound pickup effect of the multi-device cooperative sound pickup can be mainly considered. For example, by using an electronic device that emits audio as a sound pickup device, multi-device sound pickup selection can be realized.
Specifically, FIG. 8 shows a multi-device-based voice processing scenario under internal noise interference. In this scenario (denoted multi-device scenario 13), a mobile phone 101c, a tablet computer 102c and a smart TV 103c are interconnected through a wireless network and deployed 0.3 m, 1.5 m and 3.0 m away from the user, respectively.
其中,智能电视103c处于外放音频的状态,且智能电视103c具备内部噪声降噪(即降噪能力)或AEC能力。例如,智能电视103c播放的音频的音量为60-80dB,将对手机101c、平板电脑102c的拾音效果产生强烈干扰。因此,在该场景中,主要考虑智能电视103c的内部噪声对电子设备拾音效果的影响,进行基于多设备的语音处理的流程。Among them, the smart TV 103c is in the state of playing audio, and the smart TV 103c has internal noise reduction (ie, noise reduction capability) or AEC capability. For example, the volume of the audio played by the smart TV 103c is 60-80 dB, which will strongly interfere with the sound pickup effect of the mobile phone 101c and the tablet computer 102c. Therefore, in this scenario, the flow of voice processing based on multi-devices is performed mainly considering the influence of the internal noise of the smart TV 103c on the sound pickup effect of the electronic device.
FIG. 9 shows the flow of the method for cooperatively processing voice in the multi-device scenario shown in FIG. 8. As shown in FIG. 9, the process of the method in which the mobile phone 101c, the tablet computer 102c and the smart TV 103c cooperatively process voice includes:
步骤901-步骤905,与上述步骤401-步骤405类似,相同之处不作赘述。Steps 901-905 are similar to the above-mentioned steps 401-405, and the similarities are not repeated.
区别仅在于执行主体有变化,多设备场景13中通过无线网络互连的电子设备由手机101a、平板电脑102a和智能电视103a变为手机101c、平板电脑102c、智能电视103c。其中,步骤903中协同应答选举得到的应答设备是智能电视103c,步骤904中协同拾音举得到的拾音设备也是智能电视103c,即拾音设备和应答设备相同。The only difference is that the execution subject changes. In the multi-device scenario 13, the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a, and smart TV 103a to mobile phone 101c, tablet computer 102c, and smart TV 103c. Wherein, the answering device obtained by the cooperative answering election in step 903 is the smart TV 103c, and the sound pickup device obtained by the cooperative voice pickup in step 904 is also the smart TV 103c, that is, the sound pickup device and the answering device are the same.
Specifically, multi-device scenario 13 additionally takes into account the influence of internal noise on the devices' sound pickup effect. When the smart TV 103c is playing audio, the smart TV 103c, which has a relatively high voice signal-to-noise ratio and noise reduction capability, can be elected as the sound pickup device according to strategies b1) and b2) of the second type of sound pickup election strategy in the embodiments above. For example, in multi-device scenario 13 the smart TV 103c has internal noise reduction capability or AEC in effect, whereas the mobile phone 101c and the tablet computer 102c have no internal noise reduction capability, have internal noise reduction capability inferior to that of the smart TV 103c, have no AEC capability, or have AEC that is not in effect.
It can be understood that an electronic device with SE processing capability, such as one with internal noise reduction capability or AEC capability, can normally use the noise reduction information of the audio it is playing to eliminate the influence of that internal noise (that is, the played audio) on its sound pickup effect, and thereby pick up voice data of better quality.
此外,本实施例中,多设备协同选举出应答设备和拾音设备之后,还可以由应答设备查询出正在外放音频的内部噪音设备,使得内部噪音设备共享其降噪信息。In addition, in this embodiment, after multiple devices cooperate to select an answering device and a sound pickup device, the answering device can also query the internal noise device that is broadcasting audio, so that the internal noise device can share its noise reduction information.
Step 906: The smart TV 103c determines, from among the mobile phone 101c, the tablet computer 102c and the smart TV 103c, that the smart TV 103c is the device playing audio, which, as the internal noise device, provides noise reduction information.
The smart TV 103c determines which electronic device is playing audio by querying information such as each device's speaker occupancy state or audio/video software state (for example, whether the audio/video software is open, and the device's volume). For example, if the smart TV 103c finds that its own speaker is occupied, that its volume is high (for example, more than 60% of the maximum volume), or that its audio/video software is open, it determines that the smart TV 103c itself is playing audio and will share noise reduction information.
具体地,手机101c和平板电脑102c可以通过无线网络向智能电视103c上报其是否处于外放音频状态的信息,如上报扬声器占用状态、音量、和/或音频/视频软件状态的信息。Specifically, the mobile phone 101c and the tablet computer 102c can report to the smart TV 103c through the wireless network information about whether they are in an external audio state, such as reporting information on speaker occupancy status, volume, and/or audio/video software status.
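A hypothetical sketch of this query: each device reports its speaker occupancy, relative volume and media-application state, and the answering device flags the devices that appear to be playing audio. The 60% volume threshold mirrors the example above; the field names and the exact combination of conditions are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlaybackStatus:
    device_id: str
    speaker_in_use: bool
    volume_ratio: float          # current volume / maximum volume
    media_app_open: bool

def internal_noise_devices(reports: List[PlaybackStatus],
                           volume_threshold: float = 0.6) -> List[str]:
    """Return the devices that appear to be playing audio and should share noise reduction info."""
    return [
        r.device_id
        for r in reports
        if r.speaker_in_use and (r.volume_ratio >= volume_threshold or r.media_app_open)
    ]
```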
其中,在一些实施例中,智能电视103c拾取用户说出的语音指令所对应的语音数据的同时,还在持续通过扬声器外放音频。Wherein, in some embodiments, the smart TV 103c continues to play audio through the speaker while picking up the voice data corresponding to the voice command spoken by the user.
可以理解,智能电视103c作为应答设备和拾音设备,查询出自身为内部噪声设备之后,可以对后续拾取的语音指令对应的语音数据进行降噪处理。It can be understood that, as an answering device and a sound pickup device, the smart TV 103c can perform noise reduction processing on the voice data corresponding to the subsequently picked up voice command after inquiring that it is an internal noise device.
步骤907、智能电视103c根据降噪信息对拾音得到的第二语音数据进行降噪处理。Step 907: The smart TV 103c performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information.
步骤908、智能电视103c识别经过降噪处理后的第二语音数据。Step 908: The smart TV 103c identifies the second voice data after noise reduction processing.
步骤909、智能电视103c根据识别结果响应用户的语音指令或者控制其他电子设备响应用户的语音指令。Step 909, the smart TV 103c responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
In addition, steps 908 and 909 above are similar to steps 406 to 408 above, the only difference being that the voice data recognized by the answering device (the smart TV 103c) is voice data that has undergone noise reduction processing using the noise reduction information above. Specifically, steps 906 and 907 are newly added: the smart TV 103c, acting as the answering device, identifies the internal noise device, namely the smart TV 103c that is playing audio, which provides noise reduction information as the internal noise device. That noise reduction information enables the sound pickup device to perform noise reduction processing on the voice data corresponding to subsequently picked-up voice. Clearly, in this scenario the sound pickup device and the internal noise device are the same device.
It can be understood that an electronic device with internal noise reduction capability (that is, noise reduction capability) or with AEC in effect can introduce the audio data of the audio it is playing into the noise reduction process, and alleviate the interference by suppressing the internal noise generated by its own playback. That is, the internal noise device derives noise reduction information from the audio data of the played audio, such as the audio data itself (that is, the internal noise information) or the voice activity detection (Voice Activity Detection, VAD) information corresponding to that audio (also called silence suppression information).
An electronic device (such as the smart TV 103c) can provide the noise reduction information of the audio it is playing and use that noise reduction information to perform noise reduction on the internal noise, thereby eliminating the influence of the internal noise on other voice data (such as the picked-up voice data of the user) and improving the quality of the picked-up voice data.
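The application does not specify a particular noise reduction or AEC algorithm. As one standard technique that fits this description, a normalized-LMS echo canceller can use the played audio as a reference signal and subtract its estimated echo from the microphone signal; the sketch below is illustrative only and assumes the reference covers at least the same time span as the microphone signal.

```python
from typing import List

def nlms_echo_cancel(mic: List[float], ref: List[float],
                     taps: int = 64, mu: float = 0.5, eps: float = 1e-6) -> List[float]:
    """Subtract an adaptively filtered copy of the playback reference from the mic signal.

    mic: samples picked up by the microphone (user speech plus the device's playback echo).
    ref: samples of the audio the device itself is playing (the noise reduction reference).
    """
    w = [0.0] * taps                  # adaptive FIR estimate of the echo path
    cleaned = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, zero-padded at the start.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_estimate    # microphone sample with the estimated echo removed
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        cleaned.append(e)
    return cleaned
```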
For example, in multi-device scenario 13, for the voice command "What's the weather like in Beijing tomorrow?" spoken by the user after the wake-up word, if the smart TV 103c picked up the second voice data directly, the influence of the played audio could cause that second voice data to be recognized as, for example, "How will the day be in Beijing tomorrow?", that is, the user's actual voice command "What's the weather like in Beijing tomorrow?" would not be recognized accurately. In this case, the smart TV 103c can use its internal noise reduction information to eliminate the influence of the played audio, so that the second voice data after noise reduction is of higher quality and the accurate recognition result "What's the weather like in Beijing tomorrow?" is subsequently obtained.
步骤910,与上述步骤409类似,此处不再赘述。Step 910 is similar to the above-mentioned step 409 and will not be repeated here.
In addition, in some other embodiments, the answering device may obtain the noise reduction information for the audio played by the internal noise device, obtain the to-be-recognized voice picked up directly by the sound pickup device (that is, a to-be-recognized voice from which the internal noise of the played audio has not been removed), and then perform the step of noise reduction processing on the obtained to-be-recognized voice according to the obtained noise reduction information.
It can be understood that performing noise reduction on the sound pickup process of the sound pickup device using the noise reduction information of the electronic device playing audio, such as the audio data of the played audio itself and/or the VAD information corresponding to that audio, can alleviate the influence of the internal noise of the audio-playing device on the voice assistant's sound pickup effect in the multi-device scenario, and guarantee the multi-device-based sound pickup effect of the voice assistant, which helps guarantee the voice assistant's speech recognition accuracy. This in turn improves the user experience during speech recognition and the environmental robustness of speech recognition in multi-device scenarios.
Embodiment 5
When internal noise is present in a multi-device scenario, in order to avoid its influence on the sound pickup effect of cooperative sound pickup, not only can the electronic device playing audio be used as the sound pickup device, but the electronic device playing audio can also share the noise reduction information for its internal noise with another electronic device acting as the sound pickup device, so that the sound pickup device can, across devices, use that noise reduction information to eliminate the influence of the internal noise on its sound pickup effect.
FIG. 10 shows another multi-device-based voice processing scenario under internal noise interference. In this scenario (denoted multi-device scenario 14), a mobile phone 101d and a tablet computer 102d are interconnected through a wireless network and deployed 0.3 m and 0.6 m away from the user, respectively. The mobile phone 101d is held by the user, while the tablet computer 102d is lying idle on a desktop. The tablet computer 102d is playing audio and has internal noise reduction (that is, noise reduction capability) or AEC capability. Therefore, in this scenario the main consideration is the influence of the internal noise of the tablet computer 102d on the sound pickup effect of the cooperative sound pickup in the multi-device scenario.
图11是图10示出的多设备场景具体的协同处理语音的方法的流程,包括:FIG. 11 is a flowchart of a specific method for collaboratively processing speech in the multi-device scenario shown in FIG. 10 , including:
步骤1101-步骤1102,与上述步骤401-步骤402类似,相同之处不作赘述。区别在于,多设备场景14中通过无线网络互连的电子设备由手机101c、平板电脑102c、智能电视103c变为手机101d和平板电脑102d。 Steps 1101 to 1102 are similar to the above-mentioned steps 401 to 402, and the similarities are not repeated. The difference is that the electronic devices interconnected through the wireless network in the multi-device scenario 14 are changed from a mobile phone 101c, a tablet computer 102c, and a smart TV 103c to a mobile phone 101d and a tablet computer 102d.
步骤1103:手机101d和平板电脑102d选举出手机101d作为应答设备及拾音设备。Step 1103: The mobile phone 101d and the tablet computer 102d elect the mobile phone 101d as the answering device and the sound pickup device.
步骤1104:手机101d拾取用户说出的语音指令所对应的第二语音数据。Step 1104: The mobile phone 101d picks up the second voice data corresponding to the voice command spoken by the user.
Steps 1103 and 1104 above are similar to steps 403 and 404 above. The difference is that, after the answering device is obtained through the cooperative response election in step 1103, the answering device can be directly decided to be the sound pickup device, without performing the step of electing the sound pickup device according to a sound pickup election strategy as in the embodiments above. That is, the answering device and the sound pickup device are the same device, here the mobile phone 101d.
Step 1105: The mobile phone 101d determines, from between the mobile phone 101d and the tablet computer 102d, that the tablet computer 102d is the device playing audio, which, as the internal noise device, will share noise reduction information.
上述步骤1105与步骤906类似,区别在于,步骤1105中应答设备查询出共享降噪信息的内部噪声设备(平板电脑102d)与拾音设备(手机101d)不同。因此,本实施例中,增加了步骤1106来实现内部噪声设备向拾音设备(即手机101d)共享降噪信息。The above step 1105 is similar to step 906, the difference is that in step 1105 the answering device finds out that the internal noise device (tablet computer 102d) sharing noise reduction information is different from the sound pickup device (mobile phone 101d). Therefore, in this embodiment, step 1106 is added to realize that the internal noise device shares the noise reduction information with the sound pickup device (ie, the mobile phone 101d).
In addition, it can be understood that in some embodiments, after the tablet computer 102d, acting as the answering device, determines that the mobile phone 101d is an internal noise device, it may send a noise reduction instruction to the mobile phone 101d, so that, according to that instruction, the mobile phone 101d shares its noise reduction information with the tablet computer 102d acting as the sound pickup device.
步骤1106:平板电脑102d向手机101d发送平板电脑102d的降噪信息。Step 1106: The tablet computer 102d sends the noise reduction information of the tablet computer 102d to the mobile phone 101d.
It can be understood that, by having the tablet computer 102d share its noise reduction information with the mobile phone 101d, the audio data of the played audio itself and/or the VAD information corresponding to that audio can be shared across devices, effectively aggregating the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants.
具体地,平板电脑102d可以通过与手机101d之间的无线网络,将平板电脑102d的降噪信息发送给手机101d。Specifically, the tablet computer 102d can send the noise reduction information of the tablet computer 102d to the mobile phone 101d through the wireless network with the mobile phone 101d.
步骤1107:手机101d根据平板电脑102d的降噪信息,对拾音得到的第二语音数据进行降噪处理。Step 1107: The mobile phone 101d performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information of the tablet computer 102d.
步骤1108:手机101d识别经过降噪处理后的第二语音数据。Step 1108: The mobile phone 101d identifies the second voice data after noise reduction processing.
步骤1109:手机101d根据识别结果响应用户的语音指令或者控制其他电子设备响应用户的语音指令。Step 1109: The mobile phone 101d responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
Steps 1107 to 1109 are similar to steps 907 to 909 above. The difference is that in step 1107 the sound pickup device (the mobile phone 101d) performs noise reduction processing on the voice data corresponding to the voice it picked up itself using the noise reduction information of another device (the tablet computer 102d), thereby realizing cross-device noise reduction.
For example, in multi-device scenario 14, for the voice command "What's the weather like in Beijing tomorrow?" spoken by the user after the wake-up word, when the mobile phone 101d directly picks up the second voice data corresponding to that command, the audio played by the tablet computer 102d degrades the quality of the second voice data, so that it might be recognized as, for example, "How will the day be in Beijing tomorrow?", which differs from the user's actual voice command "What's the weather like in Beijing tomorrow?". In other words, the poor quality of the second voice data picked up by the mobile phone 101d makes the subsequent recognition result inaccurate. In this case, the mobile phone 101d can perform noise reduction processing on the picked-up second voice data using the noise reduction information shared by the tablet computer 102d, eliminating the influence of the tablet computer 102d's playback on the mobile phone 101d's sound pickup effect. As a result, the second voice data after noise reduction is of higher quality and is subsequently accurately recognized as "What's the weather like in Beijing tomorrow?".
It can be understood that sharing the audio data of the played audio and/or the VAD information corresponding to that audio across devices, to assist the sound pickup device in performing noise reduction during sound pickup, can effectively aggregate the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants, and further improve the accuracy of speech recognition in multi-device scenarios.
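As a cross-device illustration of steps 1106 and 1107, the sketch below models the shared noise reduction information as per-frame VAD flags for the playing device's own audio, which the pickup device then uses to attenuate the affected frames; the message format, frame size and attenuation factor are assumptions, and a real implementation might instead use the shared audio data itself as an echo-cancellation reference as sketched earlier.

```python
from dataclasses import dataclass
from typing import List

FRAME = 160  # 10 ms frames at an assumed 16 kHz sampling rate

@dataclass
class NoiseReductionInfo:
    """Message a playback device (e.g. tablet 102d) might share with the pickup device."""
    device_id: str
    playback_vad: List[bool]   # per-frame flags: True where the shared playback is active

def apply_shared_vad(mic_samples: List[float], info: NoiseReductionInfo,
                     attenuation: float = 0.2) -> List[float]:
    """Attenuate microphone frames that coincide with the other device's active playback."""
    cleaned = list(mic_samples)
    for i, active in enumerate(info.playback_vad):
        if not active:
            continue
        start, end = i * FRAME, min((i + 1) * FRAME, len(cleaned))
        for n in range(start, end):
            cleaned[n] *= attenuation
    return cleaned
```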
Thus, in the embodiments of the present application, the sound pickup device elected among the multiple devices may possess one or more favorable factors, such as being closest to the user, being farthest from external noise sources, and having internal noise reduction capability. In this way, the influence of device deployment positions, internal noise interference or external noise interference on the voice assistant's sound pickup effect and speech recognition accuracy in the multi-device scenario can be alleviated, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
FIG. 12 shows a schematic structural diagram of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in the embodiments of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine some components, split some components, or use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent components, or may be integrated in one or more processors.
For example, the processor 110 may be used to detect whether the electronic device 100 has picked up voice data corresponding to a wake-up word or a voice command spoken by the user, and to obtain the sound information of the voice data, device status information, microphone module information, and so on. In addition, the processor 110 may perform the aforementioned actions such as answering device election, sound pickup device election, or internal noise device query according to the information of each electronic device (such as sound pickup election information or answering election information).
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer pattern between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the electronic device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU. For example, the NPU can support the electronic device 100 in recognizing, through the voice assistant, the voice data obtained by sound pickup.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present invention are merely schematic illustrations and do not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide solutions for wireless communication applied to the electronic device 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 may also amplify a signal modulated by the modem processor and convert it into an electromagnetic wave for radiation through the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 and at least some modules of the processor 110 may be disposed in the same component.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more components integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive a to-be-sent signal from the processor 110, perform frequency modulation and amplification on it, and convert it into an electromagnetic wave for radiation through the antenna 2.
For example, the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, and other such modules may be used to support the electronic device 100 in sending the sound information of voice data, device status information, and the like to other electronic devices in a multi-device scenario, and specifically in sending the above-mentioned answering election information, sound pickup election information, noise reduction information, and so on.
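Purely to illustrate the kind of message these communication modules might carry, the sketch below broadcasts one frame of externally played audio together with its VAD flags over a local network as JSON over UDP. The transport, the port number, and the field names (device_id, vad, audio_ref) are assumptions for this example; the specification does not define a message format.

```python
import base64
import json
import socket

BROADCAST_ADDR = ("255.255.255.255", 50007)   # hypothetical port for this example

def share_noise_reduction_info(device_id, audio_frame, vad_flags):
    """Broadcast a frame of externally played audio plus its VAD flags so that
    the elected pickup device can use them as a noise-reduction reference.

    audio_frame: raw PCM bytes of the frame currently being played
    vad_flags:   iterable of 0/1 voice-activity flags for sub-frames of the frame
    """
    msg = {
        "device_id": device_id,
        "vad": list(vad_flags),
        "audio_ref": base64.b64encode(audio_frame).decode("ascii"),
    }
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(json.dumps(msg).encode("utf-8"), BROADCAST_ADDR)
```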
The electronic device 100 implements a display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. For example, the display screen 194 may be used to support the electronic device 100 in displaying an answering interface in response to the user's voice command, and the answering interface may include information such as answering text.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example saving files such as music and videos in the external memory card. For example, the external memory card may be used to support the electronic device 100 in storing the above-mentioned sound pickup election information, answering election information, noise reduction information, and the like.
The internal memory 121 may be used to store computer-executable program code, and the executable program code includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like. The data storage area may store data (such as audio data and a phone book) created during use of the electronic device 100. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121 and/or instructions stored in a memory disposed in the processor. For example, the internal memory 121 may be used to support the electronic device 100 in storing the above-mentioned sound pickup election information, answering election information, noise reduction information, and the like.
The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal, for example converting the user's voice received by the electronic device 100 into a digital audio signal (that is, the voice data corresponding to the user's voice), or converting audio generated by the voice assistant using TTS into an answering voice. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The electronic device 100 may listen to music or a hands-free call through the speaker 170A, or may play, based on the voice assistant, the answering voice corresponding to the user's voice command, for example the answering voice "I'm here" for the wake-up word, or the answering voice "It will be sunny in Beijing tomorrow" for the voice command "What's the weather like in Beijing tomorrow?".
The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone (that is, the microphone module) 170C, also called a "mic" or "mouthpiece", is used to convert a sound signal into an electrical signal, for example converting a wake-up word or a voice command spoken by the user into an electrical signal (that is, the corresponding voice data). When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement a directional recording function, and so on.
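As a sketch of how several microphones can be combined for noise reduction and directional pickup as described above, the following delay-and-sum beamformer aligns and averages the microphone signals toward an assumed talker direction. The far-field assumption, the sampling rate, and the geometry handling are illustrative only; the specification does not mandate any particular multi-microphone algorithm.

```python
import numpy as np

def delay_and_sum(frames, mic_positions, direction, fs=16000, c=343.0):
    """Combine multi-microphone frames by steering toward `direction`.

    frames:        (num_mics, num_samples) array, one row per microphone
    mic_positions: (num_mics, 2) array of microphone coordinates in metres
    direction:     2-D vector pointing from the array toward the talker
    Returns a beamformed mono signal of length num_samples.
    """
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    num_mics, num_samples = frames.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Under a far-field (plane-wave) assumption, a microphone closer to the
        # talker hears the wavefront earlier by (position . direction) / c seconds.
        lead_s = float(np.asarray(mic_positions[m]) @ d) / c
        shift = int(round(lead_s * fs))
        # Delay the earlier channels so all channels line up, then average.
        out += np.roll(frames[m], shift)      # wrap-around ignored in this sketch
    return out / num_mics
```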
The embodiments of the mechanisms disclosed in the present application may be implemented in hardware, software, firmware, or a combination of these implementation methods. The embodiments of the present application may be implemented as a computer program or program code executed on a programmable system, and the programmable system includes at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described in the present application and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of the present application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly language or machine language when needed. In fact, the mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (for example, computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (for example, a computer), including but not limited to floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or a tangible machine-readable memory used to transmit information over the Internet by means of electrical, optical, acoustic, or other forms of propagated signals (for example, carrier waves, infrared signals, or digital signals). Thus, machine-readable media include any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (for example, a computer).
In the drawings, some structural or method features may be shown in specific arrangements and/or orders. It should be understood, however, that such specific arrangements and/or orders may not be required. Rather, in some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative drawings. In addition, the inclusion of structural or method features in a particular drawing does not imply that such features are required in all embodiments; in some embodiments these features may not be included or may be combined with other features.
It should be noted that each unit/module mentioned in the device embodiments of the present application is a logical unit/module. Physically, a logical unit/module may be a physical unit/module, may be part of a physical unit/module, or may be implemented as a combination of multiple physical units/modules. The physical implementation of these logical units/modules themselves is not what matters most; the combination of the functions implemented by these logical units/modules is what is key to solving the technical problem raised by the present application. In addition, in order to highlight the innovative part of the present application, the above device embodiments of the present application do not introduce units/modules that are not closely related to solving the technical problem raised by the present application, which does not mean that other units/modules do not exist in the above device embodiments.
It should be noted that, in the examples and the specification of this patent, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that any such actual relationship or order exists between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. Although the present application has been illustrated and described with reference to certain preferred embodiments thereof, a person of ordinary skill in the art should understand that various changes in form and detail may be made therein without departing from the spirit and scope of the present application.

Claims (25)

  1. A multi-device-based voice processing method, characterized in that the method comprises:
    a first electronic device among a plurality of electronic devices picks up sound to obtain a first to-be-recognized voice;
    the first electronic device receives, from a second electronic device that plays audio externally among the plurality of electronic devices, audio information related to the audio played externally by the second electronic device;
    the first electronic device performs noise reduction processing on the first to-be-recognized voice obtained by sound pickup according to the received audio information to obtain a second to-be-recognized voice.
  2. The method according to claim 1, characterized in that the audio information comprises at least one of the following: audio data of the externally played audio, and voice activity detection (VAD) information corresponding to the audio.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    the first electronic device sends the second to-be-recognized voice to a third electronic device used for voice recognition among the plurality of electronic devices; or
    the first electronic device recognizes the second to-be-recognized voice.
  4. The method according to claim 3, characterized in that, before the first electronic device among the plurality of electronic devices picks up sound to obtain the first to-be-recognized voice, the method further comprises:
    the first electronic device sends sound pickup election information of the first electronic device to the third electronic device, wherein the sound pickup election information of the first electronic device is used to indicate the sound pickup situation of the first electronic device;
    the first electronic device is an electronic device for sound pickup elected by the third electronic device from the plurality of electronic devices based on the obtained sound pickup election information of the plurality of electronic devices.
  5. The method according to claim 4, characterized in that the method further comprises:
    the first electronic device receives a sound pickup instruction sent by the third electronic device, wherein the sound pickup instruction is used to instruct the first electronic device to pick up sound and to send the noise-reduced to-be-recognized voice to the third electronic device.
  6. The method according to claim 4 or 5, characterized in that the sound pickup election information comprises at least one of the following: echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to a wake-up word obtained by sound pickup, and voice information corresponding to a voice command obtained by sound pickup;
    wherein the voice command is obtained by sound pickup after the wake-up word is obtained by sound pickup; and the device status information comprises at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and scenario mode information.
  7. A multi-device-based voice processing method, characterized in that the method comprises:
    a second electronic device among a plurality of electronic devices plays audio externally;
    the second electronic device sends audio information related to the audio to a first electronic device used for sound pickup among the plurality of electronic devices, wherein
    the audio information can be used by the first electronic device to perform noise reduction processing on to-be-recognized audio obtained by sound pickup by the first electronic device.
  8. The method according to claim 7, characterized in that the audio information comprises at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio.
  9. The method according to claim 7 or 8, characterized in that the method further comprises:
    the second electronic device receives a sharing instruction from a third electronic device used for voice recognition among the plurality of electronic devices; or
    the second electronic device receives a sharing instruction from the first electronic device;
    wherein the sharing instruction is used to instruct the second electronic device to send the audio information to the first electronic device.
  10. The method according to claim 9, characterized in that, before the second electronic device sends the audio information related to the audio to the first electronic device used for sound pickup among the plurality of electronic devices, the method further comprises:
    the second electronic device sends sound pickup election information of the second electronic device to the third electronic device, wherein the sound pickup election information of the second electronic device is used to indicate the sound pickup situation of the second electronic device;
    the first electronic device is an electronic device for sound pickup elected by the third electronic device from the plurality of electronic devices based on the obtained sound pickup election information of the plurality of electronic devices.
  11. A multi-device-based voice processing method, characterized in that the method comprises:
    a third electronic device among a plurality of electronic devices detects that there is, among the plurality of electronic devices, a second electronic device that is playing audio externally;
    in a case where the second electronic device is different from the third electronic device, the third electronic device sends a sharing instruction to the second electronic device, wherein the sharing instruction is used to instruct the second electronic device to send audio information related to the audio played externally by the second electronic device to a first electronic device used for sound pickup among the plurality of devices;
    in a case where the second electronic device is the same as the third electronic device, the third electronic device sends the audio information to the first electronic device;
    wherein the audio information can be used by the first electronic device to perform noise reduction processing on a first to-be-recognized voice obtained by sound pickup by the first electronic device to obtain a second to-be-recognized voice.
  12. The method according to claim 11, characterized in that the audio information comprises at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio.
  13. The method according to claim 11 or 12, characterized in that the first electronic device is different from the third electronic device, and the method further comprises:
    the third electronic device obtains, from the first electronic device, the second to-be-recognized voice obtained by sound pickup by the first electronic device;
    the third electronic device recognizes the second to-be-recognized voice.
  14. The method according to claim 13, characterized in that, before the third electronic device sends the sharing instruction to the second electronic device, the method further comprises:
    the third electronic device obtains sound pickup election information of the plurality of electronic devices, wherein the sound pickup election information of the plurality of electronic devices is used to indicate the sound pickup situations of the plurality of electronic devices;
    the third electronic device elects, based on the sound pickup election information of the plurality of devices, an electronic device from the plurality of electronic devices as the first electronic device.
  15. The method according to claim 14, characterized in that the method further comprises:
    the third electronic device sends a sound pickup instruction to the first electronic device, wherein the sound pickup instruction is used to instruct the first electronic device to pick up sound and to send the second to-be-recognized voice obtained by sound pickup to the third electronic device.
  16. The method according to claim 14 or 15, characterized in that the sound pickup election information comprises at least one of the following: echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to a wake-up word obtained by sound pickup, and voice information corresponding to a voice command obtained by sound pickup;
    wherein the voice command is obtained by sound pickup after the wake-up word is obtained by sound pickup; and the device status information comprises at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and scenario mode information.
  17. The method according to claim 16, characterized in that the third electronic device electing, based on the sound pickup election information of the plurality of electronic devices, at least one electronic device from the plurality of electronic devices as the first electronic device comprises at least one of the following:
    in a case where the third electronic device is in a preset network state, the third electronic device determines the third electronic device as the first electronic device;
    in a case where the third electronic device is connected to a headset, the third electronic device determines the third electronic device as the first electronic device;
    the third electronic device determines, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices that are in a preset scenario mode.
  18. The method according to claim 17, characterized in that the third electronic device electing, based on the sound pickup election information of the plurality of electronic devices, at least one electronic device from the plurality of electronic devices as the first electronic device comprises at least one of the following:
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices on which AEC is in effect;
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices whose noise reduction capability satisfies a predetermined noise reduction condition;
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices whose distance to the user is less than a first predetermined distance;
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices whose distance to an external noise source is greater than a second predetermined distance.
  19. The method according to claim 17, characterized in that the preset network state comprises at least one of the following: a network whose communication rate is less than or equal to a predetermined rate, and a network whose disconnection frequency is greater than or equal to a predetermined frequency; and the preset scenario mode comprises at least one of the following: subway mode, airplane mode, driving mode, and travel mode.
  20. The method according to any one of claims 11 to 19, characterized in that the third electronic device elects the first electronic device from the plurality of electronic devices by using a neural network algorithm or a decision tree algorithm.
  21. A voice processing system, characterized in that the system comprises a first electronic device and a second electronic device;
    wherein the second electronic device, in a case where it plays audio externally, sends audio information related to the audio to the first electronic device used for sound pickup;
    the first electronic device is configured to pick up sound to obtain a first to-be-recognized voice, and to perform noise reduction processing on the first to-be-recognized voice obtained by sound pickup according to the audio information received from the second electronic device to obtain a second to-be-recognized voice.
  22. The system according to claim 21, characterized in that the system further comprises a third electronic device;
    the third electronic device is configured to obtain sound pickup election information of a plurality of electronic devices, wherein the sound pickup election information of the plurality of electronic devices is used to indicate the sound pickup situations of the plurality of electronic devices, and to elect, based on the sound pickup election information of the plurality of electronic devices, at least one electronic device from the plurality of electronic devices as the first electronic device used for sound pickup, wherein the first electronic device, the second electronic device, and the third electronic device are all electronic devices among the plurality of electronic devices, and the third electronic device is the same as or different from the first electronic device;
    the first electronic device is further configured to send the second to-be-recognized voice to the third electronic device; and
    the third electronic device is further configured to recognize the second to-be-recognized voice obtained from the first electronic device.
  23. A computer-readable storage medium, characterized in that instructions are stored on the storage medium, and when the instructions are executed on a computer, the instructions cause the computer to execute the multi-device-based voice processing method according to any one of claims 1 to 20.
  24. An electronic device, characterized by comprising: one or more processors; and one or more memories, wherein the one or more memories store one or more programs, and when the one or more programs are executed by the one or more processors, the electronic device is caused to execute the multi-device-based voice processing method according to any one of claims 1 to 20.
  25. An electronic device, characterized by comprising: a processor, a memory, a communication interface, and a communication bus, wherein the memory is configured to store at least one instruction, the processor, the memory, and the communication interface are connected through the communication bus, and the processor executes the at least one instruction stored in the memory, so that the electronic device executes the multi-device-based voice processing method according to any one of claims 1 to 20.
PCT/CN2021/110865 2020-09-11 2021-08-05 Multi-device voice processing method, medium, electronic device, and system WO2022052691A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010955837.7A CN114255763A (en) 2020-09-11 2020-09-11 Voice processing method, medium, electronic device and system based on multiple devices
CN202010955837.7 2020-09-11

Publications (1)

Publication Number Publication Date
WO2022052691A1 true WO2022052691A1 (en) 2022-03-17

Family

ID=80632591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/110865 WO2022052691A1 (en) 2020-09-11 2021-08-05 Multi-device voice processing method, medium, electronic device, and system

Country Status (2)

Country Link
CN (1) CN114255763A (en)
WO (1) WO2022052691A1 (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708864A (en) * 2011-03-28 2012-10-03 德信互动科技(北京)有限公司 Conversation-based household electronic equipment and control method thereof
CN106357497A (en) * 2016-11-10 2017-01-25 北京智能管家科技有限公司 Control system of intelligent home network
US10057125B1 (en) * 2017-04-17 2018-08-21 Essential Products, Inc. Voice-enabled home setup
US20190020493A1 (en) * 2017-07-12 2019-01-17 Universal Electronics Inc. Apparatus, system and method for directing voice input in a controlling device
CN109473095A (en) * 2017-09-08 2019-03-15 北京君林科技股份有限公司 A kind of intelligent home control system and control method
CN107886952A (en) * 2017-11-09 2018-04-06 珠海格力电器股份有限公司 A kind of method, apparatus, system and the electronic equipment of Voice command intelligent appliance
CN108447479A (en) * 2018-02-02 2018-08-24 上海大学 The robot voice control system of noisy work condition environment
CN108665899A (en) * 2018-04-25 2018-10-16 广东思派康电子科技有限公司 A kind of voice interactive system and voice interactive method
CN108766432A (en) * 2018-07-02 2018-11-06 珠海格力电器股份有限公司 A kind of method to cooperate between control household electrical appliances
CN109347710A (en) * 2018-11-07 2019-02-15 四川长虹电器股份有限公司 A kind of system and method for realizing full room interactive voice control smart home

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001890A (en) * 2022-05-31 2022-09-02 四川虹美智能科技有限公司 Intelligent household appliance control method and device based on response-free
CN115001890B (en) * 2022-05-31 2023-10-31 四川虹美智能科技有限公司 Intelligent household appliance control method and device based on response-free

Also Published As

Publication number Publication date
CN114255763A (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN111091828B (en) Voice wake-up method, device and system
US11437021B2 (en) Processing audio signals
CN106782519A (en) A kind of robot
US20180174574A1 (en) Methods and systems for reducing false alarms in keyword detection
CN107464565A (en) A kind of far field voice awakening method and equipment
WO2023284402A1 (en) Audio signal processing method, system, and apparatus, electronic device, and storage medium
CN111696562B (en) Voice wake-up method, device and storage medium
WO2022052691A1 (en) Multi-device voice processing method, medium, electronic device, and system
WO2023004223A1 (en) Noise suppression using tandem networks
KR20200024068A (en) A method, device, and system for selectively using a plurality of voice data reception devices for an intelligent service
US11783809B2 (en) User voice activity detection using dynamic classifier
US20210110838A1 (en) Acoustic aware voice user interface
CN115731923A (en) Command word response method, control equipment and device
US11917386B2 (en) Estimating user location in a system including smart audio devices
CN115424628A (en) Voice processing method and electronic equipment
CN115331672B (en) Device control method, device, electronic device and storage medium
US20210391840A1 (en) Audio gain selection
CN117953872A (en) Voice wakeup model updating method, storage medium, program product and equipment
CN113889084A (en) Audio recognition method and device, electronic equipment and storage medium
CN116564298A (en) Speech recognition method, electronic device, and computer-readable storage medium
CN116978372A (en) Voice interaction method, electronic equipment and storage medium
CN117690423A (en) Man-machine interaction method and related device

Legal Events

Code Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21865747; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21865747; Country of ref document: EP; Kind code of ref document: A1)