WO2022052691A1 - Multi-device voice processing method, medium, electronic device, and system - Google Patents

Multi-device voice processing method, medium, electronic device, and system

Info

Publication number
WO2022052691A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
voice
information
audio
electronic
Application number
PCT/CN2021/110865
Other languages
French (fr)
Chinese (zh)
Inventor
潘邵武
万柯
谷岳
印文帅
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022052691A1

Classifications

    • G PHYSICS
        • G05 CONTROLLING; REGULATING
            • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
                • G05B15/00 Systems controlled by a computer
                    • G05B15/02 Systems controlled by a computer electric
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                • G10L17/00 Speaker identification or verification techniques
                    • G10L17/22 Interactive procedures; Man-machine interfaces
                • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L21/0208 Noise filtering
                            • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
                • H04L12/00 Data switching networks
                    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]

Definitions

  • the present application relates to speech processing technology in the field of artificial intelligence, and in particular, to a multi-device-based speech processing method, medium, electronic device and system.
  • a voice assistant is an application (APP) based on artificial intelligence (AI).
  • Smart devices such as mobile phones receive and recognize voice commands spoken by users through voice assistants, providing users with voice control functions such as interactive dialogue, information query, and device control.
  • With the widespread popularity of smart devices equipped with voice assistants, there are usually multiple such devices in the user's environment (such as the user's home). In this multi-device scenario, if several devices share the same wake-up word, then after the user speaks the wake-up word, the voice assistants of all of those devices will be woken up and will all recognize and respond to the user's subsequent voice commands.
  • To avoid this, multiple devices can cooperate to select, from the devices sharing the same wake-up word, the device closest to the user to wake up its voice assistant, so that this device picks up, recognizes, and responds to the user's voice command. However, if there is strong external noise near the selected device, or if the device has poor sound pickup capability, the accuracy of its voice command recognition result in the automatic speech recognition process is low, and the operation indicated by the voice instruction therefore cannot be performed accurately.
  • Embodiments of the present application provide a multi-device-based voice processing method, medium, electronic device, and system.
  • the sound pickup device selected from the multiple devices may have one or more favorable factors, such as being closest to the user, being farthest from an external noise source, or having an internal noise reduction capability. This alleviates the influence of the deployment location of electronic devices, internal noise interference, or external noise interference on the voice pickup effect and speech recognition accuracy of the voice assistant in multi-device scenarios, and improves the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
  • according to a first aspect, an embodiment of the present application provides a multi-device-based voice processing method. The method includes: a first electronic device among a plurality of electronic devices picks up sound to obtain a first to-be-recognized voice; the first electronic device receives, from a second electronic device that is playing audio among the plurality of electronic devices, audio information related to the audio played by the second electronic device; and the first electronic device performs noise reduction processing on the first to-be-recognized voice according to the audio information to obtain a second to-be-recognized voice.
  • the electronic device used for sound pickup (i.e., the first electronic device) is the sound pickup device described hereinafter, for example an electronic device with a better sound pickup effect selected from the multiple devices; the above-mentioned electronic device that is playing audio (i.e., the second electronic device) is the internal noise device among the multiple devices, and the audio information of the audio played by the second electronic device is the noise reduction information of the internal noise device described hereinafter.
  • because the first electronic device performs noise reduction processing on the first to-be-recognized voice, using the audio information of the audio played by the second electronic device, to obtain the second to-be-recognized voice, the influence of the internal noise of an electronic device that is playing audio on the voice pickup effect of the voice assistant in a multi-device scenario can be alleviated. This ensures the multi-device-based voice pickup effect of the voice assistant, which in turn helps to ensure the speech recognition accuracy of the voice assistant and improves the environmental robustness of speech recognition in multi-device scenarios.
  • the above-mentioned audio information includes at least one of the following: audio data of the played audio, and voice activity detection (VAD) information corresponding to the audio.
  • the audio information can reflect the played audio itself. By performing noise reduction processing on the internal noise generated by the played audio, the influence of this internal noise on other picked-up voice data (such as the voice data corresponding to the user's to-be-recognized voice) can be eliminated, thereby improving the quality of the picked-up voice data.
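As a minimal illustration of how the shared audio information could drive such noise reduction, the sketch below subtracts an adaptively filtered copy of the reported playback audio from the picked-up signal using a normalized least-mean-squares (NLMS) filter. The function name, filter length, and step size are assumptions of this sketch, not details specified by the application, and the reference signal is assumed to already be time-aligned with the microphone signal.

```python
import numpy as np

def nlms_cancel(mic: np.ndarray, reference: np.ndarray,
                taps: int = 256, mu: float = 0.1, eps: float = 1e-6) -> np.ndarray:
    """Subtract an adaptively filtered copy of `reference` (the audio the second device
    reports it is playing) from `mic` (the first device's picked-up signal).
    Assumes `reference` covers at least the same span as `mic`."""
    mic = mic.astype(float)
    reference = reference.astype(float)
    w = np.zeros(taps)                               # adaptive filter weights
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = reference[n - taps:n][::-1]              # most recent reference samples
        echo_est = float(w @ x)                      # estimated internal-noise component
        e = mic[n] - echo_est                        # error signal = noise-reduced sample
        w += (mu / (eps + float(x @ x))) * e * x     # NLMS weight update
        out[n] = e
    return out
```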
  • the above-mentioned method further includes: the first electronic device sends the second to-be-recognized voice to a third electronic device used for recognizing voice among the plurality of electronic devices; or, the first electronic device recognizes the second to-be-recognized voice.
  • the electronic device used for recognizing voice (i.e., the third electronic device) is the answering device described hereinafter. The electronic device used for recognizing voice and the electronic device used for picking up sound may be the same or different; that is, the first electronic device (or the microphone module of the first electronic device) may be used as a peripheral of the third electronic device to pick up the user's voice command, so that the peripheral resources of multiple electronic devices equipped with a microphone module and a voice assistant can be effectively aggregated.
  • the above-mentioned method further includes: the first electronic device sends the sound pickup election information of the first electronic device to the third electronic device, where the sound pickup election information of the first electronic device is used to indicate the sound pickup situation of the first electronic device; the first electronic device is an electronic device that the third electronic device elects for sound pickup from the plurality of electronic devices based on the acquired sound pickup election information of the multiple electronic devices.
  • in this way, after the user speaks a voice command, the user does not need to operate a specific electronic device to pick up the voice command to be recognized (such as the voice command corresponding to the second voice data below); instead, the answering device (i.e., the third electronic device) automatically uses the pickup device (i.e., the first electronic device) to pick it up.
  • the above-mentioned method further includes: the first electronic device receives a sound pickup instruction (that is, the pickup instruction described hereinafter) sent by the third electronic device, where the sound pickup instruction is used to instruct the first electronic device to pick up sound and send the noise-reduced to-be-recognized voice to the third electronic device.
  • in this way, the first electronic device knows that it needs to send the to-be-recognized voice obtained by pickup (such as the second to-be-recognized voice) to the third electronic device, and that it will not itself recognize the to-be-recognized voice or perform other subsequent processing.
  • the above-mentioned sound pickup election information includes at least one of the following: acoustic echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to the wake-up word obtained by pickup, and voice information corresponding to the voice command obtained by pickup, where the voice command is obtained by pickup after the wake-up word is obtained. The device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and profile information. It can be understood that the different items in the sound pickup election information represent different factors that affect the sound pickup effect of an electronic device.
  • therefore, the embodiment of the present application can comprehensively consider the different factors affecting the sound pickup effect of the electronic devices to elect the sound pickup device, for example electing the electronic device with the best sound pickup effect for sound pickup, that is, as the pickup device among the multiple devices (an illustrative election-information payload is sketched below).
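For concreteness, the sound pickup election information listed above could be carried in a small structured payload such as the one below; the field names and types are illustrative assumptions, not a message format defined by the application.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PickupElectionInfo:
    """Illustrative payload a candidate device might send to the answering device."""
    device_id: str
    aec_capable: bool                       # echo cancellation (AEC) capability information
    aec_in_effect: bool                     # whether AEC is currently in effect
    mic_module: str                         # microphone module info, e.g. "single", "linear_array", "ring_array"
    mic_cutoff_hz: int                      # cutoff frequency of the microphone module
    network_ok: bool                        # device status: network connection status
    headset_connected: bool                 # device status: headset connection status
    mic_occupied: bool                      # device status: microphone occupancy status
    scene_mode: str                         # device status: profile / scene mode information
    wake_word_snr_db: float                 # voice information of the picked-up wake-up word
    command_snr_db: Optional[float] = None  # voice information of a picked-up voice command, if any
```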
  • according to a second aspect, an embodiment of the present application provides a multi-device-based voice processing method. The method includes: a second electronic device among a plurality of electronic devices plays audio; the second electronic device sends, to a first electronic device used for sound pickup among the plurality of electronic devices, audio information related to the audio, where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-recognized voice picked up by the first electronic device. Because the second electronic device that is playing audio can provide the audio information of that audio, the first electronic device used for sound pickup can perform noise reduction processing on the first to-be-recognized voice obtained by pickup according to the audio information, thereby eliminating the influence of the internal noise generated by the audio on the pickup and improving the sound pickup effect of the first electronic device, that is, improving the quality of the picked-up voice data (i.e., the voice data of the second to-be-recognized voice). Therefore, the influence of the internal noise of an electronic device that is playing audio on the sound pickup effect of the voice assistant in a multi-device scenario can be alleviated, ensuring the multi-device-based sound pickup effect of the voice assistant, which in turn helps to ensure the speech recognition accuracy of the voice assistant and improves the environmental robustness of speech recognition in multi-device scenarios.
  • the audio information includes at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio (a minimal VAD sketch is given below).
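A playing device could derive the VAD information it shares with a simple frame-energy detector like the sketch below; the frame size, sample rate, and threshold are arbitrary assumptions for illustration.

```python
import numpy as np

def frame_vad(played_audio: np.ndarray, frame_len: int = 320, threshold: float = 1e-3) -> list:
    """Mark each 20 ms frame (assuming 16 kHz audio) of the played audio as active or silent,
    so the pickup device knows which frames actually contribute internal noise."""
    flags = []
    for start in range(0, len(played_audio) - frame_len + 1, frame_len):
        frame = played_audio[start:start + frame_len].astype(float)
        flags.append(bool(np.mean(frame ** 2) > threshold))  # frame energy vs. threshold
    return flags
```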
  • the above-mentioned method further includes: the second electronic device receives a sharing instruction (i.e., the noise reduction instruction described hereinafter) from a third electronic device used for recognizing voice among the plurality of electronic devices; or the second electronic device receives a sharing instruction from the first electronic device; where the sharing instruction is used to instruct the second electronic device to send the audio information to the first electronic device. In this way, the electronic device that sends the sharing instruction (such as the first electronic device or the third electronic device) can monitor whether the second electronic device is playing audio, and send the sharing instruction to the second electronic device only when the second electronic device is playing audio.
  • the method further includes: the second electronic device sends the sound pickup election information of the second electronic device to the third electronic device, where the sound pickup election information of the second electronic device is used to indicate the sound pickup situation of the second electronic device; the first electronic device is an electronic device that the third electronic device elects for sound pickup from the plurality of electronic devices based on the acquired sound pickup election information of the multiple electronic devices.
  • in this way, the third electronic device, as the answering device described hereinafter, can elect the electronic device that picks up the voice command with the best audio quality (that is, the electronic device with the best sound pickup effect) as the sound pickup device (such as the first electronic device), to support the answering device in completing the voice interaction process with the user through the voice assistant. For example, the pickup device may be an electronic device that is closest to the user and has better SE processing capability.
  • according to a third aspect, an embodiment of the present application provides a multi-device-based voice processing method. The method includes: a third electronic device among a plurality of electronic devices detects that there is a second electronic device among the plurality of electronic devices that is playing audio; in the case that the second electronic device is different from the third electronic device, the third electronic device sends a sharing instruction to the second electronic device, where the sharing instruction is used to instruct the second electronic device to send, to a first electronic device used for sound pickup among the plurality of electronic devices, audio information related to the audio played by the second electronic device; in the case that the second electronic device is the same as the third electronic device, the third electronic device sends the audio information to the first electronic device. The audio information can be used by the first electronic device to perform noise reduction processing on the first to-be-recognized voice picked up by the first electronic device to obtain a second to-be-recognized voice. Specifically, because the second electronic device that is playing audio can, under the instruction of the third electronic device, provide the audio information of that audio, the first electronic device used for sound pickup can perform noise reduction processing on the first to-be-recognized voice obtained by pickup according to the audio information, thereby eliminating the influence of the internal noise generated by the audio on the pickup and improving the sound pickup effect of the first electronic device, that is, improving the quality of the picked-up voice data (i.e., the voice data of the second to-be-recognized voice). Therefore, the influence of the internal noise of an electronic device that is playing audio on the sound pickup effect of the voice assistant in a multi-device scenario can be alleviated, ensuring the multi-device-based sound pickup effect of the voice assistant, which in turn helps to ensure the speech recognition accuracy of the voice assistant and improves the environmental robustness of speech recognition in multi-device scenarios.
  • the audio information includes at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio.
  • in a case where the first electronic device is different from the third electronic device, the above method further includes: the third electronic device obtains, from the first electronic device, the second to-be-recognized voice obtained by the first electronic device through pickup; and the third electronic device recognizes the second to-be-recognized voice. In this way, even if the answering device elected in the multi-device scenario (such as the third electronic device closest to the user) has a poor sound pickup effect, or there is noise generated by an electronic device that is playing audio, multiple devices can cooperate to pick up and recognize voice data with better audio quality, without requiring the user to move or to manually control a specific electronic device to pick up sound. Further, this helps to improve the accuracy of speech recognition during the voice control process and to improve the user experience.
  • the above method further includes: the third electronic device acquires the sound pickup election information of the multiple electronic devices, where the sound pickup election information of the multiple electronic devices is used to indicate the sound pickup situation of the multiple electronic devices; and the third electronic device elects at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple devices. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, so as to alleviate the influence of factors such as the deployment location of electronic devices, internal noise interference, and external noise interference on the recognition accuracy of the voice assistant in multi-device scenarios, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
  • the above method further includes: the third electronic device sends a sound pickup instruction to the first electronic device, where the sound pickup instruction is used to instruct the first electronic device to pick up sound and send, to the third electronic device, the second to-be-recognized voice obtained by pickup. It can be understood that, under the above sound pickup instruction, the first electronic device knows that it needs to send the to-be-recognized voice obtained by pickup to the third electronic device, without performing subsequent processing such as recognizing the to-be-recognized voice.
  • the above-mentioned sound pickup election information includes at least one of the following: acoustic echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to the wake-up word obtained by pickup, and voice information corresponding to the voice command obtained by pickup, where the voice command is obtained by pickup after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and profile information.
  • the above-mentioned electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to a headset, the third electronic device determines the third electronic device as the first electronic device; the third electronic device determines, as the first electronic device, at least one of the electronic devices that is in the preset scene mode among the multiple electronic devices.
  • if an electronic device is in a device state that is not conducive to its sound pickup, for example its network connection status is poor, a wired or wireless headset is connected, the microphone is already occupied, or it is in airplane mode, this means that the sound pickup effect of the electronic device is difficult to guarantee, or that the electronic device cannot normally cooperate with other devices to pick up sound, for example it cannot normally send the voice data obtained by pickup to other electronic devices. In this way, a sound pickup device with a better sound pickup effect (such as the above-mentioned first electronic device) can be selected according to the above selection steps.
  • the above-mentioned electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices that meet a predetermined noise reduction condition among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the user is less than a first predetermined distance among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the external noise source is greater than a second predetermined distance among the multiple electronic devices as the first electronic device.
  • here, the predetermined noise reduction condition indicates that the SE processing effect of the electronic device is better, for example that AEC is in effect or that the device has an internal noise reduction capability; the first predetermined distance is, for example, 0.5 m, and the second predetermined distance is, for example, 3 m. It can be understood that the sound pickup effect of an electronic device closer to the user is better, the sound pickup effect of an electronic device farther from external noise is better, and an electronic device whose microphone module has better noise reduction performance or whose AEC is in effect also has a better sound pickup effect. In this way, a sound pickup device with a better sound pickup effect (i.e., the above-mentioned first electronic device) can be selected from the multiple devices, for example by a rule-based filter like the sketch below.
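The criteria just listed could be combined into a simple rule-based filter such as the sketch below; the dictionary keys and the equal weighting of the criteria are assumptions made for illustration, while the 0.5 m and 3 m thresholds come from the example values above.

```python
FIRST_PREDETERMINED_DISTANCE_M = 0.5   # maximum distance to the user
SECOND_PREDETERMINED_DISTANCE_M = 3.0  # minimum distance to the external noise source

def elect_pickup_devices(candidates: list) -> list:
    """Return the candidate device(s) that best satisfy the election criteria.
    Each candidate is a dict with illustrative keys."""
    def score(c: dict) -> int:
        s = 0
        if c.get("aec_in_effect"):
            s += 1                                                            # AEC is in effect
        if c.get("meets_noise_reduction_condition"):
            s += 1                                                            # predetermined noise reduction condition
        if c.get("user_distance_m", float("inf")) < FIRST_PREDETERMINED_DISTANCE_M:
            s += 1                                                            # closer than the first predetermined distance
        if c.get("noise_source_distance_m", 0.0) > SECOND_PREDETERMINED_DISTANCE_M:
            s += 1                                                            # farther than the second predetermined distance
        return s

    best = max(score(c) for c in candidates)
    return [c for c in candidates if score(c) == best]
```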
  • the preset network state includes at least one of the following: a network whose communication rate is less than or equal to a predetermined rate, and a network whose disconnection frequency is greater than or equal to a predetermined frequency; the preset scene mode includes at least one of the following: subway mode, airplane mode, driving mode, and travel mode.
  • when the network communication rate is less than or equal to the predetermined rate, or the network disconnection frequency is greater than or equal to the predetermined frequency, the network condition of the electronic device is poor; the specific values of the predetermined rate and the predetermined frequency can be determined according to actual needs. Therefore, an electronic device in the preset network state is generally not suitable for participating in the election of a sound pickup device or for acting as the sound pickup device (e.g., the first electronic device used for sound pickup).
  • the third electronic device uses a neural network algorithm or a decision tree algorithm to elect the first electronic device from the multiple electronic devices. It can be understood that the sound pickup election information of the multiple devices can be used as the input of the neural network algorithm or the decision tree algorithm, which outputs the decision that the first electronic device is the sound pickup device (see the sketch below).
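Purely as an illustration of the decision-tree variant, the sketch below trains scikit-learn's DecisionTreeClassifier on a handful of synthetic election records; the feature encoding, training data, and labels are invented for this example and are not specified by the application.

```python
from sklearn.tree import DecisionTreeClassifier

# Feature vector per device: [aec_in_effect, headset_connected, mic_occupied, wake_word_snr_db]
X_train = [
    [1, 0, 0, 25.0],
    [0, 0, 0, 12.0],
    [0, 0, 0, 18.0],
    [1, 0, 1, 8.0],
]
y_train = [1, 0, 1, 0]  # 1 = was chosen as pickup device, 0 = was not (synthetic labels)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Election: pick the candidate the tree scores as most likely to be a good pickup device.
candidates = {"phone_101a": [1, 0, 0, 27.0], "tablet_102a": [0, 0, 0, 15.0]}
probs = clf.predict_proba(list(candidates.values()))[:, 1]
pickup_device = max(zip(candidates, probs), key=lambda kv: kv[1])[0]
print(pickup_device)  # "phone_101a" under this synthetic data
```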
  • according to a fourth aspect, the present application provides a multi-device-based voice processing method. The method includes: a third electronic device among a plurality of electronic devices obtains sound pickup election information of the multiple electronic devices, where the sound pickup election information is used to indicate the sound pickup situation of the multiple electronic devices; the third electronic device elects at least one electronic device from the multiple electronic devices as a first electronic device used for sound pickup based on the sound pickup election information of the multiple devices, where the first electronic device is the same as or different from the third electronic device; the third electronic device acquires, from the first electronic device, the to-be-recognized voice obtained by the first electronic device; and the third electronic device recognizes the acquired to-be-recognized voice.
  • in this way, even if the elected third electronic device (such as the electronic device closest to the user) has a poor sound pickup effect, multiple devices can collaboratively pick up and recognize voice data with better audio quality, without requiring the user to move or to manually control a specific electronic device to pick up sound.
  • further, this helps to improve the accuracy of speech recognition during the voice control process and to improve the user experience. It also alleviates the influence of factors such as the deployment location of electronic devices and external noise interference on the voice pickup effect and speech recognition accuracy of the voice assistant in multi-device scenarios, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
  • the above-mentioned sound pickup election information includes at least one of the following: acoustic echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to the wake-up word obtained by pickup, and voice information corresponding to the voice command obtained by pickup, where the voice command is obtained by pickup after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and profile information.
  • the electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to a headset, the third electronic device determines the third electronic device as the first electronic device; the third electronic device determines, as the first electronic device, at least one of the electronic devices that is in the preset scene mode among the multiple electronic devices.
  • the electing, by the third electronic device, of at least one electronic device from the multiple electronic devices as the first electronic device based on the sound pickup election information of the multiple electronic devices includes at least one of the following: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices that meet the predetermined noise reduction condition among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the user is less than the first predetermined distance among the multiple electronic devices as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance to the external noise source is greater than the second predetermined distance among the multiple electronic devices as the first electronic device.
  • the preset network state includes at least one of the following: a network whose communication rate is less than or equal to a predetermined rate, and a network whose disconnection frequency is greater than or equal to a predetermined frequency; the preset scene mode includes at least one of the following: subway mode, airplane mode, driving mode, and travel mode.
  • the third electronic device uses a neural network algorithm or a decision tree algorithm to select the first electronic device from multiple electronic devices.
  • the above method further includes: the third electronic device detects that there is a second electronic device among the multiple electronic devices that is playing audio; the third electronic device sends a sharing instruction to the second electronic device, where the sharing instruction is used to instruct the second electronic device to send, to the first electronic device, audio information related to the audio played by the second electronic device, and where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-recognized voice picked up by the first electronic device.
  • in a case where the third electronic device is different from the first electronic device, the method further includes: the third electronic device plays audio; the third electronic device sends, to the first electronic device, audio information related to the audio played by the third electronic device, where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-recognized voice picked up by the first electronic device.
  • the audio information includes at least one of the following: audio data of the played audio, and voice activity detection (VAD) information corresponding to the audio.
  • the present application provides an apparatus. The apparatus is included in an electronic device and has the function of implementing the behavior of the electronic device in the above-mentioned aspects and their possible implementations.
  • the functions can be implemented by hardware, or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules or units corresponding to the above functions. For example, a pickup unit or module (such as a microphone or a microphone array), a receiving unit or module (such as a transceiver), a noise reduction module or unit (such as a processor with the function of the module or unit), and the like.
  • for example, the sound pickup unit or module is used to support the first electronic device among the multiple electronic devices in picking up sound to obtain the first to-be-recognized voice; the receiving unit or module (such as a transceiver) is used to support the first electronic device in receiving, from the second electronic device that is playing audio among the multiple electronic devices, audio information related to the audio played by the second electronic device; and the noise reduction module or unit is used to support the first electronic device in performing noise reduction processing on the first to-be-recognized voice obtained by pickup, according to the audio information received by the receiving unit or module, to obtain the second to-be-recognized voice. An illustrative sketch of these units follows.
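Purely for illustration, the units described above could be modelled as the following abstract interfaces; the class and method names are assumptions of this sketch, and a real apparatus would realize them in hardware (microphone array, transceiver) or in software on a processor.

```python
from abc import ABC, abstractmethod

class PickupUnit(ABC):
    """Supports the first electronic device in picking up sound."""
    @abstractmethod
    def pick_up(self) -> bytes:
        """Return the first to-be-recognized voice as raw audio."""

class ReceivingUnit(ABC):
    """Supports the first electronic device in receiving audio information."""
    @abstractmethod
    def receive_audio_info(self) -> dict:
        """Return audio information sent by the second electronic device that is playing audio."""

class NoiseReductionUnit(ABC):
    """Supports the first electronic device in performing noise reduction."""
    @abstractmethod
    def denoise(self, first_voice: bytes, audio_info: dict) -> bytes:
        """Return the second to-be-recognized voice after noise reduction using the audio information."""
```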
  • the present application provides a readable medium on which instructions are stored; when the instructions are executed on an electronic device, the electronic device is caused to perform the multi-device-based voice processing method in any of the above-mentioned first to fourth aspects.
  • the present application provides an electronic device, comprising: one or more processors and one or more memories; the one or more memories store one or more programs which, when executed by the one or more processors, cause the electronic device to execute the multi-device-based voice processing method in any of the above-mentioned first to fourth aspects.
  • the electronic device may further include a transceiver (which may be a separate or integrated receiver and transmitter) for receiving and transmitting signals or data.
  • the present application provides an electronic device, comprising: a processor, a memory, a communication interface, and a communication bus; the memory is used to store at least one instruction; the processor, the memory, and the communication interface are connected through the communication bus; and when the processor executes the at least one instruction stored in the memory, the electronic device executes the multi-device-based voice processing method in any of the above-mentioned first to fourth aspects.
  • FIG. 1 is a schematic diagram of a scenario of multi-device-based voice processing provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice assistant interaction session provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for multi-device-based voice processing provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of another method for multi-device-based voice processing provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of another method for multi-device-based voice processing provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of another method for multi-device-based voice processing provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another multi-device-based voice processing scenario provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of another method for voice processing based on multiple devices provided by an embodiment of the present application.
  • FIG. 12 shows a schematic structural diagram of an electronic device according to some embodiments of the present application.
  • Illustrative embodiments of the present application include, but are not limited to, multi-device based speech processing methods, media, and electronic devices.
  • the multi-device scenario of the multi-device-based speech processing application provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
  • FIG. 1 shows a multi-device scenario of a multi-device-based voice processing application provided by an embodiment of the present application.
  • the multi-device scenario 10 shows only three electronic devices, namely electronic device 101, electronic device 102, and electronic device 103, but it is understood that the technical solutions of the present application are applicable to a multi-device scenario that can include any number of electronic devices, not limited to three.
  • an answering device may be elected from a plurality of electronic devices, for example, the electronic device 101 may be elected as the answering device.
  • the answering device selects the sound pickup device with the best sound pickup effect (eg, the electronic device with the best voice enhancement effect) from the multiple devices.
  • the electronic device 101 elects the electronic device 103 as the sound pickup device.
  • the voice data picked up by the voice pickup device (such as the electronic device 103) is sent to the answering device, which can receive, recognize, and respond to the voice data, so that the quality of the voice data processed by the answering device is better.
  • in addition, if there is an internal noise device that is playing audio in the multi-device scenario, noise reduction processing can be performed on the voice data picked up by the sound pickup device according to the noise reduction information of the internal noise device, so as to further improve the quality of the voice data processed by the answering device. Therefore, even if the answering device elected in the multi-device scenario (such as the electronic device closest to the user) has a poor sound pickup effect, or there is noise generated by an electronic device that is playing audio, multiple devices can cooperate to pick up and recognize voice data with better audio quality, without requiring the user to move or to manually control a specific electronic device to pick up sound. Further, this helps to improve the accuracy of speech recognition during the voice control process and to improve the user experience.
  • the electronic devices 101-103 in the multi-device scenario 10 are interconnected via a wireless network, for example a wireless fidelity (Wi-Fi) network, Bluetooth (BT), near field communication (NFC), or another wireless network, but not limited thereto.
  • the same group of devices has identification information of each device, so that the group of devices can communicate with each other according to their respective identification information.
  • the types of wireless networks between different electronic devices in a multi-device scenario may be the same or different.
  • the electronic device 101 and the electronic device 102 are connected through a Wi-Fi network, and the electronic devices 101 and 103 are connected through Bluetooth.
  • the types of electronic devices in the multi-device scenario may be the same or different.
  • electronic devices suitable for use in the present application may include, but are not limited to, mobile phones, tablet computers, desktop computers, laptop computers, handheld computers, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), augmented reality (AR)/virtual reality (VR) devices, media players, smart TVs, smart speakers, smart watches, smart headphones, and the like.
  • the types of electronic devices 101-103 shown in FIG. 1 are all different, and are illustrated by taking a mobile phone, a tablet computer, and a smart TV as examples.
  • the embodiments of the present application do not specifically limit the specific form of the electronic device.
  • FIG. 12 For the specific structure of the electronic device, reference may be made to the description corresponding to FIG. 12 below, which will not be repeated here.
  • the electronic devices in the multi-device scenario all have a voice control function, for example, voice assistants with the same wake-up words are installed, for example, the wake-up words are all "Xiaoyi Xiaoyi".
  • the electronic devices in the multi-device scenario are all within the effective working range of the voice assistant.
  • for example, the distance from the user (that is, the pickup distance) does not exceed a preset distance (such as 5 m), the screen is in use (for example, the screen is placed face up, or the screen cover is not closed), Bluetooth is not turned off, the Bluetooth communication range is not exceeded, and so on, but not limited thereto.
  • a voice assistant is an application (APP) based on artificial intelligence. With the help of speech and semantic recognition algorithms, it helps users complete operations such as information query, device control, and text input through instant question-and-answer voice interaction.
  • the voice assistant can be a system application in an electronic device or a third-party application.
  • voice assistants usually use staged cascade processing, in which processes such as voice wake-up, speech enhancement (SE) processing (also called voice front-end processing), automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), text-to-speech (TTS), and response output are performed in sequence to realize the above functions, as sketched below.
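The staged cascade can be pictured, under heavy simplification, as a chain of functions in which each stage consumes the previous stage's output. All of the stage implementations below are stand-in stubs invented for this sketch; none of them reflect a real voice assistant implementation.

```python
def detect_wake_word(audio: str) -> bool:
    return audio.lower().startswith("xiaoyi xiaoyi")       # voice wake-up (stub)

def speech_enhancement(audio: str) -> str:
    return audio                                           # SE / voice front-end processing (stub)

def asr(audio: str) -> str:
    return audio.split(",", 1)[-1].strip()                 # ASR: audio -> text (stub)

def nlu(text: str) -> dict:
    return {"intent": "query", "text": text}               # NLU: text -> intent (stub)

def dialog_management(intent: dict) -> dict:
    return {"action": "answer", **intent}                  # DM: decide the response (stub)

def nlg(action: dict) -> str:
    return "Here is the information you asked about."      # NLG: response -> text (stub)

def tts(reply: str) -> bytes:
    return reply.encode()                                  # TTS: text -> response audio (stub)

def voice_assistant_cascade(audio: str) -> bytes:
    """Staged cascade: wake-up -> SE -> ASR -> NLU -> DM -> NLG -> TTS -> response output."""
    if not detect_wake_word(audio):
        return b""
    return tts(nlg(dialog_management(nlu(asr(speech_enhancement(audio))))))

# Example: voice_assistant_cascade("Xiaoyi Xiaoyi, what's the weather like in Beijing tomorrow?")
```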
  • the voice data picked up by the electronic device in the present application is the voice data directly collected by the microphone, or the voice data processed by SE after the collection, and is used for input to the ASR for processing.
  • the text result of the voice data output by the ASR is the basis for the voice assistant to accurately complete subsequent operations such as recognizing and responding to the voice data. Therefore, the quality of the voice data that the voice assistant obtains and inputs to the ASR affects the accuracy of the voice assistant's recognition of and response to the voice data.
  • in view of this, the embodiment of the present application comprehensively considers a variety of factors in a multi-device scenario when performing the multi-device-based speech processing flow.
  • the factors that affect the pickup effect of electronic equipment include environmental factors 1)-3) and equipment factors 4)-6), as follows:
  • the microphone module of the electronic device such as whether the microphone module is a single microphone or a microphone array, whether it is a near-field microphone array or a far-field microphone array, and the cutoff frequency of the microphone module.
  • the microphone array has a better sound pickup effect than a single microphone.
  • the far-field microphone array has a better sound pickup effect than the near-field microphone array, and the higher the cutoff frequency of the microphone module, the better the sound pickup effect.
  • the SE capability of the electronic device, such as the noise reduction performance of the microphone module of the electronic device and the AEC capability of the electronic device, for example whether the AEC of the electronic device is in effect. If the noise reduction performance of the microphone module is better, or the AEC of the electronic device is in effect, the SE processing effect of the electronic device is better, that is, the sound pickup effect of the electronic device is better.
  • the noise reduction performance of the microphone array is better than that of a single microphone.
  • the device status of the electronic device, such as one or more factors including the device's network connection status, headset connection status, microphone occupancy status, and profile information. For example, if the electronic device is in a device state that is not conducive to its sound pickup, such as a poor network connection status, a connected wired or wireless headset, an already occupied microphone, or airplane mode, the sound pickup effect of the electronic device is difficult to guarantee, or the electronic device cannot normally cooperate with other devices to pick up sound, for example it cannot normally send the voice data obtained by pickup to other electronic devices.
  • FIG. 3 to FIG. 11 propose various embodiments of co-processing speech among multiple electronic devices according to the above-mentioned different influencing factors.
  • FIG. 3 shows a scenario of co-processing voice among multiple electronic devices in different deployment locations.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a are interconnected through a wireless network and are respectively deployed at different distances from the user. For example, they are deployed at positions 0.3 meters (m), 1.5 m, and 3.0 m away from the user, respectively.
  • the mobile phone 101a is held by the user, the tablet computer 102a is placed on the desktop, and the smart TV 103a is wall mounted on the wall.
  • the ambient noise is less than or equal to 20 decibels (dB), and there is no internal noise generated by electronic devices that play external audio in this scenario. Therefore, it is not necessary to consider the influence of external noise and internal noise on the sound pickup effect of electronic devices, but mainly consider the deployment location of electronic devices, such as the influence of which electronic device is closest to the user on the multi-device-based voice processing.
  • FIG. 4 is a flowchart of a specific method for collaboratively processing speech in the scenario shown in FIG. 3 .
  • the process of the method for cooperatively processing speech by the mobile phone 101a, the tablet computer 102a and the smart TV 103a includes:
  • Step 401 The mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively pick up the first voice data corresponding to the wake-up word spoken by the user.
  • the pre-registered wake-up words in the mobile phone 101a, the tablet computer 102a and the smart TV 103a are all "Xiaoyi Xiaoyi”.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a can all detect the voice corresponding to "Xiaoyi Xiaoyi", and then determine whether to wake up the corresponding voice assistant.
  • the electronic device can monitor the corresponding voice data through the microphone and cache it.
  • electronic devices such as the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, in the absence of other software or hardware using the microphone to pick up voice data, can use the microphone to monitor in real time whether the user has voice input, and cache the picked-up voice data as the above-mentioned first voice data.
  • Step 402 The mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively check the picked up first voice data to determine whether the corresponding first voice data is a pre-registered wake-up word.
  • if the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all successfully verify the first voice data, it indicates that the picked-up first voice data is the wake-up word, and the following step 403 can be executed. If the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all fail to verify the first voice data, it indicates that the picked-up first voice data is not the wake-up word, and the following step 409 is performed.
  • electronic devices that successfully verify the first voice data corresponding to the wake-up word can be recorded in a list.
  • for example, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a are recorded in a list (e.g., a candidate answering device list).
  • the devices in the above candidate answering device list will be used to participate in the following multi-device answering election, so as to elect an electronic device that wakes up the voice assistant and recognizes the user's voice (ie, the answering device hereinafter).
  • the multi-device response election is performed among multiple devices that successfully detect the wake-up word, that is, among the electronic devices that successfully verify the first voice data.
  • Step 403 The mobile phone 101a, the tablet computer 102a and the smart TV 103a elect the smart TV 103a as the answering device.
  • the answering device is generally an electronic device that the user is used to or tends to use, or an electronic device that has a high probability of success in recognizing and responding to the user's voice data.
  • the answering device is used to recognize and respond to the user's voice data, such as performing processing steps such as ASR and NLU on the voice data.
  • there is usually only one answering device, that is, one electronic device elected from the list of candidate answering devices. After the elected electronic device (such as the smart TV 103a), as the answering device, wakes up its voice assistant, it can play a wake-up response tone, such as "I'm here".
  • electronic devices other than the answering device, such as the mobile phone 101a and the tablet computer 102a, do not respond, that is, do not output a wake-up response tone.
  • the answering device (such as the smart TV 103a) may perform a cooperative voice pickup election to elect a voice pickup device, and the following step 404 is specifically performed.
  • Step 404 The smart TV 103a obtains the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, respectively, and elects the mobile phone 101a as the sound pickup device according to the sound pickup election information.
  • the sound pickup election information may be a parameter used to determine whether the sound pickup effect of each electronic device is good or bad.
  • the sound pickup election information may include at least one of: the detected voice information of the user's voice (such as the voice information of the first voice data), the microphone module information of each electronic device, the device status information of each electronic device, and the AEC capability information of each electronic device.
  • the information used for the election of the sound pickup device may also include other information, as long as the information that can evaluate the sound pickup function of the electronic device is applicable, and no limitation is imposed herein.
  • the sound information may include a signal-to-noise ratio (Signal to Noise Ratio, SNR), sound intensity (or energy value), reverberation parameters (such as reverberation delay), and the like.
  • therefore, the voice information of the user's voice can be used to elect a sound pickup device, for example as in the sketch below.
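As an illustration of the voice information described above, a candidate device could estimate the SNR and energy of its picked-up wake-word audio as follows; treating the first 0.1 s of the buffer as the noise floor is an assumption of this sketch.

```python
import numpy as np

def wakeword_voice_info(wake_audio: np.ndarray, noise_samples: int = 1600) -> dict:
    """Return illustrative voice information (SNR and energy) for a picked-up wake word,
    assuming 16 kHz audio whose first 0.1 s contains only background noise."""
    noise_power = float(np.mean(wake_audio[:noise_samples].astype(float) ** 2)) + 1e-12
    signal_power = float(np.mean(wake_audio.astype(float) ** 2)) + 1e-12
    return {
        "snr_db": 10.0 * float(np.log10(signal_power / noise_power)),  # signal-to-noise ratio
        "energy": signal_power,                                        # sound intensity / energy value
    }
```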
  • the microphone module information is used to indicate whether the microphone module of the electronic device is a single microphone or a microphone array, whether it is a near-field microphone array or a far-field microphone array, and what the cutoff frequency of the microphone module is.
  • the noise reduction capability of the far-field microphone is higher than that of the near-field microphone, so the sound pickup effect of the far-field microphone is better than that of the near-field microphone.
  • the noise reduction capabilities of the single microphone, the line array microphone and the ring array microphone are successively improved, and the sound pickup effect of the corresponding electronic equipment is successively improved.
  • the higher the cutoff frequency of the microphone module, the better the noise reduction capability and the better the sound pickup effect of the corresponding electronic device. Therefore, the microphone module information can also be used to elect a pickup device.
  • the device status information refers to a device status that can affect the sound pickup effect of multiple electronic devices for cooperative sound pickup, such as network connection status, headphone connection status, microphone occupancy status, and scene mode information.
  • the scene modes include: driving mode, riding mode (such as bus mode, high-speed rail mode or airplane mode, etc.), walking mode, sports mode, home mode and other modes. These scene modes can be automatically determined by the electronic device reading and analyzing sensor information, short messages or emails, setting information or historical operation records and other information of the electronic device.
  • the sensor information is, for example, from a Global Positioning System (GPS), an inertial sensor, a camera, or a microphone.
  • if the headset connection state indicates that a headset is connected, it means that the electronic device is being used by the user and that the headset microphone, which is closer to the user, can be used; if the microphone occupancy state indicates that the microphone module is occupied, it means that the electronic device may not be able to pick up sound through the microphone module; if the network connection status indicates that the wireless network of the electronic device is poor, the success rate of the electronic device transmitting information through the wireless network, such as the success rate of sending the sound pickup election information to the answering device, is affected. If the scene mode is one of the above scene modes, such as driving mode or riding mode, it means that the stability and/or connection rate of the electronic device's wireless network connection may be low, which in turn affects the success rate of the electronic device participating in the pickup election process or the collaborative pickup process. Therefore, the above-mentioned device status information can also be used to elect a sound pickup device.
  • the AEC capability information is used to indicate whether the electronic device has the AEC capability and whether the AEC of the electronic device is valid.
  • the AEC capability is specifically the AEC capability of the microphone module in the electronic device. It can be understood that, compared with electronic devices that do not have AEC in effect or do not have AEC capabilities, electronic devices with AEC in effect have better SE processing capabilities, better noise reduction performance, and better sound pickup effects. Therefore, the above AEC capability information can also be used to elect a pickup device.
  • the electronic equipment for which AEC takes effect is usually the electronic equipment that is playing audio.
  • AEC is a speech enhancement technology that cancels the echo formed when audio played by the speaker travels through the air back to the microphone, thereby improving the quality of the voice data picked up by the electronic device.
  • SE is used to preprocess, by means of hardware or software, the user's voice data collected by the microphone of the electronic device, using audio signal processing algorithms such as reverberation cancellation, AEC, blind source separation and beamforming, so as to improve the quality of the obtained voice data.
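  • The source only names the SE algorithm families, so the following Python sketch shows just the shape of such a preprocessing chain with placeholder stages; the function names and the channel-averaging stand-in for beamforming are assumptions, not the actual algorithms used.
```python
# Sketch of an SE preprocessing chain: beamforming -> AEC -> dereverberation.
# Stage bodies are placeholders; only the data flow is illustrated.
from typing import Optional
import numpy as np

def beamform(frames: np.ndarray) -> np.ndarray:
    # Naive stand-in for beamforming: average the microphone channels.
    return frames.mean(axis=0)

def acoustic_echo_cancel(mono: np.ndarray, playback_ref: Optional[np.ndarray]) -> np.ndarray:
    # Placeholder: a real AEC would adaptively subtract the echo of `playback_ref`.
    return mono

def dereverberate(mono: np.ndarray) -> np.ndarray:
    # Placeholder for reverberation cancellation.
    return mono

def speech_enhance(mic_frames: np.ndarray, playback_ref: Optional[np.ndarray] = None) -> np.ndarray:
    """mic_frames: (channels, samples) raw capture; returns one enhanced channel."""
    return dereverberate(acoustic_echo_cancel(beamform(mic_frames), playback_ref))
```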
  • the smart TV 103a can elect a sound pickup device based on the sound pickup election information of each electronic device, and the specific election scheme will be described in detail below. For the convenience of description, it is assumed that the smart TV 103a elects the mobile phone 101a as the sound pickup device.
  • remote peripheral virtualization technology can be used, with the sound pickup device, or the microphone of the sound pickup device, serving as a virtual peripheral node of the answering device that is called by the voice assistant running on the answering device to complete the subsequent cross-device sound pickup process.
  • the answering device may send a voice pickup instruction to the electronic device to instruct the electronic device to pick up the user's voice data.
  • the answering device may send a voice-picking stop instruction to other electronic devices other than the voice-picking device in the multi-device scenario, so as to instruct these electronic devices to no longer pick up the user's voice data.
  • if the other electronic devices in the multi-device scenario, other than the sound pickup device, do not receive any indication within a period of time (such as 5 seconds) after sending the sound pickup election information to the answering device, these electronic devices determine that they are not the sound pickup device.
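  • The timeout behaviour described above can be pictured with the small Python sketch below; the queue-based messaging, the role strings and the use of the 5-second example value are illustrative assumptions.
```python
# Sketch: a non-elected device waits a bounded time for an instruction, then
# concludes that it is not the sound pickup device.
import queue
import threading

PICKUP_TIMEOUT_S = 5.0  # "a period of time (such as 5 seconds)"

def await_role(instructions: "queue.Queue[str]") -> str:
    """Returns 'pickup', 'stop', or 'not_pickup' if nothing arrives in time."""
    try:
        return instructions.get(timeout=PICKUP_TIMEOUT_S)
    except queue.Empty:
        return "not_pickup"

if __name__ == "__main__":
    q: "queue.Queue[str]" = queue.Queue()
    # Simulate the answering device sending a pickup instruction after one second.
    threading.Timer(1.0, lambda: q.put("pickup")).start()
    print(await_role(q))  # -> 'pickup'
```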
  • Step 405 The mobile phone 101a picks up the second voice data corresponding to the voice command spoken by the user.
  • the mobile phone 101a is used as the sound pickup device to pick up the various voice commands spoken by the user. For example, when the user speaks the voice command "What's the weather like in Beijing tomorrow?", the mobile phone 101a either collects the voice command directly through its microphone module to obtain the second voice data, or the microphone module in the mobile phone 101a collects the voice command and the second voice data is obtained after SE processing.
  • the "voice instruction” that appears alone in this embodiment of the present application may be a voice instruction corresponding to an event or operation received by the electronic device after waking up the voice assistant.
  • the user's voice instruction is the above-mentioned "how is the weather tomorrow?" or "play music”.
  • the terms "voice", "voice instruction" and "voice data" in this document are sometimes used interchangeably; it should be noted that, where the distinction is not emphasized, they express the same meaning.
  • Step 406 The mobile phone 101a sends the second voice data to the smart TV 103a.
  • the mobile phone 101a, as the sound pickup device, directly forwards the voice data of the voice command to the answering device after picking up the voice command issued by the user, and does not itself recognize or respond to the voice command.
  • if the answering device and the sound pickup device are the same device, this step is not required, and the device directly performs speech recognition on the voice data after picking up the user's voice command.
  • Step 407 The smart TV 103a recognizes the second voice data.
  • after the smart TV 103a, as the answering device, receives the voice data picked up by the mobile phone 101a, it can recognize the noise-reduced second voice data through the hierarchical processing procedures of ASR, NLU, DM, NLG and TTS.
  • ASR can convert the SE-processed second voice data into corresponding text, and perform text processing such as normalization, error correction and conversion of spoken language into written form, for example obtaining the text "How will the weather in Beijing be tomorrow?".
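  • The hierarchical ASR/NLU/DM/NLG/TTS processing can be pictured as a simple staged pipeline; the Python sketch below only illustrates the data flow, with stand-in stage bodies and an example utterance, none of which come from the source.
```python
# Sketch of the staged recognition pipeline: ASR -> NLU -> DM -> NLG -> TTS.
# Stage bodies are stand-ins; only the chaining is illustrated.

def asr(voice_data: bytes) -> str:
    return "How will the weather in Beijing be tomorrow?"  # stand-in transcription

def nlu(text: str) -> dict:
    return {"intent": "query_weather", "city": "Beijing", "date": "tomorrow"}

def dialog_manage(semantics: dict) -> dict:
    return {"action": "answer", "content": "Tomorrow will be sunny in Beijing"}

def nlg(decision: dict) -> str:
    return decision["content"]

def tts(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in synthesized audio

def recognize_and_respond(second_voice_data: bytes) -> bytes:
    return tts(nlg(dialog_manage(nlu(asr(second_voice_data)))))
```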
  • Step 408 The smart TV 103a responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
  • for the recognized voice command of the user, if the answering device can execute it, or it can only be executed by the answering device, the answering device makes the response corresponding to the voice command. For example, for the above-mentioned voice command "How will the weather in Beijing be tomorrow?", the smart TV 103a answers "Tomorrow will be sunny in Beijing", and for the voice command "Please turn off the TV", the smart TV 103a performs the shutdown operation.
  • the above-mentioned voice "Tomorrow will be sunny in Beijing” is the answer voice output by the answering device through the TTS.
  • the answering device can also control the system software, display screen, vibration motor and other software and hardware to perform answering operations, such as displaying the answer text generated by NLG through the display screen.
  • the answering device can send the voice commands to the corresponding electronic devices after recognizing the voice commands. For example, for the voice command "open the curtains", after the smart TV 103a recognizes that the response operation is to open the curtains, it can send the operation instruction to open the curtains to the smart curtains, so that the smart curtains can complete the action of opening the curtains through hardware.
  • the above-mentioned other electronic devices may be Internet of Things (The Internet of Things, IOT) devices, such as smart home devices such as smart refrigerators, smart water heaters, and smart curtains.
  • the above-mentioned other electronic devices may not have a voice control function, for example, no voice assistant is installed; in that case, the other electronic devices perform the operations corresponding to the user's voice commands when triggered by the answering device.
  • the user can continue to speak the subsequent voice command data stream, such as the voice command "What clothes should you wear tomorrow?".
  • Step 409 The mobile phone 101a, the tablet computer 102a and the smart TV 103a do not respond to the first voice data, and delete the cached first voice data.
  • the wake-up response voice "I am here" will not be output to the user.
  • the user continues to speak a voice command, such as "How is the weather in Beijing tomorrow?", these devices will not respond to the voice data corresponding to the voice command.
  • in this way, after the user speaks a voice command, the user does not need to specifically operate an electronic device to pick up the voice command (for example, the voice command corresponding to the second voice data); instead, the answering device automatically uses the sound pickup device as a peripheral to pick up the user's voice command, and the voice control function is realized through the answering device's response to the user's voice command.
  • the multi-device-based voice processing method provided by the embodiments of the present application can, through the interaction and cooperation of multiple electronic devices, select the electronic device with the best audio quality for picking up the voice command as the sound pickup device, so as to support the answering device in completing the voice interaction process with the user through the voice assistant.
  • the pickup device can be an electronic device that is closest to the user and has better SE processing capability.
  • the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, the impact of the deployment location of electronic devices on the recognition accuracy of voice assistants in multi-device scenarios is alleviated, and the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios are improved.
  • the electronic device in the multi-device scenario may perform multi-device response election according to at least one of the following response election strategies, and elect a response device:
  • Answering strategy 1 Elect the electronic device closest to the user as the answering device.
  • the mobile phone 101a can be elected as the answering device.
  • the distance between the electronic device and the user can be represented by the sound information of the voice data corresponding to the wake-up word picked up by the electronic device. For example, the higher the signal-to-noise ratio of the first voice data, the higher the sound intensity, and the lower the reverberation delay, it means that the electronic device is closer to the user.
  • Answering strategy 2 Elect the electronic device actively used by the user as the answering device.
  • if the electronic device is actively used by the user, for example, the user has recently picked it up and lit its screen, it means that the user may be using the electronic device and is more inclined to use it to recognize and respond to the user's voice data.
  • whether the electronic device is actively used by the user can be characterized by the device usage record information.
  • the device usage record information includes at least one of the following: screen-on duration, screen-on frequency, frequency of using the voice assistant, and the like. It can be understood that the longer the screen-on duration, the higher the screen-on frequency and the higher the frequency of using the voice assistant, the higher the degree to which the user actively uses the electronic device. For example, according to the device usage record information of the mobile phone 101a, the tablet computer 102a and the smart TV 103a, the smart TV 103a, which is actively used by the user, can be elected as the answering device.
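  • As a sketch of response strategy 2, the device usage record could be reduced to a single score and the highest scorer elected; the field names, weights and example numbers below are assumptions for illustration only.
```python
# Sketch: elect the answering device from device usage records.

def usage_score(record: dict) -> float:
    return (
        1.0 * record.get("screen_on_minutes", 0)
        + 5.0 * record.get("screen_on_count", 0)
        + 10.0 * record.get("voice_assistant_uses", 0)
    )

def elect_answering_device(usage_records: dict) -> str:
    return max(usage_records, key=lambda dev: usage_score(usage_records[dev]))

if __name__ == "__main__":
    records = {
        "phone_101a": {"screen_on_minutes": 30, "screen_on_count": 5, "voice_assistant_uses": 1},
        "tv_103a": {"screen_on_minutes": 180, "screen_on_count": 2, "voice_assistant_uses": 6},
    }
    print(elect_answering_device(records))  # -> tv_103a with these assumed numbers
```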
  • Answering strategy 3 Election of an electronic device equipped with a far-field microphone array as the answering device.
  • whether the electronic device is equipped with a far-field microphone array is characterized by the microphone module information.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a select the smart TV 103a equipped with the far-field microphone array as the answering device according to the microphone module information.
  • Answering strategy 4 Election of public equipment as answering equipment.
  • whether the electronic device is a public device can also be characterized by the public device indication information.
  • the public device indication information of the smart TV 103a indicates that the smart TV 103a is a public device, and the multi-device scenario 11 elects the smart TV 103a as the answering device.
  • for other details of response strategy 4, reference may be made to the relevant description of response strategy 3, which will not be repeated here.
  • any one of the electronic devices may be selected as the answering device.
  • any electronic device that successfully checks the first voice data may also be selected as the response device.
  • any electronic device in the multi-device scenario may act as a master device to perform the step of electing a response device.
  • the mobile phone 101a as the master device elects the smart TV 103a as the response device, and sends a response instruction to the smart TV 103a to instruct the smart TV 103a to subsequently recognize and respond to the voice data corresponding to the user's voice command.
  • the master device may also send instructions to the other electronic devices in the multi-device scenario, except the answering device, so as to instruct these electronic devices not to recognize the user's voice command.
  • if the other electronic devices in the multi-device scenario, other than the answering device, do not receive any indication within a preset time (for example, 10 seconds) after successfully verifying the first voice data, these electronic devices determine that they are not the answering device.
  • each electronic device in a multi-device scenario may perform the operation of an election answering device.
  • the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all perform multi-device response election, and respectively elect the smart TV 103a as the response device.
  • the smart TV 103a can determine that it is an answering device, and then wake up the voice assistant to recognize and respond to the voice data corresponding to the user's voice command.
  • the mobile phone 101a and the tablet computer 102a respectively determine that they are not answering devices, and do not recognize and respond to the user's voice command.
  • the electronic device performing the multi-device response election obtains the response election information of each electronic device in the multi-device scenario, and elects the response device according to the response election information.
  • the response election information of an electronic device includes at least one of the following: sound information of the first voice data, device usage record information, microphone module information, and public device indication information, but is not limited thereto.
  • after the answering device obtains the response election information of each electronic device, the information may be cached.
  • the smart TV 103a may receive the corresponding voice-picking election information respectively sent by the mobile phone 101a and the tablet computer 102a, and read its own voice-picking election information.
  • the sending order of the sound pickup election information corresponding to the mobile phone 101a and the tablet computer 102a, and the sending order of the different items within each piece of sound pickup election information, are not limited and can be any achievable sending order.
  • if the answering device has already calculated and cached some information of each electronic device in the above step 403, such as the sound information of the first voice data, then in step 404 the cached information can be read without recomputing it.
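  • The election information and the caching behaviour described above could be carried by a simple record type plus a cache, as in the Python sketch below; the field names and the get_or_compute interface are assumptions, not the actual data structures of the source.
```python
# Sketch: a record for per-device election information and a cache that avoids
# recomputing values already produced in an earlier step.
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class ElectionInfo:
    device_id: str
    sound_info: Optional[dict] = None    # e.g. SNR, intensity, reverberation delay
    usage_record: Optional[dict] = None
    mic_module: Optional[dict] = None
    is_public_device: Optional[bool] = None

class ElectionCache:
    def __init__(self) -> None:
        self._store: Dict[str, ElectionInfo] = {}

    def get_or_compute(self, device_id: str,
                       compute: Callable[[str], ElectionInfo]) -> ElectionInfo:
        if device_id not in self._store:   # reuse what an earlier step already produced
            self._store[device_id] = compute(device_id)
        return self._store[device_id]
```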
  • the embodiment of the present application can comprehensively consider different information in the sound pickup election information corresponding to the electronic device, that is, different factors affecting the sound pickup effect of the electronic device, and set a sound pickup election strategy to compare the sound pickup effect in a multi-device scenario.
  • the multi-device voice pickup election is performed between multiple devices that successfully detect the wake-up word, that is, between electronic devices that successfully verify the first voice data.
  • the devices in the above-mentioned candidate answering device list may be used to participate in a multi-device sound pickup election to elect a sound pickup device.
  • the candidate answering device list may be referred to as a candidate sound pickup device list.
  • the electronic devices in the candidate sound pickup device list can all be used as candidate sound pickup devices, that is, electronic devices that participate in the sound pickup election based on the sound pickup election information; for example, the above-mentioned mobile phone 101a, tablet computer 102a and smart TV 103a can all be used as candidate sound pickup devices.
  • an electronic device with a better sound pickup effect in the above-mentioned candidate sound pickup device list may be used as a sound pickup device through an end-to-end method such as an artificial neural network, an expert system, etc., using a sound pickup election strategy. Specifically, taking the sound pickup election information corresponding to each electronic device in the candidate sound pickup device list as the input of the artificial neural network or the expert system, the output result of the artificial neural network or the expert system is the sound pickup device.
  • for example, the output result of the artificial neural network or the expert system is the mobile phone 101a, that is, the mobile phone 101a is elected as the sound pickup device.
  • the above artificial neural network can be a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a long short term memory network (Long Short Term Memory, LSTM) or a recurrent neural network (Recurrent Neural Network, RNN), etc., which are not specifically limited in the embodiments of the present application.
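  • The end-to-end idea above can be pictured as feeding a quantified feature vector per device into a small network and electing the highest-scoring device; the toy network below uses untrained, randomly initialized weights and assumed features, purely to illustrate the interface.
```python
# Sketch: score each device's quantified election information with a tiny MLP
# and elect the device with the highest score. Weights are illustrative, not trained.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # assumes 4 input features per device
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def score(features: np.ndarray) -> float:
    hidden = np.tanh(W1 @ features + b1)
    return float((W2 @ hidden + b2)[0])

def elect_pickup(feature_vectors: dict) -> str:
    """feature_vectors maps device -> quantified election information vector."""
    return max(feature_vectors, key=lambda dev: score(feature_vectors[dev]))

if __name__ == "__main__":
    devices = {
        "phone_101a": np.array([0.9, 0.8, 0.1, 1.0]),   # e.g. SNR, intensity, reverb, AEC flag
        "tablet_102a": np.array([0.5, 0.4, 0.3, 0.0]),
    }
    print(elect_pickup(devices))
```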
  • the method of cascading processing in stages can be used to implement a sound pickup election strategy to use an electronic device with a better sound pickup effect in the candidate sound pickup device list as a sound pickup device.
  • feature extraction or numerical quantification may be performed on each parameter vector (that is, each item of sound pickup election information) corresponding to each electronic device in the candidate sound pickup device list, and then an algorithm such as a decision tree or logistic regression may be used to make the decision and output the selection result of the sound pickup device.
  • for example, feature extraction or numerical quantification can be performed on each parameter vector in the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a, and then an algorithm such as a decision tree or logistic regression decides and outputs the mobile phone 101a as the selection result, that is, the mobile phone 101a is elected as the sound pickup device.
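  • The cascaded approach can likewise be sketched as quantify-then-decide; the hand-written logistic scoring below merely stands in for a trained decision tree or logistic regression, and all feature names, weights and scalings are assumptions.
```python
# Sketch: quantify each election parameter, then decide with a logistic-style score.
import math

def quantify(info: dict) -> list:
    return [
        info.get("snr_db", 0.0) / 40.0,
        info.get("intensity_db", 0.0) / 90.0,
        1.0 - min(info.get("reverb_delay_ms", 0.0) / 500.0, 1.0),
        1.0 if info.get("aec_active") else 0.0,
    ]

def logistic_score(features: list, weights=(2.0, 1.0, 1.0, 1.5), bias=-1.0) -> float:
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def cascade_elect(election_info: dict) -> str:
    """election_info maps device -> raw sound pickup election information."""
    return max(election_info, key=lambda dev: logistic_score(quantify(election_info[dev])))
```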
  • the answering device may perform a multi-device pickup election process through at least one of the first type of coordinated pickup strategy and the second type of coordinated pickup strategy.
  • the process may include a two-part process.
  • the first part of the process is that the answering device first removes, through the first type of cooperative sound pickup strategy, some disadvantaged devices that are obviously not suitable for participating in the subsequent cooperative sound pickup from the list of candidate sound pickup devices, or directly decides to select the answering device as the most suitable sound pickup device.
  • the second part of the process is that the answering device selects the electronic device with better sound pickup effect as the sound pickup device according to the sound pickup election information corresponding to each electronic device in the candidate sound pickup device list through the second type of cooperative sound pickup strategy. It can be understood that if the first part of the process does not decide a sound pickup device, then the second part of the process will be executed to elect a sound pickup device.
  • the above-mentioned first type of cooperative voice pickup strategy may include at least one of the following strategies a1) to a6).
  • a1) Determine the electronic device that is connected to the headset and is not an answering device as a non-candidate pickup device.
  • the state that the electronic device has been connected to the earphone is indicated by the earphone connection state information. Specifically, if an electronic device is connected to a wired or wireless headset and is not an answering device, since the electronic device only supports close-range pickup through the headset microphone, the electronic device has a high probability of being far away from the user or not currently being used by the user, Selecting this device as a voice pickup device may cause speech recognition failure, so the electronic device is marked as a non-candidate voice pickup device that is not suitable for participating in the multi-device voice pickup election, and is removed from the candidate voice pickup device list. It can be understood that non-candidate pickup devices will not participate in the multi-device pickup election, that is, they will not be elected as pickup devices.
  • a state in which the electronic device is in a preset network state is indicated by the network connection state information. Specifically, if the network status of an electronic device is poor (such as a low network communication rate, a weak wireless network signal, or frequent recent network disconnections) and it is not the answering device, then, in order to prevent data from being lost or delayed when the electronic device is called by the answering device, which would affect the subsequent cooperative sound pickup and voice interaction process, the electronic device is marked as a non-candidate device that is not suitable for participating in the multi-device sound pickup election and is removed from the candidate sound pickup device list.
  • the fact that the microphone module is in an occupied state is indicated by the microphone occupancy information. If the microphone module of an electronic device is occupied by an application other than a voice assistant (such as a voice recorder), and it is not an answering device, it is regarded as a non-candidate pickup device and removed from the list of candidate pickup devices. Specifically, if the microphone module of the electronic device is occupied by other applications, indicating that the electronic device may not be able to use the microphone module for sound pickup, the electronic device is marked as a device that is not suitable for participating in collaborative sound pickup.
  • if the network connection status of the answering device is poor, then, in order to avoid failures when the answering device calls other candidate sound pickup devices, the answering device is directly decided to be the most suitable sound pickup device, and the answering device, as the sound pickup device, calls its local microphone module for the subsequent sound pickup.
  • the answering device has been connected to a wired or wireless headset, then the answering device has a high probability of being the device closest to the user or the device being used by the user, so it is directly decided to select the answering device as a sound pickup device.
  • if the current scene is a preset scene mode (such as subway mode, flight mode, driving mode or travel mode), the electronic device corresponding to that scene mode can be directly decided as the sound pickup device to ensure system performance.
  • the electronic device with better noise reduction capability of the microphone can be fixedly selected as the sound pickup device.
  • in the travel mode, in order to avoid increased communication power consumption and reduced battery life of the devices, the answering device can be fixedly selected as the sound pickup device.
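  • Taken together, the first type of cooperative sound pickup strategy behaves like a filtering pass, as in the Python sketch below; the status field names and the subset of strategies shown are assumptions used only to illustrate the two possible outcomes (a direct decision or a reduced candidate list).
```python
# Sketch: first-stage filtering over the candidate sound pickup device list.
# Returns either a directly decided pickup device or the pruned candidate list.
from typing import List, Optional, Tuple

def first_stage(status: dict, answering: str) -> Tuple[Optional[str], List[str]]:
    """status maps device -> device status flags (assumed names)."""
    answering_status = status[answering]
    # Strategies that decide the answering device directly (poor network, headset in use).
    if answering_status.get("network_poor") or answering_status.get("headset_connected"):
        return answering, []
    remaining = []
    for device, flags in status.items():
        if device != answering and (flags.get("headset_connected")
                                    or flags.get("network_poor")
                                    or flags.get("mic_busy")):
            continue  # unsuitable for cooperative pickup: drop from the candidate list
        remaining.append(device)
    return None, remaining
```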
  • the above-mentioned second type of cooperative sound pickup strategy may include at least one of strategies b1) to b4).
  • the electronic device whose AEC capability information in the candidate sound pickup device list indicates that the AEC is valid is used as the sound pickup device, and the sound pickup effect of the electronic device with the AEC valid is better.
  • the electronic devices for which AEC takes effect are usually electronic devices that are playing audio.
  • if an electronic device is playing audio but does not have the AEC capability or its AEC does not take effect, the played audio will seriously interfere with the electronic device's own sound pickup effect.
  • if the electronic device that is playing audio externally has the capability of reducing internal noise and its AEC is in effect, then the influence of the internal noise generated by the external audio on its sound pickup effect can be eliminated.
  • the electronic device whose microphone module information in the candidate sound pickup device list indicates a better noise reduction capability of the microphone module is used as the sound pickup device.
  • for example, the electronic device with a far-field microphone array acts as the sound pickup device. Specifically, by judging whether the microphone module of the electronic device is a near-field microphone or a far-field microphone, an electronic device with a far-field microphone, which has better noise reduction capability, can be selected as the sound pickup device.
  • the electronic device closest to the user in the candidate sound pickup device list is used as the sound pickup device.
  • if the voice data (such as the first voice data) corresponding to the user's voice picked up by an electronic device in the candidate sound pickup device list has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay, it indicates that the electronic device is closest to the user and has the best sound pickup effect.
  • the electronic device that is farthest from the external noise source in the candidate sound pickup device list is used as the sound pickup device.
  • if the voice data (such as the first voice data) corresponding to the user's voice picked up by an electronic device in the candidate sound pickup device list has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay, it indicates that the electronic device is farthest from the external noise source and has the best sound pickup effect.
  • the above-mentioned sound pickup election strategies (such as the first type of cooperative sound pickup strategy and the second type of cooperative sound pickup strategy) include, but are not limited to, the above examples. Specifically, for cases in which the sound pickup device satisfies multiple of strategies a1) to a6) and strategies b1) to b4), reference can be made to the above related descriptions in which the sound pickup device satisfies each sound pickup election strategy respectively, which will not be repeated here.
  • in addition, different priorities can be set in advance for different sound pickup election strategies, and the sound pickup device is elected preferentially according to the sound pickup election strategy with the higher priority.
  • the priority of a sound pickup election strategy may be the priority of a single sound pickup election strategy, or the priority of a combination of multiple sound pickup election strategies.
  • the priority of the combination of strategy b1) and strategy b3) is greater than the priority of strategy b3).
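  • The priority mechanism can be sketched as an ordered list of strategy functions, where the first strategy that yields a decision wins; the two strategy stand-ins and their ordering below are assumptions for illustration.
```python
# Sketch: apply second-type election strategies in priority order.

def by_aec(info: dict):      # stands in for strategy b1)
    live = [dev for dev, i in info.items() if i.get("aec_active")]
    return live[0] if len(live) == 1 else None

def by_closest(info: dict):  # stands in for strategy b3), using SNR as a proxy
    return max(info, key=lambda dev: info[dev].get("snr_db", 0.0))

PRIORITISED = [by_aec, by_closest]   # assumed priority order

def elect_with_priorities(info: dict) -> str:
    for strategy in PRIORITISED:
        winner = strategy(info)
        if winner:
            return winner
    return next(iter(info))          # fall back to any remaining candidate
```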
  • for example, in the multi-device scenario 11, the smart TV 103a is used as the answering device, and, according to the above strategies b2) and b3), the mobile phone 101a, which has better SE processing capability, stronger voice intensity or the highest signal-to-noise ratio, and the lowest reverberation delay, can be elected from the mobile phone 101a, the tablet computer 102a and the smart TV 103a as the sound pickup device.
  • the mobile phone 101a is the electronic device closest to the user, that is, 0.3 m away from the user. In this way, the influence of the deployment position of the electronic device on the sound pickup effect of the electronic device in the multi-device scenario can be avoided.
  • in some cases, the sound information of the voice data of the wake-up word, such as its sound intensity and signal-to-noise ratio, may not accurately reflect how well the electronic device will pick up the user's subsequent voice command. Therefore, the sound information of the voice data corresponding to the voice command uttered by the user after the wake-up word can be used as sound pickup election information to elect a sound pickup device.
  • Fig. 5 shows a flowchart of another multi-device-based voice processing method.
  • this method flow differs from the method flow shown in Fig. 4 in that the sound information of the voice data corresponding to the voice command spoken by the user is added to the sound pickup election information.
  • Steps 501 to 503 are the same as the above-mentioned steps 401 to 403, and are not repeated here.
  • Step 504 The smart TV 103a obtains the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
  • Step 505 The mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively pick up the voice command spoken by the user, and select the voice data corresponding to the voice within the first duration of the voice command as the third voice data.
  • the first duration is X seconds (eg, 3s).
  • the voice corresponding to the third voice data is any segment of the voice command spoken by the user whose duration is the first duration.
  • the voice within the above-mentioned first duration is the voice within the first X seconds in the voice command spoken by the user.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up only the third voice data, and do not pick up the voice data corresponding to the part of the voice command spoken by the user after the first X seconds.
  • the above-mentioned third voice data may be a piece of voice command uttered by the user before the voice command corresponding to the above-mentioned second voice data (for example, "How is the weather in Beijing tomorrow?").
  • in this way, each electronic device in the multi-device scenario can avoid the waste of device resources caused by performing the step of picking up the voice command spoken by the user for a long time.
  • the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively pick up the voice data corresponding to a complete voice command spoken by the user, such as the second voice data corresponding to "What's the weather in Beijing tomorrow?", and then select the third voice data of the first X seconds from the second voice data.
  • the above-mentioned third voice data may also be the voice data of the first X seconds counted from the beginning of the voice command corresponding to the above-mentioned second voice data (such as "How is the weather in Beijing tomorrow?") spoken by the user; for example, the third voice data corresponds to "Tomorrow ...".
  • Step 506 The smart TV 103a obtains the sound information of the third voice data obtained by the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
  • the sound information of the third speech data includes at least one of the following: a signal-to-noise ratio, a sound intensity (or energy value), a reverberation parameter, and the like.
  • the higher the signal-to-noise ratio, the higher the sound intensity and the lower the reverberation delay of the third voice data detected by the electronic device, the better the quality of the third voice data and the closer the electronic device is to the user.
  • the sound information of the third voice data can be used as sound pickup election information for electing the sound pickup device.
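  • Comparing the sound information of the third voice data across devices can be sketched as a simple lexicographic ranking; the field names, the ordering of criteria and the example figures are assumptions for illustration.
```python
# Sketch: prefer the device whose third voice data has the best SNR, intensity
# and (lowest) reverberation delay.

def sound_rank(sound: dict) -> tuple:
    return (
        sound.get("snr_db", 0.0),
        sound.get("intensity_db", 0.0),
        -sound.get("reverb_delay_ms", 0.0),   # lower delay ranks higher
    )

def closest_device(third_voice_sound: dict) -> str:
    return max(third_voice_sound, key=lambda dev: sound_rank(third_voice_sound[dev]))

if __name__ == "__main__":
    measured = {
        "phone_101a": {"snr_db": 28.0, "intensity_db": 72.0, "reverb_delay_ms": 40.0},
        "tv_103a": {"snr_db": 15.0, "intensity_db": 60.0, "reverb_delay_ms": 120.0},
    }
    print(closest_device(measured))  # -> phone_101a
```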
  • the mobile phone 101a and the tablet computer 102a can separately obtain the sound information of the third voice data, and then send the sound information of the third voice data to the smart TV 103a.
  • the mobile phone 101a and the tablet computer 102a can respectively send the detected third voice data to the smart TV 103a, and then the smart TV 103a calculates the sound information of the third voice data corresponding to the mobile phone 101a and the tablet computer 102a respectively.
  • Step 507 The smart TV 103a adds the voice information of the third voice data to the voice-picking election information, and elects the mobile phone 101a as the voice-picking device according to the voice-picking election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
  • steps 504 to 507 are similar to the above-mentioned step 404, and the similarities will not be repeated.
  • the difference is that the smart TV 103a additionally obtains the voice data corresponding to the first X seconds of the voice command spoken by the user (that is, the third voice data), so that the smart TV 103a can determine, according to the sound information of the third voice data detected by each electronic device, that the sound pickup device is the mobile phone 101a.
  • in some embodiments, in step 507, the smart TV 103a judges, according to the sound information of the third voice data corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively, whether the smart TV 103a satisfies the above election strategy b3) and/or b4). Specifically, if it is determined according to the sound information of the third voice data that the smart TV 103a is the electronic device closest to the user or the electronic device farthest from the noise, then the smart TV 103a is elected as the sound pickup device.
  • the third voice data detected in the candidate sound pickup device list has the highest sound intensity, the highest signal-to-noise ratio, and/or the lowest reverberation delay, indicating that the electronic device is closest to the user. At this time, the electronic device detects that the quality of the third voice data is the best, and the pickup effect is the best.
  • Steps 508-512 are similar to the above-mentioned steps 405-409, and are not repeated here.
  • in this way, when electing the sound pickup device in the multi-device scenario, the sound pickup device can be elected not only according to information such as the sound information of the first voice data corresponding to the user's wake-up word, but also according to the sound information of the voice data corresponding to the user's voice command (for example, the sound information of the third voice data corresponding to the voice of the first X seconds).
  • by using the sound information corresponding to the user's voice command as sound pickup election information to elect the sound pickup device, the accuracy of electing the sound pickup device can be further improved, thereby improving the accuracy of speech recognition in the multi-device scenario.
  • in some embodiments, when electing the sound pickup device, the distance between the electronic device and the external noise source, that is, the influence of external noise on the sound pickup effect of the electronic device, can be mainly considered. It can be understood that, if different electronic devices are of the same type and at the same distance from the user, the sound pickup effects of these electronic devices are otherwise the same.
  • FIG. 6 shows a multi-device scenario for multi-device-based speech processing under external noise interference. In this scenario, the mobile phone 101b, the mobile phone 102b and the smart TV 103b are interconnected through a wireless network and deployed at positions 1.5 m, 1.5 m and 3.0 m away from the user, respectively. The mobile phone 101b and the mobile phone 102b may be placed idle on a desktop, and the smart TV 103b may be wall-mounted.
  • the multi-device scenario 12 there is an external noise source 104 near the mobile phone 102b.
  • the external noise source can be a running air conditioner or other device that plays audio. Therefore, in this scenario, the influence of the external noise source 104 on the sound pickup effect of each electronic device is mainly considered, and the flow of voice processing based on multiple devices is performed.
  • FIG. 7 is a flowchart of a specific method for collaboratively processing speech based on FIG. 6 .
  • the process of the method for the mobile phone 101b, the mobile phone 102b, and the smart TV 103b to cooperatively process voice includes:
  • Steps 701 to 709 are similar to the above-mentioned steps 401 to 409, and the similarities are not repeated.
  • the execution subject changes.
  • the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a and smart TV 103a to mobile phone 101b, mobile phone 102b and smart TV 103b.
  • the answering device selected in step 703 is the smart TV 103b
  • the sound pickup device selected in step 704 is the mobile phone 101b.
  • in the multi-device scenario 12, the mobile phone 101b and the mobile phone 102b are both 1.5 m away from the user, so both are the electronic devices closest to the user.
  • the factor that distinguishes the sound pickup effects of the mobile phone 101b and the mobile phone 102b is only the distance from the external noise source 104 .
  • the distance between the mobile phone 101b and the external noise source 104 is farther.
  • for example, in the multi-device scenario 12, the smart TV 103b is used as the answering device, and, according to an election strategy that favors devices far away from external noise, such as strategy b4), the mobile phone 101b, which is farthest from the external noise source and has the highest voice sound intensity or signal-to-noise ratio and the lowest reverberation delay, is elected as the sound pickup device. In this way, the influence of external noise on the sound pickup effect of the electronic devices in the multi-device scenario can be avoided.
  • in some embodiments, the smart TV 103b in the multi-device scenario 12 can also acquire the sound information of the third voice data, corresponding to the voice within the first duration of the voice command spoken by the user, picked up by each electronic device, add that sound information to each item of sound pickup election information in step 705, and elect the mobile phone 101b, which is far away from the external noise source 104, as the sound pickup device, which will not be repeated here.
  • in this way, the sound pickup device may have one or more of the advantages of being closest to the user, having the capability of reducing internal noise (such as SE processing capability), and being far away from external noise sources.
  • the influence of external noise interference on the recognition accuracy of the voice assistant in the multi-device scenario can be alleviated, and the user interaction experience and the environmental robustness of speech recognition in the multi-device scenario can be improved.
  • the influence of the internal noise on the sound pickup effect of the multi-device cooperative sound pickup can be mainly considered. For example, by using an electronic device that emits audio as a sound pickup device, multi-device sound pickup selection can be realized.
  • FIG. 8 shows a multi-device-based speech processing scenario under the interference of internal noise.
  • in this multi-device scenario (referred to as multi-device scenario 13), the mobile phone 101c, the tablet computer 102c and the smart TV 103c are interconnected through a wireless network and deployed at positions 0.3 m, 1.5 m and 3.0 m away from the user, respectively.
  • the smart TV 103c is in the state of playing audio, and the smart TV 103c has internal noise reduction (ie, noise reduction capability) or AEC capability.
  • the volume of the audio played by the smart TV 103c is 60-80 dB, which will strongly interfere with the sound pickup effect of the mobile phone 101c and the tablet computer 102c. Therefore, in this scenario, the flow of voice processing based on multi-devices is performed mainly considering the influence of the internal noise of the smart TV 103c on the sound pickup effect of the electronic device.
  • FIG. 9 is a flow chart of a specific method for collaboratively processing speech in the multi-device scenario shown in FIG. 8 .
  • the process of the method for the mobile phone 101c, the tablet computer 102c, and the smart TV 103c to cooperatively process voice includes:
  • Steps 901-905 are similar to the above-mentioned steps 401-405, and the similarities are not repeated.
  • the execution subject changes.
  • the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a, and smart TV 103a to mobile phone 101c, tablet computer 102c, and smart TV 103c.
  • the answering device obtained by the cooperative answering election in step 903 is the smart TV 103c
  • the sound pickup device obtained by the cooperative voice pickup in step 904 is also the smart TV 103c, that is, the sound pickup device and the answering device are the same.
  • the difference is that the influence of internal noise on the sound pickup effect of the electronic devices is considered. Specifically, following the second type of cooperative sound pickup strategy in the above embodiment, the smart TV 103c, which has a relatively high voice signal-to-noise ratio and noise reduction capability, is elected as the sound pickup device.
  • this is because the smart TV 103c has the internal noise reduction capability or its AEC is in effect, while the mobile phone 101c and the tablet computer 102c do not have the internal noise reduction capability, have an internal noise reduction capability lower than that of the smart TV 103c, do not have the AEC capability, or have AEC that is not in effect.
  • it can be understood that an electronic device with SE processing capability, such as an electronic device with internal noise reduction capability or AEC capability, can use the noise reduction information of the audio it plays externally to eliminate the influence of the internal noise (that is, that audio) on its sound pickup effect, so that the picked-up voice data has better quality.
  • the answering device can also query which device is the internal noise device that is playing audio externally, so that the internal noise device can share its noise reduction information.
  • Step 906 The smart TV 103c queries, among the mobile phone 101c, the tablet computer 102c and the smart TV 103c, which device is playing audio externally, finds that it is the smart TV 103c itself, and the smart TV 103c, as the internal noise device, provides noise reduction information.
  • the smart TV 103c determines the electronic device that is playing audio by querying the speaker occupancy status of each device or the status of its audio/video software (such as whether the audio/video software is open and the volume of the electronic device). For example, if the smart TV 103c finds that its own speaker is in the occupied state, its volume is high (such as more than 60% of the maximum volume), or its audio/video software is open, it determines that the smart TV 103c itself is playing audio and will share noise reduction information.
  • the mobile phone 101c and the tablet computer 102c can report to the smart TV 103c through the wireless network information about whether they are in an external audio state, such as reporting information on speaker occupancy status, volume, and/or audio/video software status.
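  • The internal-noise query described above can be sketched as each device reporting a few status flags and the answering device applying a threshold rule; the field names and the use of the 60% volume example as a threshold are assumptions for illustration.
```python
# Sketch: decide which reported devices are playing audio externally.

VOLUME_THRESHOLD = 0.6  # "more than 60% of the maximum volume"

def is_playing_audio(report: dict) -> bool:
    return (report.get("speaker_occupied", False)
            or report.get("volume_ratio", 0.0) > VOLUME_THRESHOLD
            or report.get("av_software_open", False))

def find_internal_noise_devices(reports: dict) -> list:
    return [device for device, report in reports.items() if is_playing_audio(report)]

if __name__ == "__main__":
    reports = {
        "phone_101c": {"speaker_occupied": False, "volume_ratio": 0.2, "av_software_open": False},
        "tv_103c": {"speaker_occupied": True, "volume_ratio": 0.8, "av_software_open": True},
    }
    print(find_internal_noise_devices(reports))  # -> ['tv_103c']
```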
  • the smart TV 103c continues to play audio through the speaker while picking up the voice data corresponding to the voice command spoken by the user.
  • the smart TV 103c can perform noise reduction processing on the voice data corresponding to the subsequently picked up voice command after inquiring that it is an internal noise device.
  • Step 907 The smart TV 103c performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information.
  • Step 908 The smart TV 103c identifies the second voice data after noise reduction processing.
  • Step 909 the smart TV 103c responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
  • steps 908 and 909 are similar to the above steps 406 to 408, except that the voice data recognized by the answering device (ie, the smart TV 103c) is the voice data subjected to noise reduction processing through the noise reduction information.
  • the difference is that steps 906 and 907 are newly added, that is, the smart TV 103c, as the answering device, queries for the internal noise device; specifically, the smart TV 103c, which is playing audio, serves as the internal noise device to provide noise reduction information.
  • the noise reduction information supports the sound pickup device to perform noise reduction processing on the speech data corresponding to the subsequently picked up speech. Obviously, the sound pickup device and the internal noise device are the same in this scene.
  • it can be understood that an electronic device with internal noise reduction capability (that is, noise reduction capability) or with AEC in effect can introduce the audio data of the externally played audio into the noise reduction process, and alleviate the interference caused by the internal noise generated by playing that audio itself.
  • the above-mentioned internal noise device obtains noise reduction information based on the audio data of the external audio, such as the audio data itself (that is, the internal noise information), or, the voice activity detection (Voice Activity Detection, VAD) information corresponding to the audio (or called mute suppression information).
  • in this way, the electronic device (such as the smart TV 103c) can provide noise reduction information for the externally played audio and perform noise reduction processing on the internal noise using that information, so as to eliminate the influence of the internal noise on other voice data (such as the picked-up voice data of the user) and improve the quality of the picked-up voice data.
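  • The source does not specify the noise reduction algorithm, so the sketch below uses a generic textbook NLMS adaptive filter as a stand-in: the audio being played (the internal noise information) serves as the reference signal whose estimated leakage is subtracted from the microphone capture.
```python
# Sketch: cancel internally generated noise using the played audio as a reference
# via a basic NLMS adaptive filter (a stand-in for the unspecified processing).
import numpy as np

def nlms_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 64, mu: float = 0.5) -> np.ndarray:
    """Subtract the estimated leakage of `ref` from `mic`; `ref` must be at least as long as `mic`."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    padded_ref = np.concatenate([np.zeros(taps - 1), ref.astype(float)])
    for n in range(len(mic)):
        x = padded_ref[n:n + taps][::-1]       # most recent reference samples first
        echo_estimate = w @ x
        error = float(mic[n]) - echo_estimate  # enhanced output sample
        w += mu * error * x / (x @ x + 1e-8)   # normalized LMS weight update
        out[n] = error
    return out
```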
  • for example, if the smart TV 103c directly recognizes the second voice data it picks up, then, due to the influence of the externally played audio, the second voice data may be recognized as "How will the day be in Beijing tomorrow?", that is, the user's actual voice command "How will the weather in Beijing be tomorrow?" is not accurately recognized.
  • in contrast, the smart TV 103c can eliminate the influence of the externally played audio through the internal noise reduction information, so that the quality of the second voice data after noise reduction processing is higher and the accurate recognition result "What's the weather like in Beijing tomorrow?" can be obtained subsequently.
  • Step 910 is similar to the above-mentioned step 409 and will not be repeated here.
  • in other embodiments, the answering device can also obtain the noise reduction information of the externally played audio from the internal noise device, obtain the speech to be recognized that is directly picked up by the sound pickup device (that is, speech in which the internal noise of the external audio has not been eliminated), and then perform noise reduction processing on the acquired speech to be recognized according to the acquired noise reduction information.
  • in this way, the noise reduction processing can alleviate the influence of the internal noise of the electronic device that plays audio in the multi-device scenario on the sound pickup effect of multi-device cooperative sound pickup, thereby helping to ensure the speech recognition accuracy of the voice assistant. Further, the user experience in the speech recognition process is improved, and the environmental robustness of speech recognition in multi-device scenarios is improved.
  • when there is internal noise in the multi-device scenario, in order to avoid the influence of the internal noise on the sound pickup effect of cooperative sound pickup, the electronic device that plays audio externally can not only be used as the sound pickup device itself, but can also share the noise reduction information of the internal noise with another electronic device that serves as the sound pickup device, so that the sound pickup device can eliminate, across devices, the influence of the internal noise on the sound pickup effect according to the noise reduction information.
  • FIG. 10 shows another multi-device-based speech processing scenario under the interference of internal noise.
  • in this multi-device scenario (referred to as multi-device scenario 14), the mobile phone 101d and the tablet computer 102d are interconnected through a wireless network and deployed at positions 0.3 m and 0.6 m away from the user, respectively.
  • the mobile phone 101d is held by the user, and the tablet computer 102d is left idle on the desktop.
  • the tablet computer 102d is in the state of external audio playback, and has internal noise reduction (ie, noise reduction capability) or AEC capability. Therefore, in this scenario, the influence of the internal noise of the tablet computer 102d on the sound pickup effect of cooperative sound pickup in the multi-device scenario can be mainly considered.
  • FIG. 11 is a flowchart of a specific method for collaboratively processing speech in the multi-device scenario shown in FIG. 10 , including:
  • Steps 1101 to 1102 are similar to the above-mentioned steps 401 to 402, and the similarities are not repeated.
  • the difference is that the electronic devices interconnected through the wireless network in the multi-device scenario 14 are changed from a mobile phone 101c, a tablet computer 102c, and a smart TV 103c to a mobile phone 101d and a tablet computer 102d.
  • Step 1103 The mobile phone 101d and the tablet computer 102d elect the mobile phone 101d as the answering device and the sound pickup device.
  • Step 1104 The mobile phone 101d picks up the second voice data corresponding to the voice command spoken by the user.
  • the above steps 1103 to 1104 are similar to the above steps 403 to 404. The difference is that, after the answering device is obtained through the cooperative response election in step 1103, it can be directly decided that the answering device is the sound pickup device, without performing the sound pickup election strategy steps in the above embodiment to elect a sound pickup device; that is, the answering device and the sound pickup device are the same device, such as the mobile phone 101d.
  • Step 1105 The mobile phone 101d queries, among the mobile phone 101d and the tablet computer 102d, which device is playing audio externally, finds that it is the tablet computer 102d, and the tablet computer 102d, as the internal noise device, shares its noise reduction information.
  • step 1105 is similar to step 906, the difference is that in step 1105 the answering device finds out that the internal noise device (tablet computer 102d) sharing noise reduction information is different from the sound pickup device (mobile phone 101d). Therefore, in this embodiment, step 1106 is added to realize that the internal noise device shares the noise reduction information with the sound pickup device (ie, the mobile phone 101d).
  • specifically, after the mobile phone 101d, as the answering device, finds out that the tablet computer 102d is the internal noise device, it can send a noise reduction instruction to the tablet computer 102d, so that the tablet computer 102d shares noise reduction information with the sound pickup device (that is, the mobile phone 101d) according to the noise reduction instruction.
  • Step 1106 The tablet computer 102d sends the noise reduction information of the tablet computer 102d to the mobile phone 101d.
  • the tablet computer 102d can send the noise reduction information of the tablet computer 102d to the mobile phone 101d through the wireless network with the mobile phone 101d.
  • Step 1107 The mobile phone 101d performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information of the tablet computer 102d.
  • Step 1108 The mobile phone 101d identifies the second voice data after noise reduction processing.
  • Step 1109 The mobile phone 101d responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
  • the above steps 1107 to 1109 are similar to the above steps 907 to 909. The difference is that, in step 1107, the sound pickup device (that is, the mobile phone 101d) uses the noise reduction information of another device (that is, the tablet computer 102d) to perform noise reduction processing on the voice data corresponding to the voice it has picked up, realizing cross-device noise reduction processing.
  • for example, if the mobile phone 101d directly picks up the second voice data corresponding to the voice command, the quality of the second voice data is poor due to the influence of the audio played externally by the tablet computer 102d, so that the second voice data may be recognized as "How will it be in Beijing tomorrow?", which differs from the user's actual voice command "How will the weather in Beijing be tomorrow?". That is, because the quality of the second voice data picked up by the mobile phone 101d is poor, the subsequent recognition result of the second voice data is inaccurate.
  • in contrast, since the mobile phone 101d can perform noise reduction processing on the picked-up second voice data using the noise reduction information shared by the tablet computer 102d, the influence of the audio played by the tablet computer 102d on the sound pickup effect of the mobile phone 101d is eliminated, the quality of the second voice data after noise reduction processing is higher, and the second voice data is subsequently accurately recognized as "How is the weather in Beijing tomorrow?".
  • based on the multi-device voice processing method provided by the embodiments of the present application, having other devices assist the sound pickup device in performing noise reduction processing during the sound pickup process can effectively aggregate the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants, and further improve the accuracy of speech recognition in multi-device scenarios.
  • in the above embodiments, the sound pickup device elected from the multiple devices may have one or more favorable factors, such as being closest to the user, being farthest from the external noise source, and having the capability of reducing internal noise.
  • the impact of the deployment location of electronic devices, internal noise interference or external noise interference on the voice pickup effect of the voice assistant and the accuracy of speech recognition in the multi-device scenario can be alleviated, and the user interaction experience and the environment of speech recognition in multi-device scenarios can be improved. robustness.
  • FIG. 12 shows a schematic structural diagram of the electronic device 100 .
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2 , mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone jack 170D, sensor module 180, buttons 190, motor 191, indicator 192, camera 193, display screen 194, and Subscriber identification module (subscriber identification module, SIM) card interface 195 and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or less components than shown, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the processor 110 may be used to detect whether the electronic device 100 has picked up voice data corresponding to a wake-up word or voice command spoken by the user, and to obtain the sound information of the voice data, device status information, microphone module information, and the like.
  • the processor 110 can also perform actions such as the above-mentioned answering device election, sound pickup device election, or internal noise device inquiry according to the information of each electronic device (such as the sound pickup election information or the response election information).
  • the NPU is a neural-network (NN) computing processor.
  • intelligent cognition applications of the electronic device 100, such as image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
  • the NPU can support the electronic device 100 to recognize the voice data obtained by picking up the voice through the voice assistant.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100 .
  • the electronic device 100 may also adopt an interface connection manner different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and convert it into electromagnetic waves to be radiated through the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
  • the wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2 .
  • the above-mentioned antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, and other modules can be used to support the electronic device 100 in sending the sound information of voice data, device status information, and the like to other electronic devices in a multi-device scenario.
  • the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • the display screen 194 may be used to support the electronic device 100 to display a response interface in response to a user's voice command, and the response interface may include response text and other information.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100 .
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
  • an external memory card may be used to support the electronic device 100 to store the above-mentioned pick-up election information, answering election information, noise reduction information, and the like.
  • the internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data (such as audio data, phone book, etc.) created during the use of the electronic device 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the electronic device 100 may implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.
  • the audio module 170 is used for converting digital audio information into an analog audio signal output, and also for converting an analog audio input into a digital audio signal, such as converting the user's voice received by the electronic device 100 into a digital audio signal (ie, the voice data corresponding to the user's voice), or converting the audio generated by the voice assistant through TTS into the response voice. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can play music through the speaker 170A, play a hands-free call, or play the response voice corresponding to the user's voice command based on the voice assistant, such as the response voice "I'm here" to the wake-up word, or the response voice "It will be sunny in Beijing tomorrow" to the voice command "How is the weather in Beijing tomorrow?".
  • the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be listened to by placing the receiver 170B close to the human ear.
  • the microphone (ie, microphone module) 170C, also known as a "mic" or "sound transmitter", is used to convert sound signals into electrical signals, such as converting the wake-up word or voice command spoken by the user into an electrical signal (ie, the corresponding voice data).
  • the user can make a sound with the mouth close to the microphone 170C, so as to input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementation methods.
  • Embodiments of the present application may be implemented as a computer program or program code executing on a programmable system including at least one processor, a storage system (including volatile and nonvolatile memory and/or storage elements) , at least one input device, and at least one output device.
  • Program code may be applied to input instructions to perform the functions described herein and to generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), microcontroller, application specific integrated circuit (ASIC), or microprocessor.
  • the program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • the program code may also be implemented in assembly or machine language, if desired.
  • the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • the disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (eg, computer-readable) storage media, which can be read and executed by one or more processors.
  • the instructions may be distributed over a network or over other computer-readable media.
  • a machine-readable medium can include any mechanism for storing or transmitting information in a form readable by a machine (eg, a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet by means of electrical, optical, acoustical, or other forms of propagated signals (eg, carrier waves, infrared signals, digital signals, etc.).
  • machine-readable media includes any type of machine-readable media suitable for storing or transmitting electronic instructions or information in a form readable by a machine (eg, a computer).
  • each unit/module mentioned in each device embodiment of this application is a logical unit/module.
  • a logical unit/module may be a physical unit/module, or may be a part of a physical unit/module,
  • or may be implemented by a combination of multiple physical units/modules.
  • the physical implementation of these logical units/modules is not what matters most; the combination of the functions implemented by these logical units/modules is the key to solving the technical problems raised in this application.
  • the above-mentioned device embodiments of the present application do not introduce units/modules that are not closely related to solving the technical problems raised in the present application, but this does not mean that other units/modules do not exist in the above-mentioned device embodiments.
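The embodiments above do not fix a particular algorithm for removing the internally played audio from the picked-up signal. As a purely illustrative sketch (not the patented method itself), the following Python snippet assumes that the noise reduction information shared by the playing device is simply the raw reference samples of the audio it is playing, and removes that audio from the microphone signal of the pickup device with a normalized least-mean-squares (NLMS) adaptive filter, a standard building block of acoustic echo cancellation; all function and variable names are hypothetical.

```python
import numpy as np

def nlms_cancel(mic: np.ndarray, ref: np.ndarray,
                taps: int = 256, mu: float = 0.1, eps: float = 1e-6) -> np.ndarray:
    """Remove the shared reference audio (internal noise) from the mic signal.

    mic: signal picked up by the sound pickup device (user voice + played audio).
    ref: reference samples shared by the device that is playing the audio.
    Returns the residual signal, i.e. the mic signal with the reference removed.
    """
    assert len(ref) >= len(mic), "reference must cover the picked-up segment"
    w = np.zeros(taps)                        # adaptive estimate of the acoustic path
    out = np.zeros(len(mic))
    ref_pad = np.concatenate([np.zeros(taps - 1), ref])
    for n in range(len(mic)):
        x = ref_pad[n:n + taps][::-1]         # newest reference sample first
        y = w @ x                             # predicted leakage of the played audio
        e = mic[n] - y                        # residual: mostly the user's voice
        w += (mu / (eps + x @ x)) * e * x     # NLMS weight update
        out[n] = e
    return out
```

If voice activation detection (VAD) information is shared alongside the audio data, it could, for example, be used to skip adaptation on frames in which the played audio is silent; the disclosure leaves such details open.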

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

A multi-device voice processing method, a medium, an electronic device, and a system, capable of alleviating the influence that the internal noise of an electronic device playing audio externally in a multi-device scenario has on the sound pickup effect of a voice assistant, helping to ensure the accuracy of voice recognition of the voice assistant, and improving the environmental robustness of voice recognition in the multi-device scenario. The method comprises: a first electronic device among multiple electronic devices performs sound pickup to obtain a voice to be recognized; the first electronic device receives, from a second electronic device among the multiple electronic devices that is playing audio externally, audio information related to the audio played externally by the second electronic device; and the first electronic device performs, according to the received audio information, noise reduction processing on the voice to be recognized that is obtained by sound pickup.

Description

Multi-device-based voice processing method, medium, electronic device and system
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 11, 2020, with application number 202010955837.7 and entitled "Multi-device-based voice processing method, medium, electronic device and system", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to speech processing technology in the field of artificial intelligence, and in particular, to a multi-device-based voice processing method, medium, electronic device, and system.
Background
A voice assistant is an application (APP) built on artificial intelligence (AI). Smart devices such as mobile phones receive and recognize voice commands spoken by the user through the voice assistant, providing the user with voice control functions such as interactive dialogue, information query, and device control. With the widespread adoption of smart devices equipped with voice assistants, there are usually multiple devices with a voice assistant installed in the user's environment (such as the user's home). In this multi-device scenario, if several of the devices share the same wake-up word, then after the user speaks the wake-up word, the voice assistants of all the devices with that wake-up word will be woken up, and all of them will recognize and respond to the voice command the user subsequently speaks.
In the prior art, in a multi-device scenario, multiple devices can cooperate to select, from the multiple devices sharing the same wake-up word, the device closest to the user to wake up its voice assistant, so that this device picks up, recognizes, and responds to the user's voice command. However, if there is strong external noise near the selected device, or if the device has poor sound pickup capability, the accuracy of the selected device's recognition result for the voice command in the above automatic speech recognition process is low, and consequently the operation indicated by the voice command cannot be performed accurately.
Summary of the Invention
Embodiments of the present application provide a multi-device-based voice processing method, medium, electronic device, and system. The sound pickup device elected from the multiple devices may have one or more favorable factors, such as being closest to the user, being farthest from an external noise source, and having internal noise reduction capability, so that the influence of the deployment location of the electronic devices, internal noise interference, or external noise interference on the sound pickup effect of the voice assistant and on the accuracy of speech recognition in multi-device scenarios can be alleviated, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
第一方面,本申请实施例提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第一电子设备拾音得到第一待识别语音;第一电子设备从多个电子设备中外放音频的第二电子设备接收与第二电子设备外放的音频相关的音频信息;第一电子设备根据接收的音频信息对拾音得到的第一待识别语音进行降噪处理得到第二待识别语音。可以理解,用于拾音的电子设备(即第一电子设备)为下文中的拾音设备,如从该多设备中选取出的拾音效果较好的电子设备。上述外放音频的电子设备(即第二电子设备)即为该多设备中的内部噪声设备,第二电子设备外放的音频的音频信息即为下文中描述的内部噪声设备的降噪信息。具体地,第一电子设备通过第二电子设备外放音频的音频信息对拾音得到的第一待识别语音进行降噪处理得到第二待识别语音,可以缓解多设备场景中正在外放音频的电子设备的内部噪声对语音助手的拾音效果的影响,保证语音助手基于多设备的拾音效果,进而有利于保证语音助手的语音识别准确率,并提升了多设备场景中语音识别的环境鲁棒性。In a first aspect, an embodiment of the present application provides a multi-device-based voice processing method, the method includes: a first electronic device in a plurality of electronic devices picks up a voice to obtain a first to-be-recognized voice; The second electronic device that broadcasts audio in the electronic device receives audio information related to the audio broadcast by the second electronic device; Second, the voice to be recognized. It can be understood that the electronic device used for sound pickup (ie, the first electronic device) is the sound pickup device hereinafter, such as an electronic device with better sound pickup effect selected from the multiple devices. The above-mentioned electronic device that broadcasts audio (ie, the second electronic device) is the internal noise device in the multi-device, and the audio information of the external audio from the second electronic device is the noise reduction information of the internal noise device described below. Specifically, the first electronic device performs noise reduction processing on the first to-be-recognized voice obtained by picking up the sound by using the audio information of the audio played by the second electronic device to obtain the second to-be-recognized voice, which can alleviate the problem of audio being played out in a multi-device scenario. The influence of the internal noise of electronic equipment on the voice pickup effect of the voice assistant ensures the voice pickup effect of the voice assistant based on multiple devices, which in turn helps to ensure the voice recognition accuracy of the voice assistant, and improves the environmental robustness of voice recognition in multi-device scenarios. Awesome.
在上述第一方面的一种可能的实现中,上述音频信息包括以下至少一项:外放音频的音频数据,该音频对应的话音激活检测VAD信息。可以理解,该音频的音频信息可以反映该音频本身,通过该音频信 息对外放的该音频产生的内部噪声进行降噪处理,可以消除该内部噪声对其他语音数据(如用户拾取的语音数据,如第二待识别语音对应的语音数据)的影响,以提升拾取的语音数据的质量。In a possible implementation of the above-mentioned first aspect, the above-mentioned audio information includes at least one of the following: audio data of the externally played audio, and VAD information corresponding to the voice activation detection of the audio. It can be understood that the audio information of the audio can reflect the audio itself, and by performing noise reduction processing on the internal noise generated by the external audio, the internal noise can be eliminated to other voice data (such as the voice data picked up by the user, such as The influence of the voice data corresponding to the voice to be recognized) to improve the quality of the picked-up voice data.
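The form of the voice activation detection (VAD) information mentioned above is left open by the embodiments. As one hedged illustration only, the playing device could compute frame-level activity flags for its own playback signal with a simple short-time-energy detector such as the sketch below; the frame size and threshold are arbitrary assumptions, and the audio is assumed to be float samples in [-1, 1].

```python
import numpy as np

def frame_vad(audio: np.ndarray, frame_len: int = 320,
              threshold_db: float = -40.0) -> list[bool]:
    """Frame-level activity flags for the audio a device is playing out loud.

    A frame is flagged active when its RMS level exceeds `threshold_db`
    relative to full scale; these flags could be shared as the VAD part
    of the audio information.
    """
    flags = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2))) + 1e-12
        flags.append(20.0 * np.log10(rms) > threshold_db)
    return flags
```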
在上述第一方面的一种可能的实现中,上述方法还包括:第一电子设备向多个电子设备中用于识别语音的第三电子设备发送第二待识别语音;或者,第一电子设备对第二待识别语音进行识别。其中,用于识别语音的电子设备(即第三电子设备)可以为下文中的应答设备。可以理解,在本申请实施例的多设备场景中,用于识别语音的电子设备与用于拾音的电子设备可以相同或不同,即可以由第三电子设备将第一电子设备(或第一电子设备的麦克风模组)作为外设拾取用户的语音指令,如此可以有效聚合多个配备麦克风模组和语音助手的电子设备的外设资源。In a possible implementation of the above-mentioned first aspect, the above-mentioned method further includes: the first electronic device sends the second to-be-recognized speech to a third electronic device used for recognizing speech among the plurality of electronic devices; or, the first electronic device Recognizing the second speech to be recognized. Wherein, the electronic device for recognizing voice (ie, the third electronic device) may be the answering device hereinafter. It can be understood that in the multi-device scenario of this embodiment of the present application, the electronic device used for recognizing voice and the electronic device used for picking up sounds may be the same or different, that is, the first electronic device (or the first electronic device (or the first electronic device) may be The microphone module of the electronic device) is used as a peripheral device to pick up the user's voice command, so that the peripheral resources of multiple electronic devices equipped with a microphone module and a voice assistant can be effectively aggregated.
在上述第一方面的一种可能的实现中,在多个电子设备中的第一电子设备拾音得到第一待识别语音之前,上述方法还包括:第一电子设备向第三电子设备发送第一电子设备的拾音选举信息,其中第一电子设备的拾音选举信息用于表示第一电子设备的拾音情况;第一电子设备为第三电子设备基于获取的多个电子设备的拾音选举信息从多个电子设备中选举出的用于拾音的电子设备。例如,在本申请实施例的多设备场景中,用户说出一个语音指令之后,用户无需专门操作某个电子设备拾取待识别语音指令(如下文中的第二语音数据对应的语音指令),而是由应答设备(即第三电子设备)自动将拾音设备(即第二电子设备)作为外设拾取用户的语音指令,进而通过应答设备对用户的语音指令的响应实现语音控制功能。In a possible implementation of the above-mentioned first aspect, before the first electronic device among the plurality of electronic devices picks up the voice to obtain the first to-be-recognized voice, the above-mentioned method further includes: the first electronic device sends the first electronic device to the third electronic device. Pickup election information of an electronic device, wherein the pickup election information of the first electronic device is used to indicate the sound pickup situation of the first electronic device; the first electronic device is the third electronic device based on the acquired sound pickup of multiple electronic devices The election information is an electronic device selected from a plurality of electronic devices for pickup. For example, in the multi-device scenario of the embodiment of the present application, after the user speaks a voice command, the user does not need to specifically operate an electronic device to pick up the voice command to be recognized (such as the voice command corresponding to the second voice data below), but instead The answering device (ie the third electronic device) automatically uses the pickup device (ie the second electronic device) as a peripheral to pick up the user's voice command, and then implements the voice control function through the response device's response to the user's voice command.
在上述第一方面的一种可能的实现中,上述方法还包括:第一电子设备接收第三电子设备发送的拾音指令(即下文中的拾音指示),其中,该拾音指令用于指示第一电子设备拾音并向第三电子设备发送降噪处理后的待识别语音。如此,在拾音指令的指示下,第一电子设备可以获知其需要向第三电子设备发送拾音得到的待识别语音(如上述第二待识别语音),而不会对待识别语音进行识别等后续处理。In a possible implementation of the above-mentioned first aspect, the above-mentioned method further includes: the first electronic device receives a sound pickup instruction (that is, the sound pickup instruction hereinafter) sent by the third electronic device, wherein the sound pickup instruction is used for Instruct the first electronic device to pick up sound and send the noise-reduced voice to be recognized to the third electronic device. In this way, under the instruction of the pickup instruction, the first electronic device can know that it needs to send the to-be-recognized voice (such as the second to-be-recognized voice) obtained by the pickup to the third electronic device, but will not recognize the to-be-recognized voice, etc. Subsequent processing.
在上述第一方面的一种可能的实现中,上述拾音选举信息包括以下至少一项:回声消除AEC能力信息,麦克风模组信息,设备状态信息,拾音得到的对应唤醒词的语音信息,拾音得到的对应语音指令的语音信息;其中,该语音指令为拾音得到唤醒词之后拾音得到的;该设备状态信息包括以下至少一项:网络连接状态信息、耳机连接状态信息、麦克风占用状态信息、情景模式信息。可以理解,拾音选举信息中的不同信息,表示影响电子设备拾音效果的不同因素,如此,本申请实施例可以综合考虑对电子设备拾音效果的不同因素来选举出拾音设备,如选举出拾音效果最好的电子设备用于拾音,即作为多设备中的拾音设备。In a possible implementation of the above-mentioned first aspect, the above-mentioned voice pickup election information includes at least one of the following: echo cancellation AEC capability information, microphone module information, device status information, and voice information corresponding to the wake-up word obtained by voice pickup, The voice information corresponding to the voice command obtained by picking up the voice; wherein, the voice command is obtained by picking up the voice after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupation Status information, profile information. It can be understood that the different information in the sound-picking election information represents different factors that affect the sound-picking effect of the electronic equipment. In this way, the embodiment of the present application can comprehensively consider the different factors of the sound-picking effect of the electronic equipment to elect the sound-picking equipment, such as election. The electronic device with the best sound pickup effect is used for sound pickup, that is, as a pickup device in a multi-device.
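As an illustration of how the pickup election information listed above might be packaged when it is sent to the electing device, the following sketch groups the fields into a small message. The field names, types, and JSON encoding are assumptions made for illustration only and are not part of the disclosed method.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class DeviceStatus:
    network_ok: bool           # network connection state
    headset_connected: bool    # wired / Bluetooth headset state
    mic_in_use: bool           # microphone occupancy
    profile_mode: str          # e.g. "normal", "airplane", "driving"

@dataclass
class PickupElectionInfo:
    device_id: str
    aec_effective: bool        # echo cancellation (AEC) capability / state
    mic_count: int             # microphone module information
    mic_noise_reduction: bool  # whether the mic module supports noise reduction
    status: DeviceStatus
    wakeword_snr_db: float     # quality of the picked-up wake-up word
    command_snr_db: Optional[float] = None  # quality of the later voice command, if any

def to_message(info: PickupElectionInfo) -> bytes:
    """Serialize the election information for sending to the electing device."""
    return json.dumps(asdict(info)).encode("utf-8")
```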
第二方面,本申请实施例提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第二电子设备外放音频;第二电子设备向多个电子设备中用于拾音的第一电子设备发送与该音频相关的音频信息,其中,该音频信息能够被第一电子设备用于对第一电子设备拾音得到的待识别音频进行降噪处理。具体地,由于正在外放音频的电子设备可以提供该音频的音频信息,使得用于拾音的第一电子设备根据通过该音频信息对拾音得到的第一待识别语音进行降噪处理,实现消除该音频产生的内部噪声对拾音的影响,以提升第一电子设备的拾音效果,即提高拾音得到的语音数据(即第二待识别语音的语音数据)的质量。从而,可以缓解多设备场景中正在外放音频的电子设备的内部噪声对语音助手的拾音效果的影响,保证语音助手基于多设备的拾音效果,进而有利于保证语音助手的语音识别准确率,并提升了多设备场景中语音识别的环境鲁棒性。In a second aspect, an embodiment of the present application provides a multi-device-based voice processing method, the method includes: a second electronic device in the plurality of electronic devices plays audio; the second electronic device is used in the plurality of electronic devices for The first electronic device that picks up the sound sends audio information related to the audio, where the audio information can be used by the first electronic device to perform noise reduction processing on the to-be-identified audio that is picked up by the first electronic device. Specifically, since the electronic device that is playing the audio can provide the audio information of the audio, the first electronic device for picking up the sound performs noise reduction processing on the first to-be-recognized voice obtained by the sound pickup according to the audio information, so as to realize The influence of the internal noise generated by the audio on the sound pickup is eliminated to improve the sound pickup effect of the first electronic device, that is, to improve the quality of the voice data obtained by the sound pickup (ie, the voice data of the second to-be-recognized voice). Therefore, the influence of the internal noise of the electronic device that is playing audio externally on the sound pickup effect of the voice assistant in the multi-device scenario can be alleviated, and the sound pickup effect of the voice assistant based on multiple devices can be ensured, thereby helping to ensure the voice recognition accuracy of the voice assistant. , and improve the environmental robustness of speech recognition in multi-device scenarios.
在上述第二方面的一种可能的实现中,上述音频信息包括以下至少一项:该音频的音频数据,该音频对应的话音激活检测VAD信息。In a possible implementation of the second aspect, the audio information includes at least one of the following: audio data of the audio, and voice activation detection VAD information corresponding to the audio.
在上述第二方面的一种可能的实现中,上述方法还包括:第二电子设备从多个电子设备中用于识 别语音的第三电子设备接收共享指令(即下文中的降噪指示);或者第二电子设备从第一电子设备接收共享指令;其中,共享指令用于指示第二电子设备向第一电子设备发送上述音频信息。可以理解,发送共享指令的电子设备(如第一电子设备或第三电子设备),可以监测第二电子设备是否正在外放音频,在第二电子设备外放音频时,才向第二电子设备发送共享指令。In a possible implementation of the above-mentioned second aspect, the above-mentioned method further includes: the second electronic device receives a shared instruction (ie, the noise reduction instruction hereinafter) from a third electronic device used for recognizing voice among the plurality of electronic devices; Or the second electronic device receives a sharing instruction from the first electronic device; wherein, the sharing instruction is used to instruct the second electronic device to send the audio information to the first electronic device. It can be understood that the electronic device (such as the first electronic device or the third electronic device) that sends the sharing instruction can monitor whether the second electronic device is playing audio, and only transmit the audio to the second electronic device when the second electronic device is playing the audio. Send a share command.
在上述第二方面的一种可能的实现中,上述在第二电子设备向多个电子设备中用于拾音的第一电子设备发送与外放的音频相关的音频信息之前,方法还包括:第二电子设备向第三电子设备发送第二电子设备的拾音选举信息,其中第二电子设备的拾音选举信息用于表示第二电子设备的拾音情况;第一电子设备为第三电子设备基于获取的多个电子设备的拾音选举信息从多个电子设备中选举出的用于拾音的电子设备。例如,第三电子设备作为下文中的应答设备可以选举出拾取语音指令的音频质量最好(即拾音最好的电子设备)的电子设备作为拾音设备(如第一电子设备),以支持应答设备通过语音助手完成与用户的语音交互流程,例如拾音设备可以为距离用户最近且SE处理能力较优的电子设备。如此,可以有效聚合多个配备麦克风模组和语音助手的电子设备的外设资源,缓解多设备场景中由于电子设备部署位置对语音助手识别准确率的影响,提升了用户交互体验和多设备场景中语音识别的环境鲁棒性。In a possible implementation of the above-mentioned second aspect, before the above-mentioned second electronic device sends the audio information related to the externally played audio to the first electronic device used for sound pickup among the plurality of electronic devices, the method further includes: The second electronic device sends the voice-picking election information of the second electronic device to the third electronic device, wherein the voice-picking election information of the second electronic device is used to indicate the voice-picking situation of the second electronic device; the first electronic device is the third electronic device The device elects an electronic device for sound pickup from the plurality of electronic devices based on the acquired sound pickup election information of the plurality of electronic devices. For example, the third electronic device as the answering device hereinafter may elect the electronic device with the best audio quality for picking up the voice command (that is, the electronic device with the best sound pickup) as the sound pickup device (such as the first electronic device) to support The answering device completes the voice interaction process with the user through the voice assistant. For example, the pickup device may be an electronic device that is closest to the user and has better SE processing capability. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, the impact of the deployment location of electronic devices on the recognition accuracy of voice assistants in multi-device scenarios is alleviated, and the user interaction experience and multi-device scenarios are improved. Environmental Robustness for Speech Recognition in China.
第三方面,本申请实施例提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第三电子设备监测到多个电子设备中存在正在外放音频的第二电子设备;在第二电子设备与第三电子设备不同的情况下,第三电子设备向第二电子设备发送共享指令,其中共享指令用于指示第二电子设备向多个设备中用于拾音的第一电子设备发送与第二电子设备外放的音频相关的音频信息;在第二电子设备与第三电子设备相同的情况下,第三电子设备向第一电子设备发送该音频信息;其中,该音频信息能够被第一电子设备用于对第一电子设备拾音得到的第一待识别语音进行降噪处理得到第二待识别语音。具体地,由于在第三电子设备的指示下正在外放音频的第二电子设备可以提供该音频的音频信息,使得用于拾音的第一电子设备根据该音频信息对拾音得到的第一待识别语音进行降噪处理,实现消除该音频产生的内部噪声对拾音的影响,以提升第一电子设备的拾音效果,即提高拾音得到的语音数据(即第二待识别语音的语音数据)的质量。从而,可以缓解多设备场景中正在外放音频的电子设备的内部噪声对语音助手的拾音效果的影响,保证语音助手基于多设备的拾音效果,进而有利于保证语音助手的语音识别准确率,并提升了多设备场景中语音识别的环境鲁棒性。In a third aspect, an embodiment of the present application provides a multi-device-based voice processing method, the method comprising: a third electronic device in the plurality of electronic devices detects that there is a second electronic device in the plurality of electronic devices that is playing audio device; in the case that the second electronic device is different from the third electronic device, the third electronic device sends a sharing instruction to the second electronic device, wherein the sharing instruction is used to instruct the second electronic device to send a voice pickup device to the plurality of devices. The first electronic device sends audio information related to the audio played by the second electronic device; if the second electronic device is the same as the third electronic device, the third electronic device sends the audio information to the first electronic device; wherein, The audio information can be used by the first electronic device to perform noise reduction processing on the first to-be-recognized voice picked up by the first electronic device to obtain the second to-be-recognized voice. Specifically, because the second electronic device that is playing audio externally under the instruction of the third electronic device can provide the audio information of the audio, the first electronic device used for picking up the sound can obtain the first sound picked up by the audio information according to the audio information. Noise reduction processing is performed on the voice to be recognized, so as to eliminate the influence of the internal noise generated by the audio on the pickup, so as to improve the pickup effect of the first electronic device, that is, to improve the voice data obtained by pickup (ie, the voice of the second to-be-recognized voice). data) quality. Therefore, the influence of the internal noise of the electronic device that is playing audio externally on the sound pickup effect of the voice assistant in the multi-device scenario can be alleviated, and the sound pickup effect of the voice assistant based on multiple devices can be ensured, thereby helping to ensure the voice recognition accuracy of the voice assistant. , and improve the environmental robustness of speech recognition in multi-device scenarios.
在上述第三方面的一种可能的实现中,上述音频信息包括以下至少一项:该音频的音频数据,该音频对应的话音激活检测VAD信息。In a possible implementation of the third aspect, the audio information includes at least one of the following: audio data of the audio, and voice activation detection VAD information corresponding to the audio.
在上述第三方面的一种可能的实现中,第一电子设备与第三电子设备不同,并且上述方法还包括:第三电子设备从第一电子设备获取由第一电子设备拾音得到的第二待识别语音;第一电子设备对第二待识别语音进行识别。进而,有利于提升语音控制过程中语音识别的准确性,并提升用户体验。如此,即使多设备场景中选举出的应答设备(如距离用户最近的第三电子设备)拾音效果较差,或存在正在外放音频的电子设备产生的噪声,多个设备也可以协同拾取并识别音频质量较好的语音数据,而无需用户移动位置或手动控制特定的电子设备拾音。In a possible implementation of the above third aspect, the first electronic device is different from the third electronic device, and the above method further includes: the third electronic device obtains, from the first electronic device, the first electronic device obtained by picking up the sound of the first electronic device. Second, the voice to be recognized; the first electronic device recognizes the second voice to be recognized. Further, it is beneficial to improve the accuracy of speech recognition during the speech control process, and to improve user experience. In this way, even if the selected answering device in the multi-device scenario (such as the third electronic device closest to the user) has poor sound pickup effect, or there is noise generated by the electronic device that is playing audio, multiple devices can cooperate to pick up and Recognize voice data with better audio quality without requiring the user to move location or manually control the pickup of specific electronic devices.
在上述第三方面的一种可能的实现中,在第三电子设备向第二电子设备发送共享指令之前,上述方法还包括:第三电子设备获取多个电子设备的拾音选举信息,其中该多个电子设备的拾音选举信息用于表示该多个电子设备的拾音情况;第三电子设备基于该多个设备的拾音选举信息,从该多个电子设备中选举出至少一个电子设备作为第一电子设备。如此,可以有效聚合多个配备麦克风模组和语音助手的电子设备的外设资源,缓解多设备场景中由于电子设备部署位置、内部噪声干扰、外部噪声干扰等多种 因素对语音助手识别准确率的影响,提升了用户交互体验和多设备场景中语音识别的环境鲁棒性。In a possible implementation of the above third aspect, before the third electronic device sends the sharing instruction to the second electronic device, the above method further includes: the third electronic device acquires voice selection information of multiple electronic devices, wherein the The sound pickup election information of the plurality of electronic devices is used to indicate the sound pickup situation of the plurality of electronic devices; the third electronic device elects at least one electronic device from the plurality of electronic devices based on the sound pickup election information of the plurality of devices as the first electronic device. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, so as to alleviate the recognition accuracy of voice assistants due to various factors such as the deployment location of electronic devices, internal noise interference, and external noise interference in multi-device scenarios. The impact of this improves the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
在上述第三方面的一种可能的实现中,上述方法还包括:第三电子设备向第一电子设备发送拾音指令,其中,该拾音指令用于指示第一电子设备拾音并向第三电子设备发送拾音得到的第二待识别语音。可以理解,在上述拾音指令的指示下,使得第一电子设备可以获知需要向第三电子设备发送拾音得到的待识别语音,而不会对待识别语音进行识别等后续处理。In a possible implementation of the above third aspect, the above method further includes: the third electronic device sends a sound pickup instruction to the first electronic device, wherein the sound pickup instruction is used to instruct the first electronic device to pick up sound and send it to the first electronic device. The third electronic device sends the second voice to be recognized obtained by picking up the voice. It can be understood that, under the instruction of the above voice pickup instruction, the first electronic device can know the to-be-recognized voice that needs to be sent to the third electronic device by picking up the voice, without performing subsequent processing such as recognizing the to-be-recognized voice.
在上述第三方面的一种可能的实现中,上述拾音选举信息包括以下至少一项:回声消除AEC能力信息,麦克风模组信息,设备状态信息,拾音得到的对应唤醒词的语音信息,拾音得到的对应语音指令的语音信息;其中,该语音指令为拾音得到唤醒词之后拾音得到的;该设备状态信息包括以下至少一项:网络连接状态信息、耳机连接状态信息、麦克风占用状态信息、情景模式信息。In a possible implementation of the above-mentioned third aspect, the above-mentioned voice pickup election information includes at least one of the following: echo cancellation AEC capability information, microphone module information, device status information, and voice information corresponding to the wake-up word obtained by voice pickup, The voice information corresponding to the voice command obtained by picking up the voice; wherein, the voice command is obtained by picking up the voice after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupation Status information, profile information.
在上述第三方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:在第三电子设备处于预设网络状态的情况下,则第三电子设备将第三电子设备确定为第一电子设备;在第三电子设备已连接耳机的情况下,则第三电子设备将第三电子设备确定为第一电子设备;第三电子设备将多个电子设备中处于预设情景模式的电子设备中的至少一个确定为第一电子设备。可以理解,如果电子设备处于不利于电子设备拾音的设备状态,如电子设备网络连接状态较差、已连接有线或无线耳机、麦克风已经被占用或处于飞行模式,说明该电子设备的拾音效果难以保证,或者该电子设备不能正常与其他设备协同拾音,如不能正常将拾音得到的语音数据发送给其他电子设备。如此,按照上述拾音设备的选择步骤可以选取出拾音效果较好的拾音设备(如上述第一电子设备)。In a possible implementation of the above-mentioned third aspect, the above-mentioned third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to the headset, the third electronic device determines the third electronic device as the first electronic device; The third electronic device determines the third electronic device as the first electronic device; the third electronic device determines at least one of the electronic devices in the preset scene mode among the plurality of electronic devices as the first electronic device. It can be understood that if the electronic device is in a device state that is not conducive to the sound pickup of the electronic device, such as the electronic device has a poor network connection status, a wired or wireless headset is connected, the microphone is already occupied, or is in flight mode, it means the sound pickup effect of the electronic device. It is difficult to guarantee, or the electronic device cannot normally cooperate with other devices to pick up sounds, for example, it cannot normally send the voice data obtained by picking up sounds to other electronic devices. In this way, a sound pickup device (such as the above-mentioned first electronic device) with better sound pickup effect can be selected according to the above-mentioned selection steps of the sound pickup device.
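Reusing the PickupElectionInfo sketch above, one possible way to apply the device-state checks described in the preceding paragraph is to filter out candidates whose reported state is unfavorable for cooperative sound pickup before running the election. This is only one illustrative policy following the explanatory remarks above (the claim language also covers cases where the electing device falls back to itself); the concrete rules are assumptions.

```python
UNFAVOURABLE_MODES = {"airplane", "subway", "driving", "travel"}  # assumed preset profile modes

def eligible_pickup_candidates(candidates: list) -> list:
    """Drop devices whose reported state is unfavourable for cooperative pickup."""
    eligible = []
    for c in candidates:                     # c: PickupElectionInfo from the sketch above
        s = c.status
        if not s.network_ok:                 # poor network: cannot reliably send picked-up audio
            continue
        if s.headset_connected:              # audio is routed to a headset
            continue
        if s.mic_in_use:                     # microphone already occupied
            continue
        if s.profile_mode in UNFAVOURABLE_MODES:
            continue
        eligible.append(c)
    return eligible
```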
在上述第三方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:第三电子设备将多个电子设备中AEC生效的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中降噪能力大于满足预定降噪条件的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与用户之间的距离小于第一预定距离的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与外部噪声源之间的距离大于第二预定距离的电子设备中的至少一个作为第一电子设备。例如,预定降噪条件表示电子设备SE处理效果较好,如AEC生效或具备内部降噪能力;第一预定距离(如0.5m)说明电子设备距离用户较近;第二预定距离(如3m)说明电子设备距离用户较远。可以理解,通常来说距离用户越近的电子设备的拾音效果越好,距离外部噪声较远的电子设备拾音效果较好;麦克风模组的降噪性能较好或AEC生效的电子设备,说明电子设备的SE处理效果越好,即该电子设备的拾音效果越好。因此,综合考虑这些因素可以从多个设备中选举出拾音效果较好的拾音设备(即上述第一电子设备)。In a possible implementation of the above-mentioned third aspect, the above-mentioned third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the plurality of electronic devices as the first electronic device; At least one of the electronic devices is used as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance from the multiple electronic devices to the user is less than the first predetermined distance as the first electronic device; the third electronic device At least one of the plurality of electronic devices whose distance from the external noise source is greater than the second predetermined distance is used as the first electronic device. For example, the predetermined noise reduction condition indicates that the electronic device SE has a better processing effect, such as AEC is effective or has internal noise reduction capability; the first predetermined distance (eg 0.5m) indicates that the electronic device is relatively close to the user; the second predetermined distance (eg 3m) Indicates that the electronic device is far away from the user. It can be understood that, generally speaking, the sound pickup effect of electronic equipment closer to the user is better, and the sound pickup effect of electronic equipment farther away from external noise is better; the noise reduction performance of the microphone module is better or the electronic equipment with AEC effective, It shows that the better the SE processing effect of the electronic device is, that is, the better the sound pickup effect of the electronic device is. Therefore, considering these factors comprehensively, a sound pickup device (ie, the above-mentioned first electronic device) with better sound pickup effect can be selected from multiple devices.
在上述第三方面的一种可能的实现中,预设网络状态包括下列至少一项:网络通信速率小于或等于预定速率的网络,网络电线频次大于或等于预定频次;预设情景模式包括下列至少一项:地铁模式、飞行模式、驾驶模式、旅行模式。其中,若网络通信速率小于或等于预定速率的网络,网络电线频次大于或等于预定频次,则说明电子设备的网络通信速率较差,预定速率和预定频次具体取值可以根据实际需求确定。可以理解,预设网络状态下的电子设备通常不适于参与拾音设备的选举或作为拾音设备(如用于拾音的第一电子设备)。In a possible implementation of the above third aspect, the preset network state includes at least one of the following: a network with a network communication rate less than or equal to a predetermined rate, and a network wire frequency greater than or equal to a predetermined frequency; the preset scene mode includes at least one of the following One: Subway Mode, Airplane Mode, Driving Mode, Travel Mode. Among them, if the network communication rate is less than or equal to the predetermined rate, and the frequency of the network wire is greater than or equal to the predetermined frequency, it means that the network communication rate of the electronic device is poor, and the specific values of the predetermined rate and the predetermined frequency can be determined according to actual needs. It can be understood that the electronic device in the preset network state is generally not suitable for participating in the election of a sound pickup device or as a sound pickup device (eg, the first electronic device for sound pickup).
在上述第三方面的一种可能的实现中,第三电子设备采用神经网络算法或决策树算法从多个电子设备中选举出第一电子设备。可以理解,多个设备的拾音选举信息可以作为神经网络算法或决策树算法的输入,并基于神经网络算法或决策树算法输出决策第一电子设备为拾音设备的结果。In a possible implementation of the above third aspect, the third electronic device uses a neural network algorithm or a decision tree algorithm to select the first electronic device from multiple electronic devices. It can be understood that the pickup election information of multiple devices can be used as the input of the neural network algorithm or the decision tree algorithm, and the result of deciding that the first electronic device is the pickup device is output based on the neural network algorithm or the decision tree algorithm.
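The embodiments name a neural network or decision tree algorithm for the election but give no further details. As an assumed stand-in that uses the same kinds of factors (AEC state, noise reduction capability, distance to the user, distance to external noise sources, measured wake-word quality), a simple hand-written score can rank the eligible candidates; the weights and field names below are illustrative only.

```python
def elect_pickup_device(candidates: list[dict]) -> str:
    """Return the device_id of the candidate with the best pickup score.

    Each candidate dict is assumed to carry: 'device_id', 'aec_effective',
    'noise_reduction', 'distance_to_user_m', 'distance_to_noise_m' and
    'wakeword_snr_db'.
    """
    def score(c: dict) -> float:
        s = c["wakeword_snr_db"]                       # measured pickup quality
        s += 5.0 if c["aec_effective"] else 0.0        # AEC in effect
        s += 3.0 if c["noise_reduction"] else 0.0      # internal noise reduction capability
        s -= 4.0 * c["distance_to_user_m"]             # closer to the user is better
        s += 1.0 * min(c["distance_to_noise_m"], 5.0)  # farther from external noise is better
        return s

    return max(candidates, key=score)["device_id"]
```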
第四方面,本申请提供了一种基于多设备的语音处理方法,该方法包括:多个电子设备中的第三电子设备获取多个电子设备的拾音选举信息,其中拾音选举信息用于表示多个电子设备的拾音情况;第三电子设备基于多个设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为用于拾音的第一电子设备,其中第一电子设备与第三电子设备相同或者不同;第三电子设备从第一电子设备获取由第一电子设备拾音得到的待识别语音;第三电子设备对获取的待识别语音进行识别。从而,即使多设备场景中选举出的第三电子设备(如距离用户最近的电子设备)拾音效果较差,多个设备也可以协同拾取并识别音频质量较好的语音数据,而无需用户移动位置或手动控制特定的电子设备拾音。进而,有利于提升语音控制过程中语音识别的准确性,并提升用户体验。并且,可以缓解多设备场景中由于电子设备部署位置、外部噪声干扰等多种因素对语音助手拾音效果,以及语音识别准确率的影响,提升了用户交互体验和多设备场景中语音识别的环境鲁棒性。In a fourth aspect, the application provides a multi-device-based voice processing method, the method comprising: a third electronic device in the plurality of electronic devices obtains voice-picking election information of the plurality of electronic devices, wherein the voice-picking election information is used for Indicates the sound pickup situation of the plurality of electronic devices; the third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device for sound pickup based on the sound pickup election information of the plurality of devices, wherein the first electronic device is used for sound pickup. The electronic device is the same as or different from the third electronic device; the third electronic device acquires the voice to be recognized obtained by the first electronic device from the first electronic device; and the third electronic device recognizes the acquired voice to be recognized. Therefore, even if the selected third electronic device (such as the electronic device closest to the user) in the multi-device scenario has poor sound pickup effect, multiple devices can collaboratively pick up and recognize voice data with better audio quality without the need for the user to move. position or manually control the pickup of specific electronics. Further, it is beneficial to improve the accuracy of speech recognition during the speech control process, and to improve user experience. In addition, it can alleviate the influence of various factors such as the deployment location of electronic devices, external noise interference and other factors on the voice pickup effect of the voice assistant and the accuracy of speech recognition in multi-device scenarios, improving user interaction experience and the environment for speech recognition in multi-device scenarios. robustness.
在上述第四方面的一种可能的实现中,上述拾音选举信息包括以下至少一项:回声消除AEC能力信息,麦克风模组信息,设备状态信息,拾音得到的对应唤醒词的语音信息,拾音得到的对应语音指令的语音信息;其中,该语音指令为拾音得到唤醒词之后拾音得到的;该设备状态信息包括以下至少一项:网络连接状态信息、耳机连接状态信息、麦克风占用状态信息、情景模式信息。In a possible implementation of the above-mentioned fourth aspect, the above-mentioned voice-collecting election information includes at least one of the following: echo cancellation AEC capability information, microphone module information, device status information, and voice information corresponding to the wake-up word obtained by voice-collecting, The voice information corresponding to the voice command obtained by picking up the voice; wherein, the voice command is obtained by picking up the voice after the wake-up word is obtained; the device status information includes at least one of the following: network connection status information, headset connection status information, microphone occupation Status information, profile information.
在上述第四方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:在第三电子设备处于预设网络状态的情况下,则第三电子设备将第三电子设备确定为第一电子设备;在第三电子设备已连接耳机的情况下,则第三电子设备将第三电子设备确定为第一电子设备;第三电子设备将多个电子设备中处于预设情景模式的电子设备中的至少一个确定为第一电子设备。In a possible implementation of the fourth aspect, the third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of: when the third electronic device is in the preset network state, the third electronic device determines the third electronic device as the first electronic device; when the third electronic device is connected to the headset, the third electronic device determines the third electronic device as the first electronic device; The third electronic device determines the third electronic device as the first electronic device; the third electronic device determines at least one of the electronic devices in the preset scene mode among the plurality of electronic devices as the first electronic device.
在上述第四方面的一种可能的实现中,上述第三电子设备基于多个电子设备的拾音选举信息,从多个电子设备中选举出至少一个电子设备作为第一电子设备,包括下列中的至少一项:第三电子设备将上述多个电子设备中AEC生效的电子设备中的至少一个作为第一电子设备;第三电子设备将上述多个电子设备中降噪能力大于满足预定降噪条件的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与用户之间的距离小于第一预定距离的电子设备中的至少一个作为第一电子设备;第三电子设备将多个电子设备中与外部噪声源之间的距离大于第二预定距离的电子设备中的至少一个作为第一电子设备。In a possible implementation of the fourth aspect, the third electronic device elects at least one electronic device from the plurality of electronic devices as the first electronic device based on the voice pickup election information of the plurality of electronic devices, including the following: At least one of the following: the third electronic device uses at least one of the electronic devices whose AEC is in effect among the above-mentioned multiple electronic devices as the first electronic device; At least one of the electronic devices of the condition is used as the first electronic device; the third electronic device uses at least one of the electronic devices whose distance from the plurality of electronic devices to the user is less than the first predetermined distance as the first electronic device; the third electronic device The electronic device uses at least one of the plurality of electronic devices whose distance from the external noise source is greater than the second predetermined distance as the first electronic device.
在上述第四方面的一种可能的实现中,上述预设网络状态包括下列至少一项:网络通信速率小于或等于预定速率的网络,网络电线频次大于或等于预定频次;预设情景模式包括下列至少一项:地铁模式、飞行模式、驾驶模式、旅行模式。In a possible implementation of the fourth aspect, the preset network state includes at least one of the following: a network with a network communication rate less than or equal to a predetermined rate, and a network wire frequency greater than or equal to a predetermined frequency; the preset scene mode includes the following At least one of: Subway Mode, Airplane Mode, Driving Mode, Travel Mode.
在上述第四方面的一种可能的实现中,第三电子设备采用神经网络算法或决策树算法从多个电子设备中选举出第一电子设备。In a possible implementation of the above fourth aspect, the third electronic device uses a neural network algorithm or a decision tree algorithm to select the first electronic device from multiple electronic devices.
在上述第四方面的一种可能的实现中,上述方法还包括:第三电子设备监测到多个电子设备中存在正在外放音频的第二电子设备;第三电子设备向第二电子设备发送共享指令,其中共享指令用于指示第二电子设备向第一电子设备发送第二电子设备外放的音频相关的音频信息,其中该音频信息能够被第一电子设备用于对第一电子设备拾音得到的待识别音频进行降噪处理。In a possible implementation of the above fourth aspect, the above method further includes: the third electronic device detects that there is a second electronic device that is playing audio externally among the plurality of electronic devices; the third electronic device sends a message to the second electronic device. A sharing instruction, wherein the sharing instruction is used to instruct the second electronic device to send to the first electronic device audio information related to the audio played by the second electronic device, wherein the audio information can be used by the first electronic device to pick up the first electronic device. The to-be-identified audio obtained from the sound is subjected to noise reduction processing.
在上述第四方面的一种可能的实现中,第三电子设备与第一电子设备不同,并且方法还包括:第三电子设备外放音频;第三电子设备向第一电子设备发送第三电子设备外放音频相关的音频信息,其中该音频信息能够被第一电子设备用于对第一电子设备拾音得到的待识别音频进行降噪处理。In a possible implementation of the above-mentioned fourth aspect, the third electronic device is different from the first electronic device, and the method further includes: the third electronic device plays external audio; the third electronic device sends the third electronic device to the first electronic device The device broadcasts audio-related audio information, wherein the audio information can be used by the first electronic device to perform noise reduction processing on to-be-identified audio obtained by the first electronic device.
在上述第四方面的一种可能的实现中,上述音频信息包括以下至少一项:外放音频的音频数据, 该音频对应的话音激活检测VAD信息。In a possible implementation of the fourth aspect, the audio information includes at least one of the following: audio data of the external audio, and voice activation detection VAD information corresponding to the audio.
In a fifth aspect, the present application provides an apparatus. The apparatus is included in an electronic device and has the functionality to implement the behavior of the electronic device in the above aspects and their possible implementations. The functionality may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the above functionality, for example a sound pickup unit or module (such as a microphone or microphone array), a receiving unit or module (such as a transceiver), and a noise reduction module or unit (such as a processor providing the function of that module or unit). For example, the sound pickup unit or module is configured to support a first electronic device among a plurality of electronic devices in picking up sound to obtain first to-be-recognized speech; the receiving unit or module (such as a transceiver) is configured to support the first electronic device in receiving, from a second electronic device that is playing audio among the plurality of electronic devices, audio information related to the audio played by the second electronic device; and the noise reduction module or unit is configured to support the first electronic device in performing noise reduction on the picked-up first to-be-recognized speech according to the audio information received by the receiving unit or module, to obtain second to-be-recognized speech.
In a sixth aspect, the present application provides a readable medium storing instructions that, when executed on an electronic device, cause the electronic device to perform the multi-device-based voice processing method of any of the first to fourth aspects.
In a seventh aspect, the present application provides an electronic device including one or more processors and one or more memories. The one or more memories store one or more programs that, when executed by the one or more processors, cause the electronic device to perform the multi-device-based voice processing method of any of the first to fourth aspects. In a possible implementation, the electronic device may further include a transceiver (which may be a separate or an integrated receiver and transmitter) for receiving and sending signals or data.
In an eighth aspect, the present application provides an electronic device including a processor, a memory, a communication interface, and a communication bus. The memory is configured to store at least one instruction; the at least one processor, the memory, and the communication interface are connected through the communication bus; and the at least one processor executes the at least one instruction stored in the memory to cause the electronic device to perform the multi-device-based voice processing method of any of the first to fourth aspects.
Description of the drawings
FIG. 1 is a schematic diagram of a multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a voice assistant interaction session according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a multi-device-based voice processing method according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another multi-device-based voice processing scenario according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of another multi-device-based voice processing method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an electronic device according to some embodiments of the present application.
Detailed description
Illustrative embodiments of the present application include, but are not limited to, multi-device-based voice processing methods, media, and electronic devices. The multi-device scenarios to which the multi-device-based voice processing provided by the embodiments of the present application applies are described in detail below with reference to the accompanying drawings.
FIG. 1 shows a multi-device scenario to which the multi-device-based voice processing provided by an embodiment of the present application applies. As shown in FIG. 1, for ease of description, the multi-device scenario 10 shows only three electronic devices, namely electronic device 101, electronic device 102, and electronic device 103. It can be understood, however, that a multi-device scenario to which the technical solution of the present application applies may include any number of electronic devices, not limited to three.
Specifically, still referring to FIG. 1, after the user speaks the wake-up word, an answering device may be elected from among the plurality of electronic devices; for example, electronic device 101 is elected as the answering device. The answering device then elects, from among the plurality of devices, the pickup device with the best sound pickup effect (for example, the electronic device with the best speech enhancement effect); for example, electronic device 101 elects electronic device 103 as the pickup device. After the pickup device (such as electronic device 103) picks up the voice data corresponding to the user's voice command, the answering device (such as electronic device 101) receives, recognizes, and responds to that voice data, so that the voice data processed by the answering device is of better quality. In addition, if an internal-noise device that is playing audio exists near the pickup device in this scenario, the voice data picked up by the pickup device can be denoised according to the noise reduction information of that internal-noise device, further improving the quality of the voice data processed by the answering device. Thus, even if the answering device elected in the multi-device scenario (such as the electronic device closest to the user) has a poor pickup effect, or there is noise produced by an electronic device that is playing audio, the multiple devices can cooperatively pick up and recognize voice data of good audio quality without requiring the user to move or to manually control a specific electronic device to pick up sound. This helps improve the accuracy of speech recognition during voice control and improves the user experience.
In some embodiments, the electronic devices 101-103 in the multi-device scenario 10 are interconnected through a wireless network, for example Wi-Fi (Wireless Fidelity), Bluetooth (BT), or Near Field Communication (NFC), but are not limited thereto. As an example, to enable interconnection through a wireless network, the electronic devices 101-103 satisfy at least one of the following conditions (a minimal check of which is sketched after the list):
1) They are connected to the same wireless access point (such as the same Wi-Fi access point);
2) They are logged in to the same account;
3) They are set in the same device group; for example, each device in the group holds the identification information of every device in the group, so that the devices in the group can communicate with one another according to the respective identification information.
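The trivial sketch below only illustrates that any one of the three conditions suffices; the record fields and their names are hypothetical and not defined by the application.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class DeviceRecord:
    device_id: str
    wifi_ap_bssid: str = ""          # access point currently joined, if any
    account_id: str = ""             # logged-in account, if any
    group_members: Set[str] = field(default_factory=set)  # IDs configured in the same device group

def can_interconnect(a: DeviceRecord, b: DeviceRecord) -> bool:
    """True if the two devices satisfy at least one of conditions 1)-3) above."""
    same_ap = bool(a.wifi_ap_bssid) and a.wifi_ap_bssid == b.wifi_ap_bssid
    same_account = bool(a.account_id) and a.account_id == b.account_id
    same_group = b.device_id in a.group_members and a.device_id in b.group_members
    return same_ap or same_account or same_group
```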
It can be understood that the different electronic devices may transmit information over the interconnecting wireless network in a broadcast manner or in a point-to-point manner, but are not limited thereto.
According to some embodiments of the present application, the types of wireless network between different electronic devices in the multi-device scenario may be the same or different. For example, electronic device 101 and electronic device 102 are connected through a Wi-Fi network, while electronic devices 101 and 103 are connected through Bluetooth.
In the embodiments of the present application, the types of the electronic devices in the multi-device scenario may be the same or different. For example, electronic devices suitable for the present application may include, but are not limited to, mobile phones, tablet computers, desktop, laptop, and handheld computers, notebook computers, ultra-mobile personal computers (UMPC), netbooks, cellular phones, personal digital assistants (PDA), augmented reality (AR)/virtual reality (VR) devices, media players, smart TVs, smart speakers, smart watches, smart earphones, and the like. As an example, the electronic devices 101-103 shown in FIG. 1 are of different types, illustrated respectively as a mobile phone, a tablet computer, and a smart TV. In addition, the embodiments of the present application place no particular limitation on the specific form of the electronic device. For the specific structure of the electronic device, reference may be made to the description corresponding to FIG. 12 below, which is not repeated here.
It can be understood that, in some embodiments of the present application, the electronic devices in the multi-device scenario all have a voice control function; for example, each has a voice assistant installed with the same wake-up word, such as "Xiaoyi Xiaoyi". Moreover, the electronic devices in the multi-device scenario are all within the effective working range of the voice assistant; for example, the distance from the user (that is, the pickup distance) is less than or equal to a preset distance (such as 5 m), the screen is in use (for example, placed face up, or the screen cover is not closed), Bluetooth is not turned off, the Bluetooth communication range is not exceeded, and so on, but this is not limited thereto.
It can be understood that a voice assistant is an application (app) built on artificial intelligence that, with the help of speech and semantic recognition algorithms, helps the user complete operations such as information queries, device control, and text input through instant question-and-answer voice interaction with the user. The voice assistant may be a system application in the electronic device, or a third-party application.
As shown in FIG. 2, a voice assistant usually adopts staged cascade processing, implementing the above functions through, in sequence, voice wake-up, speech enhancement (SE, also called speech front-end processing), automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), text-to-speech (TTS), and response output. For example, after the user speaks the wake-up word "Xiaoyi Xiaoyi" to wake up the voice assistant and then speaks the voice command "What will the weather be like in Beijing tomorrow?" or "Play music", the voice command passes through SE, ASR, NLU, DM, NLG, TTS, and so on, triggering the electronic device to output a response to the voice command.
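To make the staged cascade concrete, the following minimal Python sketch chains placeholder stages in the order described above. Every stage body is an illustrative stand-in rather than the application's actual algorithm; the sketch only shows how each stage's output feeds the next.

```python
# Minimal sketch of the staged cascade described above (SE -> ASR -> NLU -> DM -> NLG -> TTS).
# All stages below are placeholders; a real system would call actual speech/NLP models.

def speech_enhancement(raw_audio: bytes) -> bytes:
    return raw_audio  # denoising, AEC, beamforming, etc. would happen here

def asr(audio: bytes) -> str:
    return "what will the weather be like in beijing tomorrow"  # placeholder transcript

def nlu(text: str) -> dict:
    return {"intent": "weather_query", "slots": {"city": "Beijing", "date": "tomorrow"}}

def dialog_management(semantics: dict) -> dict:
    return {"action": "answer_weather", "city": semantics["slots"]["city"]}

def nlg(action: dict) -> str:
    return f"Tomorrow will be sunny in {action['city']}."

def tts(text: str) -> bytes:
    return text.encode()  # placeholder "synthesized audio"

def handle_voice_command(raw_audio: bytes) -> bytes:
    """Run one utterance through the cascade and return audio for the response output stage."""
    enhanced = speech_enhancement(raw_audio)
    transcript = asr(enhanced)
    semantics = nlu(transcript)
    action = dialog_management(semantics)
    reply_text = nlg(action)
    return tts(reply_text)
```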
It can be understood that the voice data picked up by an electronic device in the present application is voice data collected directly through a microphone, or voice data that has undergone SE processing after collection, and is used as input to ASR. The text processing result that ASR outputs for the voice data is the basis on which the voice assistant accurately completes subsequent operations such as recognizing and responding to the voice data. Therefore, the quality of the voice data obtained by the voice assistant's sound pickup and fed into ASR affects the accuracy with which the voice assistant recognizes and responds to that voice data.
To address the problem that the sound pickup effect of an electronic device is easily affected by various factors, and thereby obtain better-quality picked-up voice data, the embodiments of the present application take multiple factors into account when performing multi-device-based voice processing in a multi-device scenario. In general, the factors affecting the sound pickup effect of an electronic device include environmental factors 1)-3) and device factors 4)-6), as follows (a simplified scoring sketch follows the list):
1) The distance or orientation of the electronic device relative to the user, that is, the deployment position of the electronic device. In general, the closer the electronic device is to the user, the better its pickup effect.
2) Whether there is external noise near the electronic device, such as an air-conditioning fan or unrelated human voices. It can be understood that the noise around an electronic device is any sound other than the voice command spoken by the user. In general, an electronic device farther away from external noise has a better pickup effect.
3) Whether there is internal noise in the electronic device, such as internal noise represented by audio that the electronic device plays through its speaker. Generally speaking, the internal noise of one electronic device may become external noise for other electronic devices and affect their pickup effect.
4) Information about the microphone module of the electronic device, such as whether the microphone module is a single microphone or a microphone array, whether it is a near-field or far-field microphone array, and the cutoff frequency of the microphone module. Usually a microphone array picks up sound better than a single microphone, a far-field microphone array picks up sound better than a near-field microphone array when the user is far from the device, and the higher the cutoff frequency of the microphone module, the better the pickup effect.
5) The SE capability of the electronic device, for example the noise reduction performance of its microphone module, and the AEC capability of the electronic device, for example whether AEC is in effect on the device. In general, a device whose microphone module has better noise reduction performance, or on which AEC is in effect, has a better SE processing effect, that is, a better pickup effect. For example, a microphone array has better noise reduction performance than a single microphone.
6) The device state of the electronic device, such as one or more of the network connection state, earphone connection state, microphone occupancy state, and profile (scenario mode) information. For example, if the electronic device is in a device state unfavorable to sound pickup, for example its network connection is poor, wired or wireless earphones are connected, the microphone is already occupied, or it is in airplane mode, its pickup effect is hard to guarantee, or it cannot normally cooperate with other devices to pick up sound, for example it cannot normally send picked-up voice data to other electronic devices.
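As a rough, non-authoritative illustration of how such factors might be combined, the sketch below scores each device with hand-picked weights. The field names, weights, and thresholds are assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass

@dataclass
class DeviceFactors:
    snr_db: float             # factors 1)-2): higher SNR suggests closer to the user, less external noise
    playing_audio: bool       # factor 3): device is producing internal noise through its speaker
    far_field_array: bool     # factor 4): microphone module information
    aec_active: bool          # factor 5): SE/AEC capability in effect
    mic_available: bool       # factor 6): device state (microphone free, network usable, ...)

def pickup_score(f: DeviceFactors) -> float:
    """Heuristic pickup-suitability score; weights are illustrative only."""
    if not f.mic_available:
        return float("-inf")          # unfavorable device state rules the device out
    score = f.snr_db                  # environmental factors dominate
    if f.far_field_array:
        score += 5.0
    if f.aec_active:
        score += 3.0
    elif f.playing_audio:
        score -= 8.0                  # internal noise without echo cancellation hurts pickup
    return score
```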
With respect to the different influencing factors above, FIG. 3 to FIG. 11 present multiple embodiments of cooperative voice processing among multiple electronic devices.
Embodiment 1
FIG. 3 shows a scenario of cooperative voice processing among multiple electronic devices at different deployment positions. As shown in FIG. 3, in this multi-device scenario (denoted multi-device scenario 11), the mobile phone 101a, the tablet computer 102a, and the smart TV 103a are interconnected through a wireless network and deployed at different distances from the user, for example at 0.3 meters (m), 1.5 m, and 3.0 m from the user, respectively. In this case, the mobile phone 101a is held by the user, the tablet computer 102a is placed on a desktop, and the smart TV 103a is wall-mounted.
In this multi-device scenario 11, it is assumed that the multiple electronic devices are in a low-noise environment with ambient noise of 20 decibels (dB) or less, and that there is no internal noise produced by an electronic device playing audio in the scenario. Therefore, the influence of external and internal noise on the pickup effect of the electronic devices can be ignored, and the main consideration is the influence of the deployment positions of the electronic devices, such as which electronic device is closest to the user, on the multi-device-based voice processing.
FIG. 4 is a flowchart of a specific cooperative voice processing method in the scenario shown in FIG. 3. As shown in FIG. 4, the process by which the mobile phone 101a, the tablet computer 102a, and the smart TV 103a cooperatively process voice includes:
Step 401: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a each pick up first voice data corresponding to the wake-up word spoken by the user.
For example, the wake-up word pre-registered in the mobile phone 101a, the tablet computer 102a, and the smart TV 103a is "Xiaoyi Xiaoyi". After the user speaks the wake-up word "Xiaoyi Xiaoyi", the mobile phone 101a, the tablet computer 102a, and the smart TV 103a can all detect the speech corresponding to "Xiaoyi Xiaoyi" and then determine whether the corresponding voice assistant needs to be woken up.
It can be understood that if the user speaks within the pickup distance of an electronic device, the electronic device can monitor the corresponding voice data through its microphone and cache it. Specifically, electronic devices such as the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, when no other software or hardware is using the microphone to pick up voice data, can monitor in real time through the microphone whether the user is providing voice input and cache the picked-up voice data, such as the above first voice data.
Step 402: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a each verify the picked-up first voice data to determine whether the corresponding first voice data is the pre-registered wake-up word.
If the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all verify the first voice data successfully, the picked-up first voice data is the wake-up word, and the following step 403 can be performed. If the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all fail to verify the first voice data, the picked-up first voice data is not the wake-up word, and the following step 409 is performed.
In some embodiments, the electronic devices that successfully verify the first voice data corresponding to the wake-up word may be recorded in a list. For example, if the mobile phone 101a, the tablet computer 102a, and the smart TV 103a all verify the first voice data successfully, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a are recorded in a list (referred to, for example, as the candidate answering device list). The devices in this candidate answering device list then participate in the multi-device answering election described below, to elect the electronic device that wakes up its voice assistant and recognizes the user's speech (that is, the answering device below). It can be understood that, in the embodiments of the present application, the multi-device answering election is conducted among the multiple devices that have successfully detected the wake-up word, that is, among the electronic devices that have successfully verified the first voice data.
Step 403: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a elect the smart TV 103a as the answering device.
In some embodiments, the answering device is generally the electronic device the user is accustomed or inclined to use, or the electronic device with the higher probability of successfully recognizing and responding to the user's voice data. Specifically, in a multi-device scenario, the answering device is used to recognize and respond to the user's voice data, for example by performing processing steps such as ASR and NLU on the voice data. There is usually only one answering device in a multi-device scenario, such as one electronic device in the candidate answering device list. In addition, after the electronic device (such as the smart TV 103a) wakes up its voice assistant as the answering device, it may play a wake-up response tone, such as "I'm here". The electronic devices in the multi-device scenario other than the answering device, such as the mobile phone 101a and the tablet computer 102a, do not respond according to the candidate pickup indication, that is, they do not output a wake-up response tone.
The answering device may be selected using various existing techniques, which are also described in detail below.
In some embodiments, the answering device (such as the smart TV 103a) in the multi-device scenario may conduct a cooperative pickup election to elect one pickup device, specifically by performing the following step 404.
Step 404: The smart TV 103a obtains the pickup election information corresponding to the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, respectively, and elects the mobile phone 101a as the pickup device according to the pickup election information.
The pickup election information may be parameters used to determine how good the pickup effect of each electronic device is. For example, in some embodiments, the pickup election information may include at least one of: sound information of the detected user speech (such as the sound information of the above first voice data), microphone module information of each electronic device, device state information of each electronic device, and AEC capability information of each electronic device. In addition, it can be understood that the information used for the pickup device election may also include other information; any information capable of evaluating the pickup function of an electronic device is applicable, and no limitation is imposed here.
The sound information may include the signal-to-noise ratio (SNR), sound intensity (or energy value), reverberation parameters (such as reverberation delay), and the like. The higher the SNR, the higher the sound intensity, and the lower the reverberation delay of the user speech picked up by an electronic device, the better the audio quality of that user speech, that is, the better the pickup effect of the electronic device. Therefore, the sound information of the user speech can be used to elect the pickup device.
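As a rough illustration of how such sound information could be derived on each device, the sketch below estimates frame energy and a crude SNR from a mono waveform. The framing parameters and the way noise frames are chosen (quietest 10% of frames) are assumptions made for illustration, not the application's method.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Split a mono waveform into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def sound_info(x: np.ndarray) -> dict:
    """Estimate average energy and a crude SNR: quietest 10% of frames are treated as noise."""
    frames = frame_signal(x.astype(np.float64))
    energies = (frames ** 2).mean(axis=1) + 1e-12
    noise_floor = np.sort(energies)[: max(1, len(energies) // 10)].mean()
    speech_level = energies.max()
    return {
        "energy": float(energies.mean()),
        "snr_db": float(10.0 * np.log10(speech_level / noise_floor)),
    }
```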
In addition, the microphone module information indicates whether the microphone module of the electronic device is a single microphone or a microphone array, whether it is a near-field or far-field microphone array, and what the cutoff frequency of the microphone module is. Usually, when the user is far from the device, a far-field microphone offers better noise reduction than a near-field microphone, so its pickup effect is better. Noise reduction capability increases in order from a single microphone to a linear-array microphone to a circular-array microphone, and the pickup effect of the corresponding electronic device increases accordingly. Moreover, the higher the cutoff frequency of the microphone module, the better its noise reduction capability and the better the pickup effect of the corresponding electronic device. Therefore, the microphone module information can also be used to elect the pickup device.
The device state information refers to device states that can affect the pickup effect of cooperative pickup by multiple electronic devices, such as the network connection state, earphone connection state, microphone occupancy state, and profile (scenario mode) information. The scenario modes include driving mode, ride mode (such as bus, high-speed rail, or airplane mode), walking mode, sports mode, home mode, and the like. These scenario modes can be determined automatically by the electronic device by reading and analyzing information such as its sensor information, short messages or emails, settings information, or historical operation records, where the sensor information comes from, for example, the Global Positioning System (GPS), an inertial sensor, a camera, or a microphone. It can be understood that if the earphone connection state is occupied, the electronic device is being used by the user, and pickup by the earphone microphone, which is closer to the user, is supported; if the microphone occupancy state indicates that the microphone module is occupied, the electronic device may be unable to pick up sound through the microphone module; and if the network connection state indicates that the electronic device's wireless network is poor, the success rate of transmitting information over the wireless network, such as sending pickup election information to the answering device, is affected. If the scenario mode is one of the above modes such as driving mode or ride mode, the stability and/or rate of the electronic device's wireless network connection may be low, which in turn affects the success rate of the device participating in the pickup election process or the cooperative pickup process. Therefore, the above device state information can also be used to elect the pickup device.
The AEC capability information indicates whether the electronic device has AEC capability and whether AEC is in effect on the device, where the AEC capability is specifically that of the microphone module in the electronic device. It can be understood that, compared with an electronic device on which AEC is not in effect or which has no AEC capability, an electronic device on which AEC is in effect has better SE processing capability and better noise reduction performance, and hence a better pickup effect. Therefore, the AEC capability information can also be used to elect the pickup device. In addition, an electronic device on which AEC is in effect is usually one that is playing audio.
It can be understood that AEC is a speech enhancement technique that cancels, by means of acoustic wave interference, the spurious sound produced by the airborne feedback path between the loudspeaker and the microphone. It effectively alleviates noise interference caused by audio played through the loudspeaker or by spatial reflection of sound waves, thereby improving the quality of the voice data picked up by the electronic device. In addition, SE preprocesses, through hardware or software means, the user voice data collected by the electronic device's microphone using audio signal processing algorithms such as dereverberation, AEC, blind source separation, and beamforming, to improve the quality of the resulting voice data.
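For illustration only, the sketch below shows one textbook way of doing acoustic echo cancellation: a normalized LMS adaptive filter that estimates the loudspeaker-to-microphone path from the playback reference and subtracts the predicted echo. The filter length and step size are arbitrary assumptions, and this is not the application's specific AEC implementation.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 256, mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Subtract an adaptively estimated echo of `ref` (loudspeaker signal) from `mic`."""
    w = np.zeros(taps)                      # estimated echo-path impulse response
    buf = np.zeros(taps)                    # most recent reference samples, newest first
    out = np.zeros_like(mic, dtype=np.float64)
    for n in range(len(mic)):
        buf[1:] = buf[:-1]
        buf[0] = ref[n] if n < len(ref) else 0.0
        echo_est = w @ buf
        e = mic[n] - echo_est               # error = microphone minus predicted echo
        out[n] = e
        w += (mu / (buf @ buf + eps)) * e * buf   # normalized LMS update
    return out
```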
The smart TV 103a can elect the pickup device based on the pickup election information of each electronic device; the specific election scheme is described in detail below. For ease of description, it is assumed below that the smart TV 103a elects the mobile phone 101a as the pickup device.
It can be understood that, in the embodiments of the present application, remote peripheral virtualization may be used to treat the pickup device, or the microphone of the pickup device, as a virtual peripheral node of the answering device, which is invoked by the voice assistant running on the answering device to complete the subsequent cross-device pickup process.
In addition, in some embodiments, after the answering device determines that an electronic device is the pickup device, it may send a pickup indication to that electronic device to instruct it to pick up the user's voice data. Similarly, the answering device may send a stop-pickup indication to the electronic devices in the multi-device scenario other than the pickup device, to instruct them to stop picking up the user's voice data. Alternatively, if an electronic device other than the pickup device receives no indication within a period of time (for example, 5 seconds) after sending its pickup election information to the answering device, that electronic device determines that it is not the pickup device.
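A minimal sketch of this timeout rule is shown below. The asynchronous queue and the message names are hypothetical; the 5-second window follows the example above.

```python
import asyncio

PICKUP_TIMEOUT_S = 5.0  # example window from the description above

async def await_pickup_role(indication_queue: asyncio.Queue) -> bool:
    """Return True if this device was told to act as the pickup device.

    `indication_queue` is a hypothetical queue onto which indications received from
    the answering device ("PICKUP" or "STOP_PICKUP") are pushed.
    """
    try:
        msg = await asyncio.wait_for(indication_queue.get(), timeout=PICKUP_TIMEOUT_S)
    except asyncio.TimeoutError:
        return False          # no indication within the window: not the pickup device
    return msg == "PICKUP"
```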
Step 405: The mobile phone 101a picks up second voice data corresponding to the voice command spoken by the user.
It can be understood that, in subsequent use, the mobile phone 101a acts as the pickup device to pick up the various voice commands spoken by the user. For example, when the user speaks the voice command "What will the weather be like in Beijing tomorrow?", the mobile phone 101a obtains the second voice data by collecting the voice command directly through its microphone module, or the microphone module in the mobile phone 101a collects the voice command and the second voice data is obtained after SE processing.
For ease of description, a "voice command" appearing on its own in the embodiments of the present application may be a voice command, corresponding to a certain event or operation, that is received after the electronic device wakes up its voice assistant. For example, the user's voice command is the above "What will the weather be like tomorrow?" or "Play music". In addition, the terms "voice", "voice command", and "voice data" are sometimes used interchangeably herein; it should be noted that, when their difference is not emphasized, the intended meaning is the same.
Step 406: The mobile phone 101a sends the second voice data to the smart TV 103a.
It can be understood that the mobile phone 101a, as the pickup device, directly forwards the voice data of the voice command to the answering device after picking up the voice command issued by the user, without itself recognizing or responding to the voice command in any way.
In addition, it can be understood that, in other embodiments, if the answering device and the pickup device are the same device, this step is unnecessary, and the answering device (or pickup device) performs speech recognition on the voice data directly after picking up the user's voice command.
Step 407: The smart TV 103a recognizes the second voice data.
Specifically, after the smart TV 103a, as the answering device, receives the voice data picked up by the mobile phone 101a, it can recognize the noise-reduced second voice data through the cascaded processing flow of ASR, NLU, DM, NLG, TTS, and so on.
For example, for the voice command mentioned above, "What will the weather be like in Beijing tomorrow?", ASR can convert the SE-processed second voice data into the corresponding text and normalize, correct, and formalize the colloquial text, for example obtaining the text "What will the weather be like in Beijing tomorrow?".
Step 408: According to the recognition result, the smart TV 103a responds to the user's voice command or controls another electronic device to respond to the user's voice command.
It can be understood that, in the embodiments of the present application, for a recognized user voice command that the answering device can execute, or that only the answering device can execute, the answering device makes a response corresponding to the voice command. For example, for the voice command mentioned above, "What will the weather be like in Beijing tomorrow?", the smart TV 103a answers "Tomorrow will be sunny in Beijing"; for the voice command "Please turn off the TV", the smart TV 103a performs the shutdown function.
It can be understood that the above speech "Tomorrow will be sunny in Beijing" is the response speech output by the answering device through TTS. In addition, the answering device may also control software and hardware such as the system software, the display, and the vibration motor to perform response operations, for example displaying the response text generated by NLG on the display.
For voice commands directed at other electronic devices, the answering device may send them to the corresponding electronic device after recognizing the voice command. For example, for the voice command "Open the curtains", after the smart TV 103a recognizes that the response operation is to open the curtains, it may send an operation instruction to open the curtains to the smart curtain, so that the smart curtain completes the action of opening the curtains through its hardware.
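As a rough sketch of this dispatch step, the snippet below maps a recognized intent to a command sent to a target device. The intent names, device addresses, payload format, and UDP transport are illustrative assumptions rather than anything specified by the application.

```python
import json
import socket

# Illustrative routing table: recognized intent -> (device address, command payload)
INTENT_ROUTES = {
    "open_curtains": ("192.168.1.50", {"device": "smart_curtain", "action": "open"}),
    "turn_off_tv":   ("192.168.1.60", {"device": "smart_tv", "action": "power_off"}),
}

def dispatch_intent(intent: str, port: int = 8888) -> bool:
    """Send the command for a recognized intent to the corresponding IoT device over UDP."""
    route = INTENT_ROUTES.get(intent)
    if route is None:
        return False                      # no route: the answering device handles it locally
    address, payload = route
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(payload).encode(), (address, port))
    return True
```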
It can be understood that the above other electronic devices may be Internet of Things (IoT) devices, such as smart home devices like smart refrigerators, smart water heaters, and smart curtains. In some embodiments, such other electronic devices do not have a voice control function, for example no voice assistant is installed, and they perform the operation corresponding to the user's voice command when triggered by the answering device.
In addition, in the multi-device scenario, after the user speaks the voice command corresponding to the second voice data, the user may continue to speak subsequent voice command data streams, such as the voice command "What should I wear tomorrow?". For the cooperative voice processing of these data streams in the multi-device scenario, reference may be made to the related description of the second voice data above, which is not repeated here.
Step 409: The mobile phone 101a, the tablet computer 102a, and the smart TV 103a do not respond to the first voice data and delete the cached first voice data.
For example, when the mobile phone 101a, the tablet computer 102a, and the smart TV 103a perform step 409, they will not output the wake-up response speech "I'm here" to the user. Of course, if the user goes on to speak a voice command, such as "What will the weather be like in Beijing tomorrow?", these devices will not respond to the voice data corresponding to that voice command either.
It can be understood that if some of the mobile phone 101a, the tablet computer 102a, and the smart TV 103a verify the first voice data successfully while the others fail, only the former continue with the subsequent multi-device cooperative pickup flow. For example, if the mobile phone 101a and the tablet computer 102a verify the first voice data successfully while the smart TV 103a fails, the executing entities of the above step 403 are replaced by the mobile phone 101a and the tablet computer 102a, and the executing entity of step 409 is replaced by the smart TV 103a.
As described above, in the multi-device scenario of the embodiments of the present application, after the user speaks a voice command, the user does not need to specifically operate a particular electronic device to pick up that voice command (such as the voice command corresponding to the second voice data); instead, the answering device automatically uses the pickup device as a peripheral to pick up the user's voice command, and the voice control function is then achieved through the answering device's response to that voice command.
The multi-device-based voice processing method provided by the embodiments of the present application can, through the interaction and cooperation of multiple electronic devices, elect the electronic device that picks up the voice command with the best audio quality as the pickup device, so as to support the answering device in completing the voice interaction flow with the user through its voice assistant; for example, the pickup device may be the electronic device closest to the user with better SE processing capability. In this way, the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants can be effectively aggregated, the impact of device deployment positions on voice assistant recognition accuracy in multi-device scenarios is alleviated, and the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios are improved.
The election of the answering device and the election of the pickup device in the embodiments of the present application are described in detail below.
Election of the answering device
In some embodiments, for the above step 403, the electronic devices in the multi-device scenario may perform the multi-device answering election according to at least one of the following answering election strategies to elect the answering device:
Answering strategy 1) Elect the electronic device closest to the user as the answering device.
For example, for the scenario shown in FIG. 3, the mobile phone 101a may be elected as the answering device. The distance between an electronic device and the user can be characterized by the sound information of the voice data corresponding to the wake-up word picked up by that electronic device. For example, the higher the SNR, the higher the sound intensity, and the lower the reverberation delay of the first voice data, the closer the electronic device is to the user.
Answering strategy 2) Elect the electronic device being actively used by the user as the answering device.
It can be understood that if an electronic device is being actively used by the user, for example the user recently lifted its screen, the user is probably using that electronic device and is more inclined to have it recognize and respond to the user's voice data.
In some embodiments, whether an electronic device is being actively used by the user can be characterized by device usage record information. The device usage record information includes at least one of: screen-on duration, screen-on frequency, frequency of voice assistant use, and the like. It can be understood that the longer the screen-on duration, the higher the screen-on frequency, and the higher the frequency of voice assistant use, the more actively the electronic device is being used by the user. For example, according to the device usage record information of the mobile phone 101a, the tablet computer 102a, and the smart TV 103a, the smart TV 103a, which is being actively used by the user, can be elected as the answering device.
Answering strategy 3) Elect an electronic device equipped with a far-field microphone array as the answering device.
It can be understood that electronic devices equipped with far-field microphone arrays are mostly public devices, that is, electronic devices the user is more inclined to have recognize and respond to the user's voice data. Public devices are usually used by the user at relatively long distances (such as 1-3 m) and from various directions, and support shared use by multiple people, such as smart TVs or smart speakers. Compared with small electronic devices such as mobile phones and tablet computers, electronic devices equipped with far-field microphone arrays usually have better loudspeaker performance and larger screens, so the response speech they output or the response information they display for the user's voice command is more effective. Therefore, an electronic device equipped with a far-field microphone array is suitable as the answering device.
In some embodiments, whether an electronic device is equipped with a far-field microphone array is characterized by the microphone module information. For example, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a elect, according to the microphone module information, the smart TV 103a, which is equipped with a far-field microphone array, as the answering device.
Answering strategy 4) Elect a public device as the answering device.
In some embodiments, whether an electronic device is a public device may also be characterized by public device indication information. As an example, the public device indication information of the smart TV 103a indicates that the smart TV 103a is a public device, and the multi-device scenario 11 elects the smart TV 103a as the answering device. Similarly, for other descriptions of answering strategy 4), reference may be made to the related description of answering strategy 3), which is not repeated here.
If two or more electronic devices in the multi-device scenario satisfy the same answering election strategy, any one of those electronic devices may be selected as the answering device.
It can be understood that, for a description of an answering device satisfying the above answering strategies 1) to 4) simultaneously, reference may be made to the related descriptions of the answering device satisfying each of the answering conditions 1) to 4), which are not repeated here. In some embodiments, different priorities may be set in advance for the different answering strategies: if, in the multi-device scenario, one electronic device satisfies the answering condition of the highest priority and another electronic device satisfies an answering condition of lower priority, the former is taken as the answering device.
In other embodiments, in addition to the answering election strategies listed above, any electronic device that has verified the first voice data successfully, that is, any electronic device in the above candidate answering device list, may also be selected as the answering device.
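To illustrate the priority idea, the sketch below walks the strategies in an assumed priority order and returns the first matching candidate, falling back to the candidate with the highest wake-word SNR as a stand-in for strategy 1). The priority order, field names, and the SNR proxy for distance are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    wake_snr_db: float       # sound information of the wake-word (first) voice data
    actively_used: bool      # from device usage record information
    far_field_array: bool    # from microphone module information
    public_device: bool      # from public device indication information

def elect_answering_device(candidates: List[Candidate]) -> Candidate:
    """Pick the answering device using an assumed priority over strategies 2)-4),
    with strategy 1) (closest to the user, approximated by wake-word SNR) as the fallback."""
    checks = [
        lambda c: c.actively_used,    # strategy 2): device actively used by the user
        lambda c: c.far_field_array,  # strategy 3): equipped with a far-field microphone array
        lambda c: c.public_device,    # strategy 4): public device
    ]
    for check in checks:              # assumed priority order, highest first
        matches = [c for c in candidates if check(c)]
        if matches:
            return matches[0]         # any device satisfying the same strategy may be chosen
    return max(candidates, key=lambda c: c.wake_snr_db)  # strategy 1) as the fallback
```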
In some embodiments, any one electronic device in the multi-device scenario may act as the master device and perform the step of electing the answering device. For example, the mobile phone 101a, as the master device, elects the smart TV 103a as the answering device and sends an answering indication to the smart TV 103a to instruct it to subsequently recognize and respond to the voice data corresponding to the user's voice commands. In addition, the master device may send candidate pickup indications to the electronic devices in the multi-device scenario other than the answering device, to instruct them not to recognize the user's voice commands. Alternatively, if an electronic device other than the answering device receives no indication within a preset time (for example, 10 seconds) after successfully verifying the first voice data, that electronic device determines that it is not the answering device.
In other embodiments, each electronic device in the multi-device scenario may perform the operation of electing the answering device. For example, the mobile phone 101a, the tablet computer 102a, and the smart TV 103a each perform the multi-device answering election and each elect the smart TV 103a as the answering device. The smart TV 103a can then determine that it is the answering device and wake up its voice assistant to recognize and respond to the voice data corresponding to the user's voice commands. Similarly, the mobile phone 101a and the tablet computer 102a each determine that they are not the answering device and neither recognize nor respond to the user's voice commands.
In some embodiments, the electronic device performing the multi-device answering election obtains the answering election information of each electronic device in the multi-device scenario and elects the answering device according to that answering election information.
For example, the answering election information of an electronic device includes at least one of the following: the sound information of the first voice data, device usage record information, microphone module information, and public device indication information, but is not limited thereto.
In addition, after the answering device obtains the answering election information of each electronic device, it may cache that information.
Election of the pickup device
Specifically, in the above step 404, the smart TV 103a may receive the corresponding pickup election information sent by the mobile phone 101a and by the tablet computer 102a, and read its own pickup election information.
It should be noted that the embodiments of the present application place no limitation on the order in which the pickup election information corresponding to the mobile phone 101a and the tablet computer 102a is sent, or on the order in which the different pieces of information within each pickup election information are sent; any feasible sending order is acceptable.
In addition, regarding the pickup election information of each electronic device, if in the above step 403 the answering device has already computed and cached some information of each electronic device, such as the sound information of the first voice data, then in step 404 the cached information can be read without recomputing it.
Specifically, the embodiments of the present application may comprehensively consider the different pieces of information in the pickup election information corresponding to the electronic devices, that is, the different factors affecting the pickup effect of the electronic devices, and set a pickup election strategy so that the electronic device with the better pickup effect in the multi-device scenario serves as the pickup device.
It can be understood that, in the embodiments of the present application, the multi-device sound pickup election is carried out among the devices that successfully detected the wake-up word, that is, among the electronic devices that successfully verified the first voice data. Specifically, the devices in the candidate answering device list described above may participate in the multi-device sound pickup election to elect the sound pickup device, in which case the candidate answering device list may be referred to as the candidate sound pickup device list. During the multi-device sound pickup election, every electronic device in the candidate sound pickup device list may serve as a candidate sound pickup device; for example, the mobile phone 101a, the tablet computer 102a and the smart TV 103a may all serve as candidate sound pickup devices, that is, the electronic devices among which the sound pickup election is performed according to the sound pickup election information.
In some embodiments, an end-to-end method such as an artificial neural network or an expert system may implement the sound pickup election strategy and select the electronic device with the better sound pickup effect from the candidate sound pickup device list as the sound pickup device. Specifically, the sound pickup election information of each electronic device in the candidate sound pickup device list is used as the input of the artificial neural network or expert system, and the output of the artificial neural network or expert system is the sound pickup device. For example, if the sound pickup election information of the mobile phone 101a, the tablet computer 102a and the smart TV 103a is used as the input of the artificial neural network or expert system, and its output is the mobile phone 101a, then the mobile phone 101a is elected as the sound pickup device.
上述人工神经网络可以为深度神经网络(Deep Neural Network,DNN)、卷积神经网络(Convolutional Neural Network,CNN)、长短期记忆网络(Long Short Term Memory,LSTM)或循环神经网络(Recurrent Neural Network,RNN)等,本申请实施例对此不做具体限定。The above artificial neural network can be a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a long short term memory network (Long Short Term Memory, LSTM) or a recurrent neural network (Recurrent Neural Network, RNN), etc., which are not specifically limited in the embodiments of the present application.
In addition, in other embodiments, the sound pickup election strategy that selects the electronic device with the better sound pickup effect from the candidate sound pickup device list as the sound pickup device may be implemented through staged cascade processing. Specifically, feature extraction or numerical quantization may first be performed on each parameter vector in the sound pickup election information of each electronic device in the candidate sound pickup device list (that is, on each piece of sound pickup election information), and then an algorithm such as a decision tree or logistic regression may be used to output the selection result for the sound pickup device. For example, through staged cascade processing, the parameter vectors in the sound pickup election information of the mobile phone 101a, the tablet computer 102a and the smart TV 103a may be feature-extracted or numerically quantized, and a decision tree, logistic regression or similar algorithm may then output the mobile phone 101a as the selection result, that is, elect the mobile phone 101a as the sound pickup device.
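Purely as a sketch of the staged cascade described above, and not as the method defined here, the following Python example quantizes a few hypothetical election parameters into a feature vector and ranks the candidates with a hand-written logistic-regression-style score; the feature names, normalization constants, weights and example values are all assumptions.

```python
import math
from typing import Dict, List

def quantize_features(info: Dict[str, float]) -> List[float]:
    """Turn raw election parameters into a numeric feature vector (illustrative)."""
    return [
        info.get("snr_db", 0.0) / 30.0,                       # normalized signal-to-noise ratio
        info.get("intensity_db", 0.0) / 90.0,                 # normalized sound intensity
        1.0 - min(info.get("reverb_ms", 0.0) / 500.0, 1.0),   # lower reverberation is better
        1.0 if info.get("aec_active", 0.0) else 0.0,          # AEC in effect
    ]

def logistic_score(features: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Logistic-regression-style score in (0, 1); higher means a better pickup candidate."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def elect_pickup_device(candidates: Dict[str, Dict[str, float]]) -> str:
    """Return the candidate with the highest score (hypothetical weights)."""
    weights = [2.0, 1.0, 1.5, 1.0]
    return max(candidates, key=lambda d: logistic_score(quantize_features(candidates[d]), weights))

# Example inputs loosely modelled on the mobile phone 101a, tablet 102a and smart TV 103a.
candidates = {
    "phone_101a":  {"snr_db": 24.0, "intensity_db": 70.0, "reverb_ms": 80.0,  "aec_active": 0.0},
    "tablet_102a": {"snr_db": 15.0, "intensity_db": 62.0, "reverb_ms": 150.0, "aec_active": 0.0},
    "tv_103a":     {"snr_db": 10.0, "intensity_db": 55.0, "reverb_ms": 300.0, "aec_active": 1.0},
}
print(elect_pickup_device(candidates))  # prints "phone_101a" with these illustrative numbers
```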
Specifically, in some embodiments, the answering device may carry out the multi-device sound pickup election through at least one of a first type of cooperative sound pickup strategy and a second type of cooperative sound pickup strategy. For example, the process may consist of two stages. In the first stage, the answering device uses the first type of cooperative sound pickup strategy to remove from the candidate sound pickup device list those disadvantaged devices that are clearly unsuitable for the subsequent cooperative sound pickup, or to directly decide that the answering device itself is the most suitable sound pickup device. In the second stage, the answering device uses the second type of cooperative sound pickup strategy to elect, according to the sound pickup election information of each electronic device in the candidate sound pickup device list, the electronic device with the better sound pickup effect as the sound pickup device. It can be understood that if the first stage does not decide on a sound pickup device, the second stage is executed to elect one.
在一些实施例中,上述第一类协同拾音策略可以包括以下策略a1)至a6)中的至少一项。In some embodiments, the above-mentioned first type of cooperative voice pickup strategy may include at least one of the following strategies a1) to a6).
a1)将已连接耳机且不是应答设备的电子设备,确定为非候选拾音设备。a1) Determine the electronic device that is connected to the headset and is not an answering device as a non-candidate pickup device.
The state in which an electronic device is connected to an earphone is indicated by the earphone connection state information. Specifically, if an electronic device is connected to a wired or wireless earphone and is not the answering device, then, since such a device only supports close-range pickup through the earphone microphone, there is a high probability that it is far from the user or not currently being used by the user. Selecting such a device as the sound pickup device could cause speech recognition to fail, so the device is marked as a non-candidate sound pickup device unsuitable for participating in the multi-device sound pickup election and is removed from the candidate sound pickup device list. It can be understood that a non-candidate sound pickup device will not participate in the multi-device sound pickup election, that is, it will not be elected as the sound pickup device.
a2)将处于预设网络状态(即网络状态较差)且不是应答设备的电子设备,确定为非候选拾音设备。a2) Determine the electronic device that is in the preset network state (that is, the network state is poor) and is not an answering device as a non-candidate pickup device.
An electronic device being in the preset network state is indicated by the network connection state information. Specifically, if an electronic device has a poor network state (for example, a low network communication rate, a weak wireless network signal, or frequent recent network disconnections) and is not the answering device, then, in order to avoid data loss or delay when the answering device calls that device, which would affect the subsequent cooperative sound pickup and voice interaction flow, the device is marked as a non-candidate device unsuitable for participating in the multi-device sound pickup election and is removed from the candidate sound pickup device list.
a3)将麦克风模组处于被占用状态且不是应答设备的电子设备,确定为非候选拾音设备。a3) Determine the electronic device whose microphone module is in an occupied state and is not an answering device as a non-candidate pickup device.
其中,麦克风模组处于被占用状态由麦克风占用信息指示。如果一个电子设备的麦克风模组被除语音助手之外的其它应用(如录音机)占用,且不是应答设备,则将其作为非候选拾音设备,并从候选拾音设备列表中移除。具体的,电子设备的麦克模组被其它应用占用,说明该电子设备可能无法使用麦克风模组进行拾音,那么将电子设备标记为不适于参与协同拾音的设备。The fact that the microphone module is in an occupied state is indicated by the microphone occupancy information. If the microphone module of an electronic device is occupied by an application other than a voice assistant (such as a voice recorder), and it is not an answering device, it is regarded as a non-candidate pickup device and removed from the list of candidate pickup devices. Specifically, if the microphone module of the electronic device is occupied by other applications, indicating that the electronic device may not be able to use the microphone module for sound pickup, the electronic device is marked as a device that is not suitable for participating in collaborative sound pickup.
a4)将处于预设网络状态的应答设备,确定为拾音设备。a4) Determine the answering device in the preset network state as the sound pickup device.
If the answering device's own network connection state is poor, then, to avoid the answering device failing when it calls another candidate sound pickup device, it is directly decided that the answering device is the most suitable sound pickup device, and the answering device uses its local microphone module for the subsequent sound pickup.
a5)将已连接耳机的应答设备,确定为拾音设备。a5) Determine the answering device connected to the headset as the pickup device.
其中,如果应答设备已连接有线或无线耳机,那么应答设备有较高概率是最靠近用户或用户正在使用的设备,因而直接决策选择应答设备为拾音设备。Among them, if the answering device has been connected to a wired or wireless headset, then the answering device has a high probability of being the device closest to the user or the device being used by the user, so it is directly decided to select the answering device as a sound pickup device.
a6)将处于预设情景模式的电子设备,确定为拾音设备。a6) Determine the electronic device in the preset scene mode as a sound pickup device.
其中,如果应答设备处于预设情景模式(如地铁模式、飞行模式、驾驶模式、旅行模式)下,可直接决策选择与该情景模式对应的电子设备作为拾音设备,以保证系统性能。例如,在驾驶模式下,为避免行车噪声干扰,可固定选择麦克风降噪能力较好的电子设备为拾音设备。又如,在旅行模式下,为避免设备通信功耗上升和续航时间下降,可固定选择应答设备为拾音设备。Among them, if the answering device is in a preset scene mode (such as subway mode, flight mode, driving mode, travel mode), it can directly decide to select the electronic device corresponding to the scene mode as the sound pickup device to ensure system performance. For example, in the driving mode, in order to avoid the interference of driving noise, the electronic device with better noise reduction capability of the microphone can be fixedly selected as the sound pickup device. For another example, in the travel mode, in order to avoid the increase in the communication power consumption of the device and the decrease in the battery life, the answering device can be fixedly selected as the sound pickup device.
另外,上述第二类拾音选举策略可以包括策略b1)至b4)中的至少一项。In addition, the above-mentioned second type of voice selection strategy may include at least one of strategies b1) to b4).
b1)将AEC生效的电子设备作为拾音设备。b1) Use the electronic device with AEC in effect as a sound pickup device.
That is, an electronic device in the candidate sound pickup device list whose AEC capability information indicates that AEC is in effect is used as the sound pickup device, since an electronic device with AEC in effect generally has a better sound pickup effect.
It can be understood that an electronic device with AEC in effect is usually one that is currently playing audio through its speaker. If an electronic device is playing audio but has no AEC capability, or its AEC is not in effect, the playback will seriously interfere with the device itself, for example by severely degrading its own sound pickup effect. Conversely, if the device playing audio has internal noise reduction capability and AEC in effect, the influence of the internal noise generated by its own playback on its sound pickup effect can be eliminated.
b2)将降噪能力较好的电子设备作为拾音设备。b2) Use electronic devices with better noise reduction capabilities as sound pickup devices.
That is, an electronic device in the candidate sound pickup device list whose microphone model parameters indicate that its microphone module has better noise reduction capability is used as the sound pickup device. For example, when the user is far from the devices or the first voice data is weak, an electronic device equipped with a far-field microphone array may be used as the sound pickup device. Specifically, by determining whether an electronic device's microphone module is a near-field microphone or a far-field microphone, a far-field-microphone device with better noise reduction capability can be elected as the sound pickup device.
b3)将距离用户最近的电子设备作为拾音设备。b3) Use the electronic device closest to the user as a sound pickup device.
That is, the electronic device in the candidate sound pickup device list that is closest to the user is used as the sound pickup device. Generally, the device in the candidate sound pickup device list whose picked-up voice data corresponding to the user's voice (such as the first voice data) has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay is the one closest to the user and has the best sound pickup effect.
b4)将距离外部噪声源最远的电子设备作为拾音设备。b4) Use the electronic device farthest from the external noise source as the sound pickup device.
That is, the electronic device in the candidate sound pickup device list that is farthest from the external noise source is used as the sound pickup device. Generally, the device in the candidate sound pickup device list whose picked-up voice data corresponding to the user's voice (such as the first voice data) has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay is the one farthest from the external noise source and has the best sound pickup effect.
It can be understood that the sound pickup election strategies above (the first type and the second type) include, but are not limited to, the examples given. For cases in which a device simultaneously satisfies several of strategies a1) to a6) and b1) to b4), reference may be made to the descriptions above of a sound pickup device satisfying each individual strategy, and the details are not repeated here.
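To make the two-stage flow concrete, the following Python sketch first applies first-type rules (direct decisions for the answering device and removal of unsuitable candidates) and then ranks the remaining candidates with second-type rules; the attribute names and the lexicographic ranking order are illustrative assumptions, not requirements of the strategies above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    device_id: str
    is_answering_device: bool = False
    headset_connected: bool = False   # relevant to a1) and a5)
    poor_network: bool = False        # relevant to a2) and a4)
    mic_occupied: bool = False        # relevant to a3)
    aec_active: bool = False          # relevant to b1)
    far_field_mic: bool = False       # relevant to b2)
    snr_db: float = 0.0               # stands in for b3)/b4) sound information

def elect_pickup(candidates: List[Candidate]) -> Candidate:
    # Assumes exactly one answering device is present in the list.
    answering = next(c for c in candidates if c.is_answering_device)

    # Stage 1, first-type strategies: direct decisions for the answering device (a4, a5).
    if answering.poor_network or answering.headset_connected:
        return answering

    # Stage 1, first-type strategies: drop unsuitable non-answering candidates (a1, a2, a3).
    remaining = [
        c for c in candidates
        if c.is_answering_device
        or not (c.headset_connected or c.poor_network or c.mic_occupied)
    ]

    # Stage 2, second-type strategies: rank what is left. The lexicographic order
    # (AEC in effect, then far-field microphone, then SNR) is an illustrative choice.
    return max(remaining, key=lambda c: (c.aec_active, c.far_field_mic, c.snr_db))
```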
In some embodiments, different priorities may be set in advance for different sound pickup election strategies, and the strategy with the higher priority is applied first when electing the sound pickup device. A priority may be assigned to a single strategy or to a combination of strategies. For example, if the combination of strategy b1) and strategy b3) has a higher priority than strategy b3) alone, and one electronic device in the candidate sound pickup device list satisfies both b1) and b3) while another satisfies only b3), the former may be elected as the sound pickup device.
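One hypothetical way to encode such priorities is an ordered rule table in which a strategy combination (here b1) together with b3)) outranks a single strategy; the rule names, attributes and ordering below are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A device is described by a small attribute dictionary; each rule returns True
# if the device satisfies the corresponding strategy (names are illustrative).
Device = Dict[str, bool]
Rule = Callable[[Device], bool]

def rule_b1(d: Device) -> bool:   # AEC in effect
    return d.get("aec_active", False)

def rule_b3(d: Device) -> bool:   # closest to the user
    return d.get("closest_to_user", False)

# Rules listed from highest to lowest priority; the combination b1)+b3) outranks b3) alone.
PRIORITISED_RULES: List[Tuple[str, Rule]] = [
    ("b1+b3", lambda d: rule_b1(d) and rule_b3(d)),
    ("b3", rule_b3),
    ("b1", rule_b1),
]

def elect_by_priority(devices: Dict[str, Device]) -> str:
    for _name, rule in PRIORITISED_RULES:
        matches = [dev_id for dev_id, attrs in devices.items() if rule(attrs)]
        if matches:
            return matches[0]      # ties could be broken by SNR or other sound information
    return next(iter(devices))     # fall back if no rule matches
```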
For example, in multi-device scenario 11, with the smart TV 103a as the answering device and in a low-noise environment without external noise interference, the mobile phone 101a, which has better SE processing capability, the highest voice intensity or signal-to-noise ratio, and the lowest reverberation delay, can be elected from the mobile phone 101a, the tablet computer 102a and the smart TV 103a as the sound pickup device according to strategies b2) and b3) above. In this case the mobile phone 101a is the electronic device closest to the user, at a distance of 0.3 m. In this way, the influence of the deployment positions of the electronic devices on their sound pickup effect in the multi-device scenario can be avoided.
Embodiment 2
Because a user tends to subconsciously raise their voice when uttering the wake-up word, and because of factors such as movement of the user or of the electronic devices, sound information such as the sound intensity and signal-to-noise ratio of the wake-up word's voice data can hardly represent accurately the audio quality with which an electronic device will subsequently pick up the user's voice commands. Therefore, the sound information of the voice data corresponding to the voice command spoken after the wake-up word can be used as sound pickup election information to elect the sound pickup device.
FIG. 5 shows a flowchart of another multi-device-based voice processing method. This flow differs from the flow shown in FIG. 4 in that it adds a step of electing the sound pickup device according to the sound information of the voice data of the voice command spoken by the user. Specifically, as shown in FIG. 5, the flow includes:
步骤501-步骤503,与上述步骤401至403相同,此处不再赘述。 Steps 501 to 503 are the same as the above-mentioned steps 401 to 403, and are not repeated here.
Step 504: The smart TV 103a obtains the sound pickup election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a, respectively.
Step 505: The mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up the voice command spoken by the user, and the voice data corresponding to the voice within a first duration of the voice command is selected as third voice data.
For example, the first duration is X seconds (for example, 3 s). In some embodiments, the voice corresponding to the third voice data is any segment of the first duration within the voice command spoken by the user. For example, the voice within the first duration is the voice within the first X seconds of the voice command spoken by the user.
It can be understood that, when the voice within the first duration is the voice within the first X seconds spoken by the user, in step 505 the mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up only the third voice data and do not pick up the voice data corresponding to the voice command spoken by the user after the first X seconds. In this case, the third voice data may be a segment of voice command spoken by the user before the voice command corresponding to the second voice data (for example, "What's the weather like in Beijing tomorrow?"). In this way, the answering device obtains the third voice data quickly, while avoiding the waste of device resources that would result from each electronic device in the multi-device scenario spending a long time picking up the user's voice commands.
In addition, in other embodiments, in step 505 the mobile phone 101a, the tablet computer 102a and the smart TV 103a each pick up the voice data corresponding to a complete voice command spoken by the user, such as the second voice data corresponding to "What's the weather like in Beijing tomorrow?", and then select the first X seconds of that second voice data as the third voice data. In this case, the third voice data may be the opening segment of the voice command corresponding to the second voice data (such as "What's the weather like in Beijing tomorrow?"); for example, the third voice data corresponds to "Tomorrow".
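As a small illustration of selecting "the voice within the first duration", the sketch below trims a buffer of audio samples to its first X seconds; the 16 kHz sampling rate and 3-second duration are assumed values.

```python
SAMPLE_RATE_HZ = 16000   # assumed sampling rate

def first_x_seconds(samples: list, x_seconds: float = 3.0) -> list:
    """Keep only the samples covering the first x_seconds of the picked-up command."""
    return samples[: int(SAMPLE_RATE_HZ * x_seconds)]

# Example: a 5-second command is trimmed to its first 3 seconds as "third voice data".
command_samples = [0.0] * (SAMPLE_RATE_HZ * 5)
third_voice_data = first_x_seconds(command_samples)
assert len(third_voice_data) == SAMPLE_RATE_HZ * 3
```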
步骤506:智能电视103a分别获取手机101a、平板电脑102a和智能电视103a拾音得到的第三语音数据的声音信息。Step 506: The smart TV 103a obtains the sound information of the third voice data obtained by the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
For example, the sound information of the third voice data includes at least one of the following: signal-to-noise ratio, sound intensity (or energy value), reverberation parameters, and so on. Generally speaking, the higher the signal-to-noise ratio and sound intensity and the lower the reverberation delay of the third voice data detected by an electronic device, the better the quality of that third voice data and the closer it is to the voice command actually spoken by the user, which in turn indicates that the electronic device is closer to the user. The sound information of the third voice data can therefore be used as sound pickup election information for electing the sound pickup device.
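The sketch below shows one simple way such sound information could be estimated from raw samples: an RMS level in dB and a crude signal-to-noise ratio that compares the command segment with a noise-only segment; the measurement conventions are assumptions and are not prescribed above.

```python
import math
from typing import Sequence

def rms_db(samples: Sequence[float]) -> float:
    """RMS level of a block of samples in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")

def snr_db(speech_samples: Sequence[float], noise_samples: Sequence[float]) -> float:
    """Crude SNR estimate: speech level minus the level of a noise-only segment."""
    return rms_db(speech_samples) - rms_db(noise_samples)
```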
在一些实施例中,手机101a和平板电脑102a可以分别计算得到第三语音数据的声音信息,再将第三语音数据的声音信息发送给智能电视103a。或者,手机101a和平板电脑102a可以分别将检测到的第三语音数据发送给智能电视103a,再由智能电视103a计算手机101a和平板电脑102a分别对应的第三语音数据的声音信息。In some embodiments, the mobile phone 101a and the tablet computer 102a can separately obtain the sound information of the third voice data, and then send the sound information of the third voice data to the smart TV 103a. Alternatively, the mobile phone 101a and the tablet computer 102a can respectively send the detected third voice data to the smart TV 103a, and then the smart TV 103a calculates the sound information of the third voice data corresponding to the mobile phone 101a and the tablet computer 102a respectively.
步骤507:智能电视103a将第三语音数据的声音信息加入拾音选举信息,并根据手机101a、平板电脑102a和智能电视103a分别对应的拾音选举信息选举出手机101a为拾音设备。Step 507: The smart TV 103a adds the voice information of the third voice data to the voice-picking election information, and elects the mobile phone 101a as the voice-picking device according to the voice-picking election information corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a respectively.
Steps 504 to 507 are similar to step 404 above, and the similarities are not repeated. The difference is that, in step 507 of this embodiment, the smart TV 103a additionally obtains the voice data corresponding to the first X seconds of the voice command spoken by the user (that is, the third voice data), so that the smart TV 103a can decide, based on the sound information of the third voice data detected by each electronic device, that the sound pickup device is the mobile phone 101a.
Specifically, in step 507 the smart TV 103a can use the sound information of the third voice data corresponding to the mobile phone 101a, the tablet computer 102a and the smart TV 103a, respectively, to judge whether the smart TV 103a satisfies the sound pickup election strategies b3) and/or b4) above. Specifically, if it is determined from the sound information of the third voice data that the smart TV 103a is the electronic device closest to the user or farthest from the noise, the smart TV 103a is elected as the sound pickup device.
It can be understood that the device in the candidate sound pickup device list whose detected third voice data has the highest sound intensity, the highest signal-to-noise ratio and/or the lowest reverberation delay is generally the electronic device closest to the user. That device detects the third voice data with the best quality and has the best sound pickup effect.
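A minimal illustration of ranking candidates by these indicators is a lexicographic comparison of (sound intensity, signal-to-noise ratio, negative reverberation delay); the example values are invented for illustration.

```python
# sound_info maps device_id -> (intensity_db, snr_db, reverb_delay_ms) for its third voice data.
def closest_device(sound_info: dict) -> str:
    return max(
        sound_info,
        key=lambda dev: (sound_info[dev][0], sound_info[dev][1], -sound_info[dev][2]),
    )

example = {
    "phone_101a":  (72.0, 25.0, 60.0),
    "tablet_102a": (65.0, 18.0, 120.0),
    "tv_103a":     (58.0, 12.0, 260.0),
}
print(closest_device(example))  # -> "phone_101a" with these illustrative numbers
```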
步骤508-步骤512,与上述步骤405-409类似,此处不再赘述。Steps 508-512 are similar to the above-mentioned steps 405-409, and are not repeated here.
In the embodiments of the present application, in a multi-device scenario the sound pickup device can be elected not only according to information such as the sound information of the first voice data corresponding to the user's wake-up word, but also according to the sound information of the third voice data corresponding to the voice within the first duration (for example, the first X seconds) of the voice command spoken by the user. In this way, taking into account factors such as movement of the user or of the electronic devices, adding the sound information of the user's voice command to the sound pickup election information can further improve the accuracy of electing the sound pickup device, and thereby improve the accuracy of speech recognition in the multi-device scenario.
Embodiment 3
In some multi-device scenarios where external noise is present, especially when some electronic devices are at the same distance from the user or are even of the same type (for example, both are mobile phones), the distance between each electronic device and the external noise, that is, the influence of the external noise on each device's sound pickup effect, can be the main consideration in the sound pickup election. It can be understood that if different electronic devices are of the same type and at the same distance from the user, their sound pickup effects are the same.
Specifically, FIG. 6 shows a multi-device scenario of multi-device-based voice processing under external noise interference. In this scenario (denoted multi-device scenario 12), a mobile phone 101b, a mobile phone 102b and a smart TV 103b are interconnected through a wireless network and deployed 1.5 m, 1.5 m and 3.0 m away from the user, respectively. The mobile phone 101b and the mobile phone 102b may be lying idle on a desktop, and the smart TV 103b may be mounted on a wall. In multi-device scenario 12, an external noise source 104 is present near the mobile phone 102b; for example, the external noise source may be a running air conditioner or another device playing audio. Therefore, in this scenario, the influence of the external noise source 104 on the sound pickup effect of each electronic device is the main consideration in the multi-device-based voice processing flow.
图7是基于图6具体的协同处理语音的方法的流程。如图7所示,手机101b、手机102b、智能电视103b协同处理语音的方法的过程包括:FIG. 7 is a flowchart of a specific method for collaboratively processing speech based on FIG. 6 . As shown in FIG. 7 , the process of the method for the mobile phone 101b, the mobile phone 102b, and the smart TV 103b to cooperatively process voice includes:
步骤701-步骤709,与上述步骤401-步骤409类似,相同之处不作赘述。 Steps 701 to 709 are similar to the above-mentioned steps 401 to 409, and the similarities are not repeated.
区别仅在于执行主体有变化,多设备场景12中通过无线网络互连的电子设备由手机101a、平板电脑102a和智能电视103a变为手机101b、手机102b和智能电视103b。具体的,步骤703中选举出的应答设备为智能电视103b,并且步骤704中选举出的拾音设备为手机101b。The only difference is that the execution subject changes. In the multi-device scenario 12, the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a and smart TV 103a to mobile phone 101b, mobile phone 102b and smart TV 103b. Specifically, the answering device selected in step 703 is the smart TV 103b, and the sound pickup device selected in step 704 is the mobile phone 101b.
In multi-device scenario 12, the mobile phone 101b and the mobile phone 102b are both 1.5 m from the user; compared with the smart TV 103b, which is 3 m from the user, they are the electronic devices closest to the user. However, with the external noise source 104 near the mobile phone 102b, the only factor distinguishing the sound pickup effects of the mobile phone 101b and the mobile phone 102b is their distance from the external noise source 104. Clearly, the mobile phone 101b is farther from the external noise source 104 than the mobile phone 102b is. Therefore, unlike in multi-device scenario 11, in step 704 of multi-device scenario 12 the smart TV 103b, acting as the answering device, can, when there is interference from an external noise source, elect according to the sound pickup election strategy (such as strategy b4)) the mobile phone 101b, which is far from the external noise source and has the highest voice intensity or signal-to-noise ratio and the lowest reverberation delay, as the sound pickup device. In this way, the influence of external noise on the sound pickup effect of the electronic devices in the multi-device scenario can be avoided.
Similarly, referring to steps 505 to 507 shown in FIG. 5, in multi-device scenario 12 the smart TV 103b can also obtain the sound information of the third voice data corresponding to the voice within the first duration of the voice command spoken by the user, as picked up by each electronic device, add that sound information to the sound pickup election information in step 705, and elect the mobile phone 101b, which is farther from the external noise source 104, as the sound pickup device; the details are not repeated here.
Thus, with the multi-device-based voice processing method provided in the embodiments of the present application, the sound pickup device may possess one or more favorable factors, such as being closest to the user, having internal noise reduction capability (such as SE processing capability), and being far from external noise sources. In this way, the influence of external noise interference on the recognition accuracy of the voice assistant in the multi-device scenario can be alleviated, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
Embodiment 4
多设备场景中存在内部噪声,如正在外放音频的电子设备产生的噪声,该噪声为60-80dB的噪声,将对周边其它设备拾取语音指令产生强烈干扰。此时,可以主要考虑该内部噪声对多设备协同拾音的拾音效果的影响,如通过将外放音频的电子设备作为拾音设备,实现多设备拾音选举。In a multi-device scenario, there is internal noise, such as the noise generated by an electronic device that is playing audio. The noise is 60-80dB, which will strongly interfere with other surrounding devices picking up voice commands. At this time, the influence of the internal noise on the sound pickup effect of the multi-device cooperative sound pickup can be mainly considered. For example, by using an electronic device that emits audio as a sound pickup device, multi-device sound pickup selection can be realized.
Specifically, FIG. 8 shows a multi-device-based voice processing scenario under internal noise interference. In this scenario (denoted multi-device scenario 13), a mobile phone 101c, a tablet computer 102c and a smart TV 103c are interconnected through a wireless network and deployed 0.3 m, 1.5 m and 3.0 m away from the user, respectively.
其中,智能电视103c处于外放音频的状态,且智能电视103c具备内部噪声降噪(即降噪能力)或AEC能力。例如,智能电视103c播放的音频的音量为60-80dB,将对手机101c、平板电脑102c的拾音效果产生强烈干扰。因此,在该场景中,主要考虑智能电视103c的内部噪声对电子设备拾音效果的影响,进行基于多设备的语音处理的流程。Among them, the smart TV 103c is in the state of playing audio, and the smart TV 103c has internal noise reduction (ie, noise reduction capability) or AEC capability. For example, the volume of the audio played by the smart TV 103c is 60-80 dB, which will strongly interfere with the sound pickup effect of the mobile phone 101c and the tablet computer 102c. Therefore, in this scenario, the flow of voice processing based on multi-devices is performed mainly considering the influence of the internal noise of the smart TV 103c on the sound pickup effect of the electronic device.
FIG. 9 shows the flow of the method for cooperatively processing voice in the multi-device scenario shown in FIG. 8. As shown in FIG. 9, the process of the method in which the mobile phone 101c, the tablet computer 102c and the smart TV 103c cooperatively process voice includes:
步骤901-步骤905,与上述步骤401-步骤405类似,相同之处不作赘述。Steps 901-905 are similar to the above-mentioned steps 401-405, and the similarities are not repeated.
区别仅在于执行主体有变化,多设备场景13中通过无线网络互连的电子设备由手机101a、平板电脑102a和智能电视103a变为手机101c、平板电脑102c、智能电视103c。其中,步骤903中协同应答选举得到的应答设备是智能电视103c,步骤904中协同拾音举得到的拾音设备也是智能电视103c,即拾音设备和应答设备相同。The only difference is that the execution subject changes. In the multi-device scenario 13, the electronic devices interconnected through the wireless network are changed from mobile phone 101a, tablet computer 102a, and smart TV 103a to mobile phone 101c, tablet computer 102c, and smart TV 103c. Wherein, the answering device obtained by the cooperative answering election in step 903 is the smart TV 103c, and the sound pickup device obtained by the cooperative voice pickup in step 904 is also the smart TV 103c, that is, the sound pickup device and the answering device are the same.
Specifically, multi-device scenario 13 additionally takes into account the influence of internal noise on the devices' sound pickup effect. When the smart TV 103c is playing audio, the smart TV 103c, which has a relatively high voice signal-to-noise ratio and noise reduction capability, can be elected as the sound pickup device according to strategies b1) and b2) of the second type of sound pickup election strategy in the embodiments above. For example, in multi-device scenario 13 the smart TV 103c has internal noise reduction capability or AEC in effect, whereas the mobile phone 101c and the tablet computer 102c have no internal noise reduction capability, have internal noise reduction capability inferior to that of the smart TV 103c, have no AEC capability, or have AEC that is not in effect.
It can be understood that an electronic device with SE processing capability, such as one with internal noise reduction capability or AEC capability, can normally use the noise reduction information of the audio it is playing to eliminate the influence of that internal noise (that is, the played audio) on its sound pickup effect, and thereby pick up voice data of better quality.
此外,本实施例中,多设备协同选举出应答设备和拾音设备之后,还可以由应答设备查询出正在外放音频的内部噪音设备,使得内部噪音设备共享其降噪信息。In addition, in this embodiment, after multiple devices cooperate to select an answering device and a sound pickup device, the answering device can also query the internal noise device that is broadcasting audio, so that the internal noise device can share its noise reduction information.
Step 906: The smart TV 103c determines, from among the mobile phone 101c, the tablet computer 102c and the smart TV 103c, that the smart TV 103c is the device playing audio, which, as the internal noise device, provides noise reduction information.
The smart TV 103c determines which electronic device is playing audio by querying information such as each device's speaker occupancy state or audio/video software state (for example, whether the audio/video software is open, and the device's volume). For example, if the smart TV 103c finds that its own speaker is occupied, that its volume is high (for example, more than 60% of the maximum volume), or that its audio/video software is open, it determines that the smart TV 103c itself is playing audio and will share noise reduction information.
具体地,手机101c和平板电脑102c可以通过无线网络向智能电视103c上报其是否处于外放音频状态的信息,如上报扬声器占用状态、音量、和/或音频/视频软件状态的信息。Specifically, the mobile phone 101c and the tablet computer 102c can report to the smart TV 103c through the wireless network information about whether they are in an external audio state, such as reporting information on speaker occupancy status, volume, and/or audio/video software status.
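A hypothetical sketch of this query: each device reports its speaker occupancy, relative volume and media-application state, and the answering device flags the devices that appear to be playing audio. The 60% volume threshold mirrors the example above; the field names and the exact combination of conditions are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlaybackStatus:
    device_id: str
    speaker_in_use: bool
    volume_ratio: float          # current volume / maximum volume
    media_app_open: bool

def internal_noise_devices(reports: List[PlaybackStatus],
                           volume_threshold: float = 0.6) -> List[str]:
    """Return the devices that appear to be playing audio and should share noise reduction info."""
    return [
        r.device_id
        for r in reports
        if r.speaker_in_use and (r.volume_ratio >= volume_threshold or r.media_app_open)
    ]
```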
其中,在一些实施例中,智能电视103c拾取用户说出的语音指令所对应的语音数据的同时,还在持续通过扬声器外放音频。Wherein, in some embodiments, the smart TV 103c continues to play audio through the speaker while picking up the voice data corresponding to the voice command spoken by the user.
可以理解,智能电视103c作为应答设备和拾音设备,查询出自身为内部噪声设备之后,可以对后续拾取的语音指令对应的语音数据进行降噪处理。It can be understood that, as an answering device and a sound pickup device, the smart TV 103c can perform noise reduction processing on the voice data corresponding to the subsequently picked up voice command after inquiring that it is an internal noise device.
步骤907、智能电视103c根据降噪信息对拾音得到的第二语音数据进行降噪处理。Step 907: The smart TV 103c performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information.
步骤908、智能电视103c识别经过降噪处理后的第二语音数据。Step 908: The smart TV 103c identifies the second voice data after noise reduction processing.
步骤909、智能电视103c根据识别结果响应用户的语音指令或者控制其他电子设备响应用户的语音指令。Step 909, the smart TV 103c responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
In addition, steps 908 and 909 above are similar to steps 406 to 408 above, the only difference being that the voice data recognized by the answering device (the smart TV 103c) is voice data that has undergone noise reduction processing using the noise reduction information above. Specifically, steps 906 and 907 are newly added: the smart TV 103c, acting as the answering device, identifies the internal noise device, namely the smart TV 103c that is playing audio, which provides noise reduction information as the internal noise device. That noise reduction information enables the sound pickup device to perform noise reduction processing on the voice data corresponding to subsequently picked-up voice. Clearly, in this scenario the sound pickup device and the internal noise device are the same device.
It can be understood that an electronic device with internal noise reduction capability (that is, noise reduction capability) or with AEC in effect can introduce the audio data of the audio it is playing into the noise reduction process, and alleviate the interference by suppressing the internal noise generated by its own playback. That is, the internal noise device derives noise reduction information from the audio data of the played audio, such as the audio data itself (that is, the internal noise information) or the voice activity detection (Voice Activity Detection, VAD) information corresponding to that audio (also called silence suppression information).
An electronic device (such as the smart TV 103c) can provide the noise reduction information of the audio it is playing and use that noise reduction information to perform noise reduction on the internal noise, thereby eliminating the influence of the internal noise on other voice data (such as the picked-up voice data of the user) and improving the quality of the picked-up voice data.
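The application does not specify a particular noise reduction or AEC algorithm. As one standard technique that fits this description, a normalized-LMS echo canceller can use the played audio as a reference signal and subtract its estimated echo from the microphone signal; the sketch below is illustrative only and assumes the reference covers at least the same time span as the microphone signal.

```python
from typing import List

def nlms_echo_cancel(mic: List[float], ref: List[float],
                     taps: int = 64, mu: float = 0.5, eps: float = 1e-6) -> List[float]:
    """Subtract an adaptively filtered copy of the playback reference from the mic signal.

    mic: samples picked up by the microphone (user speech plus the device's playback echo).
    ref: samples of the audio the device itself is playing (the noise reduction reference).
    """
    w = [0.0] * taps                  # adaptive FIR estimate of the echo path
    cleaned = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, zero-padded at the start.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_estimate    # microphone sample with the estimated echo removed
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        cleaned.append(e)
    return cleaned
```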
For example, in multi-device scenario 13, for the voice command "What's the weather like in Beijing tomorrow?" spoken by the user after the wake-up word, if the smart TV 103c picked up the second voice data directly, the influence of the played audio could cause that second voice data to be recognized as, for example, "How will the day be in Beijing tomorrow?", that is, the user's actual voice command "What's the weather like in Beijing tomorrow?" would not be recognized accurately. In this case, the smart TV 103c can use its internal noise reduction information to eliminate the influence of the played audio, so that the second voice data after noise reduction is of higher quality and the accurate recognition result "What's the weather like in Beijing tomorrow?" is subsequently obtained.
步骤910,与上述步骤409类似,此处不再赘述。Step 910 is similar to the above-mentioned step 409 and will not be repeated here.
In addition, in some other embodiments, the answering device may obtain the noise reduction information for the audio played by the internal noise device, obtain the to-be-recognized voice picked up directly by the sound pickup device (that is, a to-be-recognized voice from which the internal noise of the played audio has not been removed), and then perform the step of noise reduction processing on the obtained to-be-recognized voice according to the obtained noise reduction information.
It can be understood that performing noise reduction on the sound pickup process of the sound pickup device using the noise reduction information of the electronic device playing audio, such as the audio data of the played audio itself and/or the VAD information corresponding to that audio, can alleviate the influence of the internal noise of the audio-playing device on the voice assistant's sound pickup effect in the multi-device scenario, and guarantee the multi-device-based sound pickup effect of the voice assistant, which helps guarantee the voice assistant's speech recognition accuracy. This in turn improves the user experience during speech recognition and the environmental robustness of speech recognition in multi-device scenarios.
Embodiment 5
When internal noise is present in a multi-device scenario, in order to avoid its influence on the sound pickup effect of cooperative sound pickup, not only can the electronic device playing audio be used as the sound pickup device, but the electronic device playing audio can also share the noise reduction information for its internal noise with another electronic device acting as the sound pickup device, so that the sound pickup device can, across devices, use that noise reduction information to eliminate the influence of the internal noise on its sound pickup effect.
FIG. 10 shows another multi-device-based voice processing scenario under internal noise interference. In this scenario (denoted multi-device scenario 14), a mobile phone 101d and a tablet computer 102d are interconnected through a wireless network and deployed 0.3 m and 0.6 m away from the user, respectively. The mobile phone 101d is held by the user, while the tablet computer 102d is lying idle on a desktop. The tablet computer 102d is playing audio and has internal noise reduction (that is, noise reduction capability) or AEC capability. Therefore, in this scenario the main consideration is the influence of the internal noise of the tablet computer 102d on the sound pickup effect of the cooperative sound pickup in the multi-device scenario.
图11是图10示出的多设备场景具体的协同处理语音的方法的流程,包括:FIG. 11 is a flowchart of a specific method for collaboratively processing speech in the multi-device scenario shown in FIG. 10 , including:
步骤1101-步骤1102,与上述步骤401-步骤402类似,相同之处不作赘述。区别在于,多设备场景14中通过无线网络互连的电子设备由手机101c、平板电脑102c、智能电视103c变为手机101d和平板电脑102d。 Steps 1101 to 1102 are similar to the above-mentioned steps 401 to 402, and the similarities are not repeated. The difference is that the electronic devices interconnected through the wireless network in the multi-device scenario 14 are changed from a mobile phone 101c, a tablet computer 102c, and a smart TV 103c to a mobile phone 101d and a tablet computer 102d.
步骤1103:手机101d和平板电脑102d选举出手机101d作为应答设备及拾音设备。Step 1103: The mobile phone 101d and the tablet computer 102d elect the mobile phone 101d as the answering device and the sound pickup device.
步骤1104:手机101d拾取用户说出的语音指令所对应的第二语音数据。Step 1104: The mobile phone 101d picks up the second voice data corresponding to the voice command spoken by the user.
Steps 1103 and 1104 above are similar to steps 403 and 404 above. The difference is that, after the answering device is obtained through the cooperative response election in step 1103, the answering device can be directly decided to be the sound pickup device, without performing the step of electing the sound pickup device according to a sound pickup election strategy as in the embodiments above. That is, the answering device and the sound pickup device are the same device, here the mobile phone 101d.
Step 1105: The mobile phone 101d determines, from between the mobile phone 101d and the tablet computer 102d, that the tablet computer 102d is the device playing audio, which, as the internal noise device, will share noise reduction information.
上述步骤1105与步骤906类似,区别在于,步骤1105中应答设备查询出共享降噪信息的内部噪声设备(平板电脑102d)与拾音设备(手机101d)不同。因此,本实施例中,增加了步骤1106来实现内部噪声设备向拾音设备(即手机101d)共享降噪信息。The above step 1105 is similar to step 906, the difference is that in step 1105 the answering device finds out that the internal noise device (tablet computer 102d) sharing noise reduction information is different from the sound pickup device (mobile phone 101d). Therefore, in this embodiment, step 1106 is added to realize that the internal noise device shares the noise reduction information with the sound pickup device (ie, the mobile phone 101d).
In addition, it can be understood that in some embodiments, after the tablet computer 102d, acting as the answering device, determines that the mobile phone 101d is an internal noise device, it may send a noise reduction instruction to the mobile phone 101d, so that, according to that instruction, the mobile phone 101d shares its noise reduction information with the tablet computer 102d acting as the sound pickup device.
步骤1106:平板电脑102d向手机101d发送平板电脑102d的降噪信息。Step 1106: The tablet computer 102d sends the noise reduction information of the tablet computer 102d to the mobile phone 101d.
It can be understood that, by having the tablet computer 102d share its noise reduction information with the mobile phone 101d, the audio data of the played audio itself and/or the VAD information corresponding to that audio can be shared across devices, effectively aggregating the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants.
具体地,平板电脑102d可以通过与手机101d之间的无线网络,将平板电脑102d的降噪信息发送给手机101d。Specifically, the tablet computer 102d can send the noise reduction information of the tablet computer 102d to the mobile phone 101d through the wireless network with the mobile phone 101d.
步骤1107:手机101d根据平板电脑102d的降噪信息,对拾音得到的第二语音数据进行降噪处理。Step 1107: The mobile phone 101d performs noise reduction processing on the second voice data obtained by picking up the sound according to the noise reduction information of the tablet computer 102d.
步骤1108:手机101d识别经过降噪处理后的第二语音数据。Step 1108: The mobile phone 101d identifies the second voice data after noise reduction processing.
步骤1109:手机101d根据识别结果响应用户的语音指令或者控制其他电子设备响应用户的语音指令。Step 1109: The mobile phone 101d responds to the user's voice command according to the recognition result or controls other electronic devices to respond to the user's voice command.
Steps 1107 to 1109 are similar to steps 907 to 909 above. The difference is that in step 1107 the sound pickup device (the mobile phone 101d) performs noise reduction processing on the voice data corresponding to the voice it picked up itself using the noise reduction information of another device (the tablet computer 102d), thereby realizing cross-device noise reduction.
For example, in multi-device scenario 14, for the voice command "What's the weather like in Beijing tomorrow?" spoken by the user after the wake-up word, when the mobile phone 101d directly picks up the second voice data corresponding to that command, the audio played by the tablet computer 102d degrades the quality of the second voice data, so that it might be recognized as, for example, "How will the day be in Beijing tomorrow?", which differs from the user's actual voice command "What's the weather like in Beijing tomorrow?". In other words, the poor quality of the second voice data picked up by the mobile phone 101d makes the subsequent recognition result inaccurate. In this case, the mobile phone 101d can perform noise reduction processing on the picked-up second voice data using the noise reduction information shared by the tablet computer 102d, eliminating the influence of the tablet computer 102d's playback on the mobile phone 101d's sound pickup effect. As a result, the second voice data after noise reduction is of higher quality and is subsequently accurately recognized as "What's the weather like in Beijing tomorrow?".
It can be understood that sharing the audio data of the played audio and/or the VAD information corresponding to that audio across devices, to assist the sound pickup device in performing noise reduction during sound pickup, can effectively aggregate the peripheral resources of multiple electronic devices equipped with microphone modules and voice assistants, and further improve the accuracy of speech recognition in multi-device scenarios.
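As a cross-device illustration of steps 1106 and 1107, the sketch below models the shared noise reduction information as per-frame VAD flags for the playing device's own audio, which the pickup device then uses to attenuate the affected frames; the message format, frame size and attenuation factor are assumptions, and a real implementation might instead use the shared audio data itself as an echo-cancellation reference as sketched earlier.

```python
from dataclasses import dataclass
from typing import List

FRAME = 160  # 10 ms frames at an assumed 16 kHz sampling rate

@dataclass
class NoiseReductionInfo:
    """Message a playback device (e.g. tablet 102d) might share with the pickup device."""
    device_id: str
    playback_vad: List[bool]   # per-frame flags: True where the shared playback is active

def apply_shared_vad(mic_samples: List[float], info: NoiseReductionInfo,
                     attenuation: float = 0.2) -> List[float]:
    """Attenuate microphone frames that coincide with the other device's active playback."""
    cleaned = list(mic_samples)
    for i, active in enumerate(info.playback_vad):
        if not active:
            continue
        start, end = i * FRAME, min((i + 1) * FRAME, len(cleaned))
        for n in range(start, end):
            cleaned[n] *= attenuation
    return cleaned
```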
Thus, in the embodiments of the present application, the sound pickup device elected among the multiple devices may possess one or more favorable factors, such as being closest to the user, being farthest from external noise sources, and having internal noise reduction capability. In this way, the influence of device deployment positions, internal noise interference or external noise interference on the voice assistant's sound pickup effect and speech recognition accuracy in the multi-device scenario can be alleviated, improving the user interaction experience and the environmental robustness of speech recognition in multi-device scenarios.
FIG. 12 shows a schematic structural diagram of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in the embodiments of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine some components, split some components, or use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent components, or may be integrated in one or more processors.
For example, the processor 110 may be used to detect whether the electronic device 100 has picked up voice data corresponding to a wake-up word or a voice command spoken by the user, and to obtain the sound information of the voice data, device status information, microphone module information, and so on. In addition, the processor 110 may perform the aforementioned actions such as answering device election, sound pickup device election, or internal noise device query according to the information of each electronic device (such as sound pickup election information or answering election information).
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer pattern between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the electronic device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU. For example, the NPU can support the electronic device 100 in recognizing, through the voice assistant, the voice data obtained by sound pickup.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present invention are merely schematic illustrations and do not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide solutions for wireless communication applied to the electronic device 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 may also amplify a signal modulated by the modem processor and convert it into an electromagnetic wave for radiation through the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 and at least some modules of the processor 110 may be disposed in the same component.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more components integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive a to-be-sent signal from the processor 110, perform frequency modulation and amplification on it, and convert it into an electromagnetic wave for radiation through the antenna 2.
For example, the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, and other such modules may be used to support the electronic device 100 in sending the sound information of voice data, device status information, and the like to other electronic devices in a multi-device scenario, and specifically in sending the above-mentioned answering election information, sound pickup election information, noise reduction information, and so on.
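Purely to illustrate the kind of message these communication modules might carry, the sketch below broadcasts one frame of externally played audio together with its VAD flags over a local network as JSON over UDP. The transport, the port number, and the field names (device_id, vad, audio_ref) are assumptions for this example; the specification does not define a message format.

```python
import base64
import json
import socket

BROADCAST_ADDR = ("255.255.255.255", 50007)   # hypothetical port for this example

def share_noise_reduction_info(device_id, audio_frame, vad_flags):
    """Broadcast a frame of externally played audio plus its VAD flags so that
    the elected pickup device can use them as a noise-reduction reference.

    audio_frame: raw PCM bytes of the frame currently being played
    vad_flags:   iterable of 0/1 voice-activity flags for sub-frames of the frame
    """
    msg = {
        "device_id": device_id,
        "vad": list(vad_flags),
        "audio_ref": base64.b64encode(audio_frame).decode("ascii"),
    }
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(json.dumps(msg).encode("utf-8"), BROADCAST_ADDR)
```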
The electronic device 100 implements a display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. For example, the display screen 194 may be used to support the electronic device 100 in displaying an answering interface in response to the user's voice command, and the answering interface may include information such as answering text.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example saving files such as music and videos in the external memory card. For example, the external memory card may be used to support the electronic device 100 in storing the above-mentioned sound pickup election information, answering election information, noise reduction information, and the like.
The internal memory 121 may be used to store computer-executable program code, and the executable program code includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like. The data storage area may store data (such as audio data and a phone book) created during use of the electronic device 100. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121 and/or instructions stored in a memory disposed in the processor. For example, the internal memory 121 may be used to support the electronic device 100 in storing the above-mentioned sound pickup election information, answering election information, noise reduction information, and the like.
The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal, for example converting the user's voice received by the electronic device 100 into a digital audio signal (that is, the voice data corresponding to the user's voice), or converting audio generated by the voice assistant using TTS into an answering voice. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The electronic device 100 may listen to music or a hands-free call through the speaker 170A, or may play, based on the voice assistant, the answering voice corresponding to the user's voice command, for example the answering voice "I'm here" for the wake-up word, or the answering voice "It will be sunny in Beijing tomorrow" for the voice command "What's the weather like in Beijing tomorrow?".
The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone (that is, the microphone module) 170C, also called a "mic" or "mouthpiece", is used to convert a sound signal into an electrical signal, for example converting a wake-up word or a voice command spoken by the user into an electrical signal (that is, the corresponding voice data). When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement a directional recording function, and so on.
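As a sketch of how several microphones can be combined for noise reduction and directional pickup as described above, the following delay-and-sum beamformer aligns and averages the microphone signals toward an assumed talker direction. The far-field assumption, the sampling rate, and the geometry handling are illustrative only; the specification does not mandate any particular multi-microphone algorithm.

```python
import numpy as np

def delay_and_sum(frames, mic_positions, direction, fs=16000, c=343.0):
    """Combine multi-microphone frames by steering toward `direction`.

    frames:        (num_mics, num_samples) array, one row per microphone
    mic_positions: (num_mics, 2) array of microphone coordinates in metres
    direction:     2-D vector pointing from the array toward the talker
    Returns a beamformed mono signal of length num_samples.
    """
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    num_mics, num_samples = frames.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Under a far-field (plane-wave) assumption, a microphone closer to the
        # talker hears the wavefront earlier by (position . direction) / c seconds.
        lead_s = float(np.asarray(mic_positions[m]) @ d) / c
        shift = int(round(lead_s * fs))
        # Delay the earlier channels so all channels line up, then average.
        out += np.roll(frames[m], shift)      # wrap-around ignored in this sketch
    return out / num_mics
```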
The embodiments of the mechanisms disclosed in the present application may be implemented in hardware, software, firmware, or a combination of these implementation methods. The embodiments of the present application may be implemented as a computer program or program code executed on a programmable system, and the programmable system includes at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described in the present application and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of the present application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly language or machine language when needed. In fact, the mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (for example, computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (for example, a computer), including but not limited to floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or a tangible machine-readable memory used to transmit information over the Internet by means of electrical, optical, acoustic, or other forms of propagated signals (for example, carrier waves, infrared signals, or digital signals). Thus, machine-readable media include any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (for example, a computer).
In the drawings, some structural or method features may be shown in specific arrangements and/or orders. It should be understood, however, that such specific arrangements and/or orders may not be required. Rather, in some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative drawings. In addition, the inclusion of structural or method features in a particular drawing does not imply that such features are required in all embodiments; in some embodiments these features may not be included or may be combined with other features.
It should be noted that each unit/module mentioned in the device embodiments of the present application is a logical unit/module. Physically, a logical unit/module may be a physical unit/module, may be part of a physical unit/module, or may be implemented as a combination of multiple physical units/modules. The physical implementation of these logical units/modules themselves is not what matters most; the combination of the functions implemented by these logical units/modules is what is key to solving the technical problem raised by the present application. In addition, in order to highlight the innovative part of the present application, the above device embodiments of the present application do not introduce units/modules that are not closely related to solving the technical problem raised by the present application, which does not mean that other units/modules do not exist in the above device embodiments.
It should be noted that, in the examples and the specification of this patent, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that any such actual relationship or order exists between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. Although the present application has been illustrated and described with reference to certain preferred embodiments thereof, a person of ordinary skill in the art should understand that various changes in form and detail may be made therein without departing from the spirit and scope of the present application.

Claims (25)

  1. A multi-device-based voice processing method, characterized in that the method comprises:
    a first electronic device among a plurality of electronic devices picks up sound to obtain a first to-be-recognized voice;
    the first electronic device receives, from a second electronic device that plays audio externally among the plurality of electronic devices, audio information related to the audio played externally by the second electronic device;
    the first electronic device performs noise reduction processing on the first to-be-recognized voice obtained by sound pickup according to the received audio information to obtain a second to-be-recognized voice.
  2. The method according to claim 1, characterized in that the audio information comprises at least one of the following: audio data of the externally played audio, and voice activity detection (VAD) information corresponding to the audio.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    the first electronic device sends the second to-be-recognized voice to a third electronic device used for voice recognition among the plurality of electronic devices; or
    the first electronic device recognizes the second to-be-recognized voice.
  4. The method according to claim 3, characterized in that, before the first electronic device among the plurality of electronic devices picks up sound to obtain the first to-be-recognized voice, the method further comprises:
    the first electronic device sends sound pickup election information of the first electronic device to the third electronic device, wherein the sound pickup election information of the first electronic device is used to indicate the sound pickup situation of the first electronic device;
    the first electronic device is an electronic device for sound pickup elected by the third electronic device from the plurality of electronic devices based on the obtained sound pickup election information of the plurality of electronic devices.
  5. The method according to claim 4, characterized in that the method further comprises:
    the first electronic device receives a sound pickup instruction sent by the third electronic device, wherein the sound pickup instruction is used to instruct the first electronic device to pick up sound and to send the noise-reduced to-be-recognized voice to the third electronic device.
  6. The method according to claim 4 or 5, characterized in that the sound pickup election information comprises at least one of the following: echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to a wake-up word obtained by sound pickup, and voice information corresponding to a voice command obtained by sound pickup;
    wherein the voice command is obtained by sound pickup after the wake-up word is obtained by sound pickup; and the device status information comprises at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and scenario mode information.
  7. A multi-device-based voice processing method, characterized in that the method comprises:
    a second electronic device among a plurality of electronic devices plays audio externally;
    the second electronic device sends audio information related to the audio to a first electronic device used for sound pickup among the plurality of electronic devices, wherein
    the audio information can be used by the first electronic device to perform noise reduction processing on to-be-recognized audio obtained by sound pickup by the first electronic device.
  8. The method according to claim 7, characterized in that the audio information comprises at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio.
  9. The method according to claim 7 or 8, characterized in that the method further comprises:
    the second electronic device receives a sharing instruction from a third electronic device used for voice recognition among the plurality of electronic devices; or
    the second electronic device receives a sharing instruction from the first electronic device;
    wherein the sharing instruction is used to instruct the second electronic device to send the audio information to the first electronic device.
  10. The method according to claim 9, characterized in that, before the second electronic device sends the audio information related to the audio to the first electronic device used for sound pickup among the plurality of electronic devices, the method further comprises:
    the second electronic device sends sound pickup election information of the second electronic device to the third electronic device, wherein the sound pickup election information of the second electronic device is used to indicate the sound pickup situation of the second electronic device;
    the first electronic device is an electronic device for sound pickup elected by the third electronic device from the plurality of electronic devices based on the obtained sound pickup election information of the plurality of electronic devices.
  11. A multi-device-based voice processing method, characterized in that the method comprises:
    a third electronic device among a plurality of electronic devices detects that there is, among the plurality of electronic devices, a second electronic device that is playing audio externally;
    in a case where the second electronic device is different from the third electronic device, the third electronic device sends a sharing instruction to the second electronic device, wherein the sharing instruction is used to instruct the second electronic device to send audio information related to the audio played externally by the second electronic device to a first electronic device used for sound pickup among the plurality of devices;
    in a case where the second electronic device is the same as the third electronic device, the third electronic device sends the audio information to the first electronic device;
    wherein the audio information can be used by the first electronic device to perform noise reduction processing on a first to-be-recognized voice obtained by sound pickup by the first electronic device to obtain a second to-be-recognized voice.
  12. The method according to claim 11, characterized in that the audio information comprises at least one of the following: audio data of the audio, and voice activity detection (VAD) information corresponding to the audio.
  13. The method according to claim 11 or 12, characterized in that the first electronic device is different from the third electronic device, and the method further comprises:
    the third electronic device obtains, from the first electronic device, the second to-be-recognized voice obtained by sound pickup by the first electronic device;
    the third electronic device recognizes the second to-be-recognized voice.
  14. The method according to claim 13, characterized in that, before the third electronic device sends the sharing instruction to the second electronic device, the method further comprises:
    the third electronic device obtains sound pickup election information of the plurality of electronic devices, wherein the sound pickup election information of the plurality of electronic devices is used to indicate the sound pickup situations of the plurality of electronic devices;
    the third electronic device elects, based on the sound pickup election information of the plurality of devices, an electronic device from the plurality of electronic devices as the first electronic device.
  15. The method according to claim 14, characterized in that the method further comprises:
    the third electronic device sends a sound pickup instruction to the first electronic device, wherein the sound pickup instruction is used to instruct the first electronic device to pick up sound and to send the second to-be-recognized voice obtained by sound pickup to the third electronic device.
  16. The method according to claim 14 or 15, characterized in that the sound pickup election information comprises at least one of the following: echo cancellation (AEC) capability information, microphone module information, device status information, voice information corresponding to a wake-up word obtained by sound pickup, and voice information corresponding to a voice command obtained by sound pickup;
    wherein the voice command is obtained by sound pickup after the wake-up word is obtained by sound pickup; and the device status information comprises at least one of the following: network connection status information, headset connection status information, microphone occupancy status information, and scenario mode information.
  17. The method according to claim 16, characterized in that the third electronic device electing, based on the sound pickup election information of the plurality of electronic devices, at least one electronic device from the plurality of electronic devices as the first electronic device comprises at least one of the following:
    in a case where the third electronic device is in a preset network state, the third electronic device determines the third electronic device as the first electronic device;
    in a case where the third electronic device is connected to a headset, the third electronic device determines the third electronic device as the first electronic device;
    the third electronic device determines, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices that are in a preset scenario mode.
  18. The method according to claim 17, characterized in that the third electronic device electing, based on the sound pickup election information of the plurality of electronic devices, at least one electronic device from the plurality of electronic devices as the first electronic device comprises at least one of the following:
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices on which AEC is in effect;
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices whose noise reduction capability satisfies a predetermined noise reduction condition;
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices whose distance to the user is less than a first predetermined distance;
    the third electronic device takes, as the first electronic device, at least one of the electronic devices among the plurality of electronic devices whose distance to an external noise source is greater than a second predetermined distance.
  19. The method according to claim 17, characterized in that the preset network state comprises at least one of the following: a network whose communication rate is less than or equal to a predetermined rate, and a network whose disconnection frequency is greater than or equal to a predetermined frequency; and the preset scenario mode comprises at least one of the following: subway mode, airplane mode, driving mode, and travel mode.
  20. The method according to any one of claims 11 to 19, characterized in that the third electronic device elects the first electronic device from the plurality of electronic devices by using a neural network algorithm or a decision tree algorithm.
  21. A voice processing system, characterized in that the system comprises a first electronic device and a second electronic device;
    wherein the second electronic device, in a case where it plays audio externally, sends audio information related to the audio to the first electronic device used for sound pickup;
    the first electronic device is configured to pick up sound to obtain a first to-be-recognized voice, and to perform noise reduction processing on the first to-be-recognized voice obtained by sound pickup according to the audio information received from the second electronic device to obtain a second to-be-recognized voice.
  22. The system according to claim 21, characterized in that the system further comprises a third electronic device;
    the third electronic device is configured to obtain sound pickup election information of a plurality of electronic devices, wherein the sound pickup election information of the plurality of electronic devices is used to indicate the sound pickup situations of the plurality of electronic devices, and to elect, based on the sound pickup election information of the plurality of electronic devices, at least one electronic device from the plurality of electronic devices as the first electronic device used for sound pickup, wherein the first electronic device, the second electronic device, and the third electronic device are all electronic devices among the plurality of electronic devices, and the third electronic device is the same as or different from the first electronic device;
    the first electronic device is further configured to send the second to-be-recognized voice to the third electronic device; and
    the third electronic device is further configured to recognize the second to-be-recognized voice obtained from the first electronic device.
  23. A computer-readable storage medium, characterized in that instructions are stored on the storage medium, and when the instructions are executed on a computer, the instructions cause the computer to execute the multi-device-based voice processing method according to any one of claims 1 to 20.
  24. An electronic device, characterized by comprising: one or more processors; and one or more memories, wherein the one or more memories store one or more programs, and when the one or more programs are executed by the one or more processors, the electronic device is caused to execute the multi-device-based voice processing method according to any one of claims 1 to 20.
  25. An electronic device, characterized by comprising: a processor, a memory, a communication interface, and a communication bus, wherein the memory is configured to store at least one instruction, the processor, the memory, and the communication interface are connected through the communication bus, and the processor executes the at least one instruction stored in the memory, so that the electronic device executes the multi-device-based voice processing method according to any one of claims 1 to 20.
PCT/CN2021/110865 2020-09-11 2021-08-05 Multi-device voice processing method, medium, electronic device, and system WO2022052691A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010955837.7A CN114255763A (en) 2020-09-11 2020-09-11 Voice processing method, medium, electronic device and system based on multiple devices
CN202010955837.7 2020-09-11

Publications (1)

Publication Number Publication Date
WO2022052691A1 true WO2022052691A1 (en) 2022-03-17

Family

ID=80632591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/110865 WO2022052691A1 (en) 2020-09-11 2021-08-05 Multi-device voice processing method, medium, electronic device, and system

Country Status (2)

Country Link
CN (1) CN114255763A (en)
WO (1) WO2022052691A1 (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708864A (en) * 2011-03-28 2012-10-03 德信互动科技(北京)有限公司 Conversation-based household electronic equipment and control method thereof
CN106357497A (en) * 2016-11-10 2017-01-25 北京智能管家科技有限公司 Control system of intelligent home network
US10057125B1 (en) * 2017-04-17 2018-08-21 Essential Products, Inc. Voice-enabled home setup
US20190020493A1 (en) * 2017-07-12 2019-01-17 Universal Electronics Inc. Apparatus, system and method for directing voice input in a controlling device
CN109473095A (en) * 2017-09-08 2019-03-15 北京君林科技股份有限公司 A kind of intelligent home control system and control method
CN107886952A (en) * 2017-11-09 2018-04-06 珠海格力电器股份有限公司 A kind of method, apparatus, system and the electronic equipment of Voice command intelligent appliance
CN108447479A (en) * 2018-02-02 2018-08-24 上海大学 The robot voice control system of noisy work condition environment
CN108665899A (en) * 2018-04-25 2018-10-16 广东思派康电子科技有限公司 A kind of voice interactive system and voice interactive method
CN108766432A (en) * 2018-07-02 2018-11-06 珠海格力电器股份有限公司 A kind of method to cooperate between control household electrical appliances
CN109347710A (en) * 2018-11-07 2019-02-15 四川长虹电器股份有限公司 A kind of system and method for realizing full room interactive voice control smart home

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001890A (en) * 2022-05-31 2022-09-02 四川虹美智能科技有限公司 Intelligent household appliance control method and device based on response-free
CN115001890B (en) * 2022-05-31 2023-10-31 四川虹美智能科技有限公司 Intelligent household appliance control method and device based on response-free

Also Published As

Publication number Publication date
CN114255763A (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN111091828B (en) Voice wake-up method, device and system
US11437021B2 (en) Processing audio signals
CN106782519A (en) A kind of robot
US20180174574A1 (en) Methods and systems for reducing false alarms in keyword detection
CN107464565A (en) A kind of far field voice awakening method and equipment
WO2023284402A1 (en) Audio signal processing method, system, and apparatus, electronic device, and storage medium
CN111696562B (en) Voice wake-up method, device and storage medium
WO2022052691A1 (en) Multi-device voice processing method, medium, electronic device, and system
WO2023004223A1 (en) Noise suppression using tandem networks
KR20200024068A (en) A method, device, and system for selectively using a plurality of voice data reception devices for an intelligent service
US11783809B2 (en) User voice activity detection using dynamic classifier
US20210110838A1 (en) Acoustic aware voice user interface
CN115731923A (en) Command word response method, control equipment and device
US11917386B2 (en) Estimating user location in a system including smart audio devices
CN115424628A (en) Voice processing method and electronic equipment
CN115331672B (en) Device control method, device, electronic device and storage medium
US20210391840A1 (en) Audio gain selection
CN117953872A (en) Voice wakeup model updating method, storage medium, program product and equipment
CN113889084A (en) Audio recognition method and device, electronic equipment and storage medium
CN116564298A (en) Speech recognition method, electronic device, and computer-readable storage medium
CN116978372A (en) Voice interaction method, electronic equipment and storage medium
CN117690423A (en) Man-machine interaction method and related device

Legal Events

Code Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21865747; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21865747; Country of ref document: EP; Kind code of ref document: A1)