CN117641191A - Sound processing method, sound pickup system and electronic equipment - Google Patents

Publication number: CN117641191A
Authority: CN (China)
Prior art keywords: sound, user, pickup, picked, control device
Legal status: Pending (an assumption, not a legal conclusion)
Application number: CN202210983429.1A
Other languages: Chinese (zh)
Inventors: 刘鑫, 高丽, 罗友, 王金山, 黎椿键, 杨志兵
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority: CN202210983429.1A
Publication: CN117641191A (legal status: Pending)

Abstract

A sound processing method, a sound pickup system, and an electronic device, relating to the field of terminal technologies. Each sound pickup device in the system can actively pick up the sound of at least one user during a multi-person interaction, and a control device in the system can automatically identify the pickup demand mode (for example, mute or speaking) of the user corresponding to each sound pickup device from the sound that device picks up. The control device then suppresses or enhances the sound picked up by the device according to that user's pickup demand mode. Because a user's pickup demand mode can be identified automatically regardless of where the user is located, the method meets the needs of different users at different positions during multi-person interaction, reduces the manual operations otherwise needed to change a sound pickup device's state, preserves the user experience, and improves the accuracy and quality of sound processing.

Description

Sound processing method, sound pickup system and electronic equipment
Technical Field
The application relates to the technical field of terminals, in particular to a sound processing method, a sound pickup system and electronic equipment.
Background
In a multi-person conference scenario, if a participant does not want to speak, or does not want other participants to hear what is being discussed nearby, that participant's sound can be suppressed. At present there are two main ways to suppress the sound of a target participant who does not want to speak: one is to mute a sound pickup device, such as a microphone, near the target participant; the other is to suppress the sound of the area in which the target participant is located.
However, both of the above methods require the target participant to manually control the pickup state of the sound pickup device (for example, the microphone) according to their own speaking or muting needs. This operation process is cumbersome and disrupts both the participants and the host of the conference.
Disclosure of Invention
Embodiments of the present application provide a sound processing method, a sound pickup system, and an electronic device that can automatically identify a user's pickup demand mode (for example, a mute mode or a speaking mode) and suppress or enhance the picked-up sound of the user accordingly, thereby reducing the manual operations needed to change the pickup state of the sound pickup device and preserving the user experience.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, a sound processing method is provided. The method is applied to a control device in a sound pickup system, where the system further includes a first device and a second device that can pick up sound at the location of the first device. The control device may determine, from the first sound picked up by the first device, the pickup demand mode of the first user corresponding to the first device, and determine, from the second sound picked up by the second device, the pickup demand mode of the second user corresponding to the second device. Then, when the pickup demand mode of the first user is mute and that of the second user is non-mute, the control device removes or attenuates the first user's sound from the mix of the first sound and the second sound. The mix, with the first user's sound removed or attenuated, can be played by the first device and/or the second device; when it is played, a listening user cannot hear, or cannot clearly hear, what the first user is saying.
This scheme can be applied to a multi-person interaction scenario that includes sound pickup devices for picking up sound and a control device that controls them. Each sound pickup device can actively pick up the sound of at least one user during the interaction, and the control device can automatically identify the pickup demand mode of the user corresponding to each sound pickup device from the sound that device picks up. The control device then processes the sound picked up by the device according to that user's pickup demand mode; for example, when a user needs to be muted, the sound picked up by the corresponding device is suppressed. The method can identify a user's pickup demand mode automatically regardless of where the user is located, and can process the user's picked-up sound in real time when that mode changes. It therefore meets the needs of different users at different positions during multi-person interaction, reduces manual operations to change the pickup state of a device, preserves the user experience, and improves the accuracy and quality of sound processing.
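The text does not give a concrete algorithm for the control device's mixing step. As an illustration only, the core behavior (drop the contributions of muted users, pass the rest through) can be sketched in Python; the function name and mode strings are hypothetical:

```python
import numpy as np

def mix_with_demand_modes(picked, modes):
    """Mix per-device signals, dropping any signal whose user's pickup
    demand mode is "mute" and passing the rest through unchanged.

    picked: {device_name: 1-D numpy array of samples}
    modes:  {device_name: "mute" or "non-mute"}
    """
    kept = [sig for dev, sig in picked.items() if modes.get(dev) != "mute"]
    if not kept:
        # Everyone is muted: return silence of the right length.
        return np.zeros_like(next(iter(picked.values())))
    return np.sum(kept, axis=0)
```

For example, with a muted first device and a non-muted second device, the resulting mix contains only the second device's signal, so a listener cannot hear the first user.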
In an implementation manner of the first aspect, the control device may further enhance the first user's sound in the mix of the first sound and the second sound when the pickup demand mode of the first user is speaking and that of the second user is non-speaking. The mix with the first user's sound enhanced is for playback by the first device and/or the second device; when it is played, a listening user hears mainly, or only, the first user's sound. The control device can therefore not only suppress the first user's sound according to the first user's mute requirement but also enhance it according to the first user's speaking requirement, meeting users' different requirements and preserving the user experience.
In an implementation manner of the first aspect, the control device may determine, from the first sound, the number of sound sources corresponding to the first device, and from that determine the pickup demand mode of the first user. If the number of sound sources corresponding to the first device is 1, the pickup demand mode of the first user is determined to be speaking; if it is greater than 1, the pickup demand mode of the first user may be determined to be non-speaking. In this implementation, the control device can use the number of sound sources corresponding to each sound pickup device to determine whether one user, or several users, are speaking among the users corresponding to that device, and thus whether to process the sound it picks up (for example, the sound of its corresponding user). If the number of sound sources corresponding to a sound pickup device is greater than 1, multiple users are currently speaking and the control device cannot determine which one is the intended speaker, so it may leave their sounds neither suppressed nor enhanced; if the number of sound sources is exactly 1, only one corresponding user is speaking, the control device can determine that this user is speaking, and it enhances that user's sound.
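The source-count rule just described can be sketched as below. The function name is hypothetical, and the treatment of zero detected sources is an assumption, since the text only covers counts of 1 and greater than 1:

```python
def pickup_demand_from_source_count(num_sources):
    """Exactly one active sound source at a device -> its user is
    "speaking"; more than one -> the control device cannot tell who is
    speaking, so the mode is "non-speaking" (assumed here to also cover
    zero detected sources)."""
    return "speaking" if num_sources == 1 else "non-speaking"
```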
In an implementation manner of the first aspect, the control device may determine, from the first sound, the sound signal-to-noise ratio corresponding to the first device, and from that determine the pickup demand mode of the first user. If the signal-to-noise ratio corresponding to the first device is smaller than a first preset signal-to-noise ratio, the pickup demand mode of the first user may be determined to be mute; if it is greater than or equal to the first preset signal-to-noise ratio, the pickup demand mode may be determined to be non-mute. In this implementation, the control device can use each sound pickup device's signal-to-noise ratio to judge the sound environment of the users corresponding to that device, and thus whether to process the sound it picks up (for example, the sound of its corresponding user). A low signal-to-noise ratio suggests that the picked-up sound quality is poor, that the environment of the corresponding user is noisy, or that several corresponding users may be speaking at once; the control device may then suppress the sound picked up by that device, reducing its interference with the sound picked up by other devices.
If the control device determines that a device's signal-to-noise ratio is relatively high, the picked-up sound quality can be considered relatively good, the environment of the corresponding user relatively quiet, and a single corresponding user is likely speaking. In this case, the control device may enhance the sound picked up by that device (for example, the sound of its corresponding user), so that when the sound is played, listening users can clearly hear the speech of that device's corresponding user.
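The SNR rule above can be sketched as follows. The 15 dB value is purely illustrative, since the text leaves the "first preset signal-to-noise ratio" unspecified, and the function names are assumptions:

```python
import math

def estimate_snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from average signal power and
    average noise power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)

def pickup_demand_from_snr(snr_db, first_preset_snr_db=15.0):
    """SNR below the first preset threshold -> "mute" (noisy environment,
    suppress); at or above it -> "non-mute"."""
    return "mute" if snr_db < first_preset_snr_db else "non-mute"
```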
In an implementation manner of the first aspect, the control device may obtain the semantic content of the first sound and determine the pickup demand mode of the first user from it. If the similarity between the semantic content of the first sound and preset content is smaller than a preset similarity, the pickup demand mode of the first user is determined to be mute; if the similarity is greater than or equal to the preset similarity, the pickup demand mode may be determined to be non-mute. In this implementation, the control device can use the semantics of the sound picked up by each device to judge whether the corresponding user is speaking on topic, and thus whether the sound picked up by that device (for example, the sound of its corresponding user) needs to be processed. If the semantic content of the picked-up sound is unrelated to the current conference content, the corresponding user can be considered to be talking about something else rather than taking part in the discussion; the control device may then suppress the sound picked up by that device, reducing its interference with the sound picked up by other devices. Conversely, if the semantic content is related to the current conference content, the corresponding user can be considered to be discussing or presenting in the conference.
In this case, the control device may enhance the sound picked up by that device, so that the discussion of its corresponding user is clearly audible when the sound is played.
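A minimal sketch of the semantic rule: here a bag-of-words cosine similarity between a transcript of the picked-up speech and the preset conference content stands in for whatever semantic model the scheme actually uses, and the 0.3 threshold is an assumption:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two text strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def pickup_demand_from_semantics(transcript, preset_content,
                                 preset_similarity=0.3):
    """Transcript similar to the preset conference content -> "non-mute";
    otherwise the user is off topic -> "mute"."""
    sim = cosine_similarity(transcript, preset_content)
    return "non-mute" if sim >= preset_similarity else "mute"
```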
In an implementation manner of the first aspect, the control device may be the first device or the second device in the sound pickup system; that is, the first device or the second device may itself perform the recognition of the user's pickup demand mode and the suppression or enhancement of the sound. Alternatively, the control device may be a cloud device, in which case the first device and the second device in the pickup system communicate with the cloud device.
In an implementation manner of the first aspect, when removing or attenuating the first user's sound, the control device may first perform objectification (audio-object separation) processing on the first sound to obtain the first user's sound, and then remove or attenuate that sound in the mix of the first sound and the second sound. In this implementation, the control device can determine from the result of the objectification processing how many users' sounds the first sound contains, and then remove or attenuate the sound of one or more of those users from the mix with the sound picked up by the other devices, so that a listening user cannot hear, or cannot clearly hear, those users when the mix is played.
In an implementation manner of the first aspect, when enhancing the first user's sound, the control device may perform objectification processing on the first sound to obtain the first user's sound, and then, in the mix of the first sound and the second sound, enhance the first user's sound or remove the sounds of users other than the first user. Here too, the control device can determine from the result of the objectification processing how many users' sounds the first sound contains, and then enhance the sound of one or more of those users, so that when the mix is played, a listening user hears only, or mainly, those users.
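Once the first sound has been separated into per-user audio objects, the suppress/enhance step reduces to applying per-object gains before re-mixing. This sketch assumes the separation has already produced one array per user; the gain values and function name are illustrative, not taken from the text:

```python
import numpy as np

def remix_with_modes(separated, modes, mute_gain=0.0, speak_gain=2.0):
    """Re-mix separated per-user audio objects: remove or attenuate the
    objects of "mute" users, enhance those of "speaking" users, and pass
    everyone else through at unit gain.

    separated: {user: 1-D numpy array}; modes: {user: mode string}.
    """
    gains = {"mute": mute_gain, "speaking": speak_gain}
    out = np.zeros_like(next(iter(separated.values())))
    for user, sig in separated.items():
        out = out + gains.get(modes.get(user, ""), 1.0) * sig
    return out
```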
In an implementation manner of the first aspect, in the process of removing or attenuating the first user's sound, the control device may also perform objectification processing on the first sound and determine, from the result, the sound in the first sound belonging to a first user that meets a first preset sound quality condition, where the first user includes one or more users. The control device then removes or attenuates, in the mix of the first sound and the second sound, the sound of the first user that meets the first preset sound quality condition. In this implementation, if the first user includes multiple users, the control device selects the sound of higher quality from among their sounds and treats the corresponding user as the one who needs to be muted, so that only that user's sound is suppressed.
The first preset sound quality condition may specify, for example, that the signal-to-noise ratio of the user's sound be greater than or equal to a second preset signal-to-noise ratio, that the loudness be greater than or equal to a preset loudness, and so on.
In an implementation manner of the first aspect, in the process of enhancing the first user's sound, the control device may also perform objectification processing on the first sound and determine, from the result, the sound in the first sound belonging to a first user that meets a second preset sound quality condition, where the first user includes one or more users. The control device then enhances, in the mix of the first sound and the second sound, the sound of the first user that meets the second preset sound quality condition. In this implementation, if the first user includes multiple users, the control device selects the sound of higher quality from among their sounds and treats the corresponding user as the one who needs to speak, so that only that user's sound is enhanced.
The second preset sound quality condition may likewise specify, for example, that the signal-to-noise ratio of the user's sound be greater than or equal to the second preset signal-to-noise ratio, that the loudness be greater than or equal to the preset loudness, and so on.
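The per-user quality check behind both preset conditions can be sketched as below. The threshold values, field names, and tie-breaking by highest SNR are assumptions the text does not fix:

```python
def best_quality_user(users, preset_snr_db=20.0, preset_loudness=0.1):
    """Among users separated from one device's pickup, return the name of
    the single user whose sound meets the preset quality condition (SNR
    and loudness thresholds), preferring the highest SNR; return None if
    no user qualifies.

    users: list of {"name": str, "snr_db": float, "loudness": float}.
    """
    ok = [u for u in users
          if u["snr_db"] >= preset_snr_db and u["loudness"] >= preset_loudness]
    return max(ok, key=lambda u: u["snr_db"])["name"] if ok else None
```

The selected user's sound is then the one suppressed (first condition) or enhanced (second condition) in the mix.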
In other implementations of the first aspect, the control device may likewise determine, from the second sound, the number of sound sources corresponding to the second device to determine the pickup demand mode of the second user; determine the signal-to-noise ratio corresponding to the second device from the second sound to determine the pickup demand mode of the second user; or obtain the semantic content of the second sound and determine the pickup demand mode of the second user from it. The specific determination manners are as in the foregoing implementations and are not repeated here.
In a second aspect, there is provided another sound processing method applied to a sound pickup system including a control device, a first device, and a second device that picks up sound at a location where the first device is located. In the sound processing method, a first device picks up a first sound. The second device picks up the second sound. The control device determines a pickup demand mode of a first user corresponding to the first device according to the first sound, and determines a pickup demand mode of a second user corresponding to the second device according to the second sound. And, the control device removes or attenuates the sound of the first user in the mixed sound of the first sound and the second sound in the case where the pickup demand mode of the first user is mute and the pickup demand mode of the second user is non-mute.
The first device may also play the mix after the first user's sound has been removed or attenuated, and so may the second device. When the mix is played, a listening user cannot hear, or cannot clearly hear, what the first user is saying.
As in the first aspect, this scheme can be applied to a multi-person interaction scenario: the control device automatically identifies each user's pickup demand mode from the sound picked up by the corresponding sound pickup device, and suppresses or enhances that sound accordingly in real time, regardless of where the user is located. It thereby meets the needs of different users at different positions, reduces manual operations to change the pickup state of a device, preserves the user experience, and improves the accuracy and quality of sound processing.
In an implementation manner of the second aspect, the control device further enhances the first user's sound in the mix of the first sound and the second sound when the pickup demand mode of the first user is speaking and that of the second user is non-speaking. The first device and the second device may each play the mix after the first user's sound has been enhanced; when it is played, a listening user hears mainly, or only, the first user's sound. The control device can therefore not only suppress the first user's sound according to the first user's mute requirement but also enhance it according to the first user's speaking requirement, meeting users' different requirements and preserving the user experience.
In an implementation manner of the second aspect, the control device may determine the number of sound sources corresponding to the first device according to the first sound. If the number of sound sources corresponding to the first device is 1, the pickup demand mode of the first user is determined to be speaking; if it is greater than 1, the pickup demand mode of the first user may be determined to be non-speaking. In this implementation, the control device may determine, from the number of sound sources corresponding to each sound pickup device, whether one user, or several users, are speaking among the users corresponding to that device, and thus whether to process the sound it picks up (for example, the sound of its corresponding user).
In an implementation manner of the second aspect, the control device may determine the sound signal-to-noise ratio corresponding to the first device according to the first sound. If that signal-to-noise ratio is smaller than a first preset signal-to-noise ratio, the pickup demand mode of the first user is determined to be mute; if it is greater than or equal to the first preset signal-to-noise ratio, the pickup demand mode may be determined to be non-mute. In this implementation, the control device may determine, from each sound pickup device's signal-to-noise ratio, the sound environment of the users corresponding to that device, and thus whether to process the sound it picks up (for example, the sound of its corresponding user).
In an implementation manner of the second aspect, the control device may obtain the semantic content of the first sound. If the similarity between the semantic content of the first sound and the preset content is smaller than the preset similarity, the pickup demand mode of the first user is determined to be mute; if the similarity is greater than or equal to the preset similarity, the pickup demand mode may be determined to be non-mute. In this implementation, the control device may determine, from the semantics of the sound picked up by each device, whether the corresponding user is speaking on topic, and thus whether the sound picked up by that device (for example, the sound of its corresponding user) needs to be processed.
In an implementation manner of the second aspect, the control device may be a first device or a second device in the sound pickup system, that is, the first device or the second device in the sound pickup system may perform the recognition of the sound pickup demand pattern of the user and the suppressing or enhancing process of the sound. Alternatively, the control device may be a cloud device, that is, the first device and the second device in the pickup system may also communicate with the cloud device.
In an implementation manner of the second aspect, when removing or attenuating the first user's sound in the mix of the first sound and the second sound, the control device may first perform objectification (audio-object separation) processing on the first sound to obtain the first user's sound, and then remove or attenuate that sound in the mix. In this implementation, the control device can determine from the result of the objectification processing how many users' sounds the first sound contains, and then remove or attenuate the sound of one or more of those users from the mix with the sound picked up by the other devices, so that a listening user cannot hear, or cannot clearly hear, those users when the mix is played.
In an implementation manner of the second aspect, when the control device enhances the sound of the first user in the mixed sound obtained by mixing the first sound and the second sound, the control device may subject the first sound to an objectification process to obtain the sound of the first user. The control device enhances the sound of the first user or removes the sound of other users than the first user in the mixing of the first sound and the second sound. In this implementation manner, the control device may also perform the objectifying process on the first sound, so that it is determined that the first sound includes sounds of several users according to the result of the objectifying process. The control device then subjects the sound of one or more users to enhancement processing. So that when the mix is played, the listening user can only hear or mainly hear the sound of one or more users.
In an implementation manner of the second aspect, when removing or attenuating the first user's sound in the mix of the first sound and the second sound, the control device may perform objectification processing on the first sound and determine, from the result, the sound in the first sound belonging to a first user that meets a first preset sound quality condition, where the first user includes one or more users. The control device then removes or attenuates, in the mix of the first sound and the second sound, the sound of the first user that meets the first preset sound quality condition. In this implementation, if the first user includes multiple users, the control device selects the sound of higher quality from among their sounds and treats the corresponding user as the one who needs to be muted, so that only that user's sound is suppressed.
The first preset sound quality condition may specify, for example, that the signal-to-noise ratio of the user's sound be greater than or equal to a second preset signal-to-noise ratio, that the loudness be greater than or equal to a preset loudness, and so on.
In an implementation manner of the second aspect, when enhancing the first user's sound in the mix of the first sound and the second sound, the control device may perform objectification processing on the first sound and determine, from the result, the sound in the first sound belonging to a first user that meets a second preset sound quality condition, where the first user includes one or more users. The control device then enhances, in the mix of the first sound and the second sound, the sound of the first user that meets the second preset sound quality condition. In this implementation, if the first user includes multiple users, the control device selects the sound of higher quality from among their sounds and treats the corresponding user as the one who needs to speak, so that only that user's sound is enhanced.
The second preset sound quality condition may likewise specify, for example, that the signal-to-noise ratio of the user's sound be greater than or equal to the second preset signal-to-noise ratio, that the loudness be greater than or equal to the preset loudness, and so on.
In other implementations of the second aspect, the control device may likewise determine, from the second sound, the number of sound sources corresponding to the second device to determine the pickup demand mode of the second user; determine the signal-to-noise ratio corresponding to the second device from the second sound to determine the pickup demand mode of the second user; or obtain the semantic content of the second sound and determine the pickup demand mode of the second user from it. The specific determination manners are as in the foregoing implementations and are not repeated here.
In a third aspect, a pickup system is provided that includes a control device, a first device, and a second device that picks up sound at a location where the first device is located. Wherein the first device is for picking up a first sound; the second device is used for picking up a second sound; the control device is used for determining a pickup demand mode of a first user corresponding to the first device according to the first sound, and determining a pickup demand mode of a second user corresponding to the second device according to the second sound; and when the sound pickup demand mode of the first user is mute and the sound pickup demand mode of the second user is non-mute, removing or weakening the sound of the first user in the mixed sound of the first sound and the second sound. And the first device is further configured to play the audio mix after removing or attenuating the sound of the first user. The second device is also used for playing the mixed sound after the sound of the first user is removed or weakened.
In an implementation manner of the third aspect, in a case where the pickup demand mode of the first user is speaking and the pickup demand mode of the second user is non-speaking, the control device may further enhance the sound of the first user in the mixed sound obtained by mixing the first sound and the second sound. The mixed sound after the sound of the first user is enhanced is played by the first device and/or the second device; when it is played, the listening user mainly hears, or only hears, the sound of the first user.
In an implementation manner of the third aspect, the control device may determine, according to the first sound, the number of sound sources corresponding to the first device, so as to determine the pickup demand mode of the first user. If the number of sound sources corresponding to the first device is 1, it may be determined that the pickup demand mode of the first user is speaking; if the number of sound sources corresponding to the first device is greater than 1, it may be determined that the pickup demand mode of the first user is non-speaking.
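The sound-source-count rule above amounts to a simple mapping; the sketch below is an assumed concrete reading of it (the function name is hypothetical):

```python
def demand_mode_from_source_count(num_sources: int) -> str:
    """Map the number of sound sources detected for a pickup device
    to the corresponding user's pickup demand mode."""
    if num_sources == 1:
        return "speaking"      # exactly one source: the user is speaking
    return "non-speaking"      # multiple sources: treat as non-speaking
```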
In an implementation manner of the third aspect, the control device may determine, according to the first sound, the sound signal-to-noise ratio corresponding to the first device, so as to determine the pickup demand mode of the first user. If the sound signal-to-noise ratio corresponding to the first device is smaller than a first preset signal-to-noise ratio, it may be determined that the pickup demand mode of the first user is mute; if the sound signal-to-noise ratio is greater than or equal to the first preset signal-to-noise ratio, it may be determined that the pickup demand mode of the first user is non-mute.
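The signal-to-noise-ratio rule can be sketched as a threshold test; the dB computation and the 15 dB default threshold are illustrative assumptions, not values from the patent:

```python
import math

def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)

def demand_mode_from_snr(snr: float, first_preset_snr: float = 15.0) -> str:
    """Mute when the device's sound SNR falls below the first preset
    signal-to-noise ratio; otherwise treat the user as non-mute."""
    return "mute" if snr < first_preset_snr else "non-mute"
```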
In an implementation manner of the third aspect, the control device may obtain the semantic content of the first sound and determine the pickup demand mode of the first user according to the semantic content. If the similarity between the semantic content of the first sound and preset content is smaller than a preset similarity, it may be determined that the pickup demand mode of the first user is mute; if the similarity is greater than or equal to the preset similarity, it may be determined that the pickup demand mode of the first user is non-mute.
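As a minimal sketch of the semantic rule, a toy token-overlap score stands in for whatever semantic model the patent contemplates; both the similarity measure and the 0.5 threshold are assumptions for illustration:

```python
def token_similarity(content: str, preset: str) -> float:
    """Jaccard overlap of word sets: a crude stand-in for semantic similarity."""
    a, b = set(content.lower().split()), set(preset.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def demand_mode_from_semantics(content: str, preset: str,
                               preset_similarity: float = 0.5) -> str:
    """Mute when the picked-up content is unrelated to the preset
    (e.g. conference-topic) content; otherwise non-mute."""
    return "mute" if token_similarity(content, preset) < preset_similarity else "non-mute"
```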
In an implementation manner of the third aspect, the control device may be the first device or the second device in the pickup system; that is, the first device or the second device may itself perform the recognition of the user's pickup demand mode and the suppression or enhancement processing of the sound. Alternatively, the control device may be a cloud device; that is, the first device and the second device in the pickup system may also communicate with the cloud device.
In an implementation manner of the third aspect, in the process of removing or weakening the sound of the first user, the control device may first perform objectification processing on the first sound to obtain the sound of the first user. Thereafter, the sound of the first user is removed or weakened in the mixed sound of the first sound and the second sound.
In an implementation manner of the third aspect, in the process of enhancing the sound of the first user, the control device may first perform objectification processing on the first sound to obtain the sound of the first user, and then, in the mixed sound of the first sound and the second sound, enhance the sound of the first user or remove the sounds of users other than the first user.
In an implementation manner of the third aspect, in the process of removing or weakening the sound of the first user, the control device may further perform objectification processing on the first sound and determine, according to the result of the objectification processing, the sound of the first user in the first sound that meets a first preset sound quality condition, where the first user includes one or more users. Then, the control device removes or weakens, in the mixed sound of the first sound and the second sound, the sound of the first user that meets the first preset sound quality condition.
The first preset sound quality condition may be, for example, that the signal-to-noise ratio of the user's sound is greater than or equal to the second preset signal-to-noise ratio, that the loudness is greater than or equal to a preset loudness, and so on.
In an implementation manner of the third aspect, in the process of enhancing the sound of the first user, the control device may further perform objectification processing on the first sound and determine, according to the result of the objectification processing, the sound of the first user in the first sound that meets the second preset sound quality condition, where the first user includes one or more users. Then, the control device enhances, in the mixed sound of the first sound and the second sound, the sound of the first user that meets the second preset sound quality condition.
The second preset sound quality condition may likewise be, for example, that the signal-to-noise ratio of the user's sound is greater than or equal to the second preset signal-to-noise ratio, that the loudness is greater than or equal to a preset loudness, and so on.
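When the first user includes several users, the selection step above can be sketched as picking the one candidate whose sound meets the quality condition best. Reading the condition concretely as "highest SNR at or above a preset threshold" is an assumption of this sketch:

```python
def select_user_to_enhance(user_snrs: dict, second_preset_snr: float = 15.0):
    """Return the user whose sound has the highest SNR at or above the
    preset threshold, or None if no user's sound qualifies."""
    qualified = {u: s for u, s in user_snrs.items() if s >= second_preset_snr}
    if not qualified:
        return None
    # Only this user's sound will then be enhanced in the mixed sound.
    return max(qualified, key=qualified.get)
```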
In other implementations of the third aspect, the control device may further determine, according to the second sound, the number of sound sources corresponding to the second device, so as to determine the pickup demand mode of the second user; determine, according to the second sound, the sound signal-to-noise ratio corresponding to the second device, so as to determine the pickup demand mode of the second user; or acquire the semantic content of the second sound and determine the pickup demand mode of the second user according to the semantic content. For the specific determination manners, reference may be made to the foregoing implementations, and details are not repeated herein.
In a fourth aspect, an electronic device is provided, the electronic device comprising a memory, one or more processors; the memory is coupled with the processor; wherein the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the sound processing method as in any of the implementations of the first aspect or to perform the sound processing method as in any of the implementations of the second aspect.
Wherein when the electronic device performs the sound processing method as in any implementation manner of the second aspect, the electronic device may be the control device, the first device and/or the second device in the sound pickup system described above.
In a fifth aspect, there is provided a computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the sound processing method in any of the implementations of the first aspect or to perform the sound processing method as in any of the implementations of the second aspect.
In a sixth aspect, there is provided a computer program product for, when run on a computer, causing the computer to perform the sound processing method as in any of the implementations of the first aspect or the sound processing method as in any of the implementations of the second aspect.
It will be appreciated that, for the advantages achieved by the sound processing method according to the second aspect, the pickup system according to the third aspect, the electronic device according to the fourth aspect, the computer-readable storage medium according to the fifth aspect, and the computer program product according to the sixth aspect, reference may be made to the advantages in any possible design of the first aspect and the second aspect, and details are not repeated herein.
Drawings
FIG. 1 is a schematic diagram of a conference site according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a pickup system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a software structure of an electronic device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a conference site in scenario one according to an embodiment of the present application;
FIG. 6 is a schematic diagram of yet another pickup system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of sound loudness during playing of a mixed sound according to an embodiment of the present application;
FIG. 8 is a schematic diagram of sound loudness during playing of yet another mixed sound according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a conference site in scenario two according to an embodiment of the present application;
FIG. 10 is a schematic diagram of sound loudness during playing of yet another mixed sound according to an embodiment of the present application;
FIG. 11 is a schematic diagram of sound loudness during playing of yet another mixed sound according to an embodiment of the present application;
FIG. 12 is a schematic diagram of yet another pickup system according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a hardware structure of yet another electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, unless otherwise specified, "/" indicates an "or" relationship between the associated objects; for example, A/B may mean A or B. The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. Also, unless otherwise indicated, "a plurality of" means two or more. "At least one of" the following items or the like means any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be singular or plural. In addition, to clearly describe the technical solutions of the embodiments of the present application, the words "first", "second", and the like are used to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit the quantity or order of execution, and do not necessarily indicate a difference. Meanwhile, in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or description. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or more advantageous than other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete fashion that is readily understood.
In current conference scenarios, a microphone array or a plurality of microphones is typically used to pick up the sound of conference participants at a conference site. The picked-up sound may then be saved as a conference summary, a conference recording, or the like, so that it can be conveniently replayed after the conference ends to retrieve the speaking content of the participants, the conference content, and so on. Alternatively, in a scenario in which participants at a plurality of conference sites jointly participate in the same conference, the sound picked up at one conference site may be sent to the other conference sites for playing, so that participants at the other sites can hear the speech of the participants at that site.
In the above conference scenario, the conference participants may be users who can communicate with each other face-to-face in the conference site, i.e., users who can hear each other speaking. In this scenario, the picked up sound of the conference site may be saved without being played in real time. And, one conference site may include at least one conference participant.
In the scenario in which the conference participants at a plurality of conference sites participate in the same conference together, the participants may be users who cannot communicate with each other face to face, that is, users who cannot hear each other speak, for example, participants at different conference sites. In this scenario, the picked-up sound of each conference site may be saved, or the picked-up sound of one conference site may be sent to the other conference sites for playing, so that participants who cannot communicate face to face can communicate in this manner. A participant at the same conference site may be considered a near-end offline user, while a participant at another conference site may be considered a far-end online user relative to the near-end offline user.
The conference may be a virtual conference, such as a Virtual Reality (VR) conference, or the like. In a virtual conference, some or all of the off-line users and on-line users may participate in the conference by wearing a head-display device, and the head-display device includes a microphone through which the user's sound may be picked up. The conference may be a general conference, in which a sound pickup device such as the microphone array or the microphone is provided at a conference site to pick up the sound of a user at the conference site.
A microphone array likewise includes a plurality of microphones. When a plurality of microphones are used to pick up the sound of a conference site, the microphones are directed in different directions. Illustratively, referring to FIG. 1, the conference participants at the conference site include user 1, user 2, user 3, user 4, and user 5; microphone 1 may be oriented toward user 1, microphone 2 toward user 2 and user 3, and microphone 3 toward user 4 and user 5. In this way, microphones 1, 2, and 3 can pick up the sound of each conference participant from different directions, which also improves the spatial resolution of the picked-up sound and helps improve its sound quality.
When participating in a conference, if a participant does not want to speak, or does not want the content of a side discussion to be heard by other participants so as not to disturb them, the sound of that participant can be suppressed. Currently, the manners of suppressing the sound of a target participant who does not want to speak include controlling a nearby sound pickup device such as a microphone to mute, suppressing the sound of the area where the target participant is located, and the like.
In the manner of muting a sound pickup device such as a microphone near the target conference participant, for example, if user 1 in FIG. 1 does not want to speak, user 1 needs to manually mute the nearby microphone 1, or the host of the conference mutes microphone 1, so that microphone 1 cannot pick up the sound of user 1, thereby suppressing the sound of user 1. However, when user 1 moves to another location during the conference, user 1 again needs to manually mute the other microphones nearby, or the host must mute them again. Therefore, in this sound suppression manner, the target participant or the host needs to manually control the corresponding microphones continually according to the target participant's pickup requirement; whenever the target participant's position or pickup requirement changes, the microphones must be controlled manually again. The operation process is thus cumbersome and disturbs both the participants and the host of the conference.
In the manner of suppressing the sound of the area where the target participant is located, the space of the conference site may be partitioned according to the topology of the microphones in FIG. 1, for example. If user 1 does not want to speak, user 1 still needs to manually mute the nearby microphone 1, or the host mutes it; the target partition where user 1 is located is then determined, and the picked-up sound in that target partition is eliminated, thereby suppressing the sound of user 1. However, since the locations of the conference-site microphones are fixed, the partitions of the conference site are also fixed. When user 1 moves during the conference, or is at the boundary between two partitions, it is difficult to accurately determine which partition user 1 is in. When user 1 moves from partition 1 to partition 2, it may currently be determined to suppress the sound in partition 1, but not the sound of user 1, who is now in partition 2. To accurately suppress the sound of user 1, user 1 still needs to manually mute the nearby microphone again, or the host must mute the other microphones near user 1 again. Therefore, in this sound suppression manner as well, if the position or pickup requirement of the target participant changes, the target participant or the host needs to manually control the microphones constantly, which also disturbs the participants and the host.
In view of the above, the embodiments of the present application provide a sound processing method that may be applied to a multi-person interaction scene. The multi-person interaction scene includes sound pickup devices for picking up sound, a control device for controlling the pickup devices, and the like. Each pickup device can actively pick up the sound of at least one user during the multi-person interaction, and the control device can automatically identify, according to the sound picked up by a pickup device, the pickup demand mode of the user corresponding to that device. The control device then processes the sound picked up by the pickup device according to the user's pickup demand mode; for example, when the user has a mute demand, the picked-up sound is suppressed. With this sound processing method, the user's pickup demand mode can be automatically identified regardless of the user's position, and when that mode changes, the picked-up sound can be processed accordingly in real time. This meets the demands of different users at different positions during the multi-person interaction, reduces the need for users to manually change the pickup state of the pickup devices, ensures the user experience, and improves the accuracy and quality of sound processing.
The sound processing method provided in the embodiments of the present application may be applied to a pickup system (or a pickup network). For example, referring to FIG. 2, the pickup system may include a pickup device corresponding to each user in the multi-person interaction scene, a control device serving as a central device, and the like. The pickup devices and the control device may be connected in a wired and/or wireless manner.
The multi-person interaction scene in the embodiments of the present application may be an interaction scene in which multiple persons participate, such as the conference described above, interactive live streaming, an interactive classroom, or interactive singing.
The pickup device in the multi-person interaction scene of the embodiments of the present application may be the microphone or microphone array set up at the conference site; it may be a movable electronic device having a sound pickup function, such as a mobile phone, a tablet computer, a notebook computer, a head-mounted display device (for example, a virtual reality device, an augmented reality device, or a mixed reality device), or a mobile phone, tablet computer, or notebook computer to which a portable sound pickup apparatus (such as a wired earphone or a Bluetooth earphone) is connected; or it may be a combination of a microphone set up at the conference site and an electronic device having a sound pickup function. When the pickup device is a movable electronic device having a sound pickup function, it may include at least three microphone units, two vector microphone units, or a regular microphone array, an irregular microphone array, a directional microphone array, an acoustic vector microphone array, or the like. The pickup device, the control device, or the like may then perform sound source localization based on the sound beams picked up by the microphone units or the microphone array, thereby determining the position of the user corresponding to the pickup device and realizing directional pickup and the like. Moreover, when the pickup device is a movable electronic device having a sound pickup function, it is easy to move, so that a user can come closer to it when speaking, and the device can pick up near-field human voice that is of higher quality and clearer.
The control device in the multi-person interaction scenario of the embodiment of the present application may be an electronic device having an audio processing function and a data processing function, such as a conference box, a television, a projector, a host, etc. The specific form of the sound pickup apparatus and the control apparatus in the embodiment of the present application is not particularly limited. In some embodiments, the pickup device may be a device carried by a user engaged in a multi-person interaction. In addition, the control device may be any sound pickup device, cloud device, or the like.
Taking the multi-person interaction scene being a conference as an example: when a plurality of users jointly participate in a conference at the same site, those users may be regarded as near-end offline users, and their corresponding pickup devices may be regarded as near-end pickup devices. When users at a plurality of conference sites jointly participate in the same conference, the users at one conference site may be regarded as near-end offline users, and their corresponding pickup devices as near-end pickup devices, while a user at another conference site may be regarded as a far-end online user relative to the near-end offline users, and the far-end online user's pickup device as a far-end pickup device.
In some embodiments, a server may be further included in the pickup system, and the server may be connected to the pickup device and the control device, respectively. The control device may send the processed sound to the sound pickup device via the server.
In the pickup system shown in FIG. 2, a pickup device (e.g., a first device, a second device, a third device, etc.) may be connected to the control device through a preset access manner, thereby forming the pickup system. The preset access manner may include two-dimensional-code access, link access, and the like. After a pickup device is connected to the pickup system, it can pick up sound in the interaction scene. The sound picked up by the first device may serve as a first sound, the sound picked up by the second device as a second sound, the sound picked up by the third device as a third sound, and so on. Correspondingly, the users to which the pickup devices correspond may include a first user, a second user, a third user, and the like.
A pickup device can pick up sound within its maximum pickup range. For example, a microphone in a pickup device may pick up sound in the maximum range within which the microphone can receive sound. In some embodiments, one or more microphones in the pickup device may be configured (e.g., their beam directions configured) to achieve omnidirectional pickup (e.g., 360° pickup) so that the device picks up all sound. In some embodiments, one or more microphones in the pickup device may be configured (e.g., their beam directions configured) to achieve directional pickup, which facilitates picking up each user's voice separately, processing each track individually, and processing the mixed sound.
Each pickup device in the pickup system can send the picked-up sound to the control device; the control device determines sound information in the pickup system according to the received sound, and then automatically determines the pickup demand mode of the user corresponding to each pickup device according to that sound information. The pickup demand mode of a user may include modes such as speaking, mute, and normal pickup. The sound information may include the number of sound sources corresponding to each pickup device, the sound signal-to-noise ratio of each pickup device, the semantics of the sound content picked up by each pickup device, and the like.
Under the condition that the target user corresponding to a target pickup device needs to mute, the control device can suppress the sound of the target user picked up by the target pickup device, so that the played mixed sound of the multi-person interaction scene does not include the sound of the target user, or includes it only very faintly. Under the condition that the target user corresponding to the target pickup device needs to speak, the control device can enhance the sound of the target user picked up by the target pickup device, so that in the played mixed sound of the multi-person interaction scene the target user's sound is louder and clearer. Under the condition that the target user corresponding to the target pickup device needs neither to mute nor to speak, the target user is considered to have a normal pickup requirement, and the control device may then perform neither suppression nor enhancement on the sound of the target user picked up by the target pickup device, so that the played mixed sound of the multi-person interaction scene includes both the sound of the target user and the sounds of other users.
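The three cases above reduce to a per-user dispatch on the pickup demand mode. The sketch below is a hedged illustration; the gain values are placeholder numbers chosen for demonstration, not values from the patent:

```python
def process_gain(demand_mode: str) -> float:
    """Gain applied to a user's picked-up sound in the final mixed sound,
    chosen according to that user's pickup demand mode."""
    return {"mute": 0.0,        # suppress: remove/attenuate in the mix
            "speaking": 2.0,    # enhance: louder and clearer in the mix
            }.get(demand_mode, 1.0)  # normal pickup: pass through unchanged
```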
The sound information may be sound information corresponding to the sound picked up by each sound pickup apparatus, or sound information corresponding to the entire sound pickup system determined in association with the sound picked up by each sound pickup apparatus.
For example, if the sound information indicates that there is only one sound source corresponding to the target pickup device in the pickup system, the control device may determine that only one user is currently using the target pickup device and that this user is likely to be speaking. The determined pickup demand mode of the user may then be speaking, and the control device may perform enhancement processing, rather than suppression processing, on the user's sound, thereby improving the user's speaking effect and sound quality. If the sound information indicates that there are multiple sound sources corresponding to the target pickup device, the sound picked up by the target pickup device may be suppressed or left unprocessed, or the user's requirement may be determined by further combining other sound information.
For example, if the sound information indicates that the signal-to-noise ratio of the sound picked up by the target sound pickup apparatus in the sound pickup system is high, the sound picked up by the target sound pickup apparatus can be considered to be good in quality, and at this time, the suppression processing of the sound picked up by the target sound pickup apparatus is not required. And if the sound information shows that the signal-to-noise ratio of the sound picked up by the target sound pickup device in the sound pickup system is low, the sound picked up by the target sound pickup device can be considered to be poor in quality, and at the moment, the sound picked up by the target sound pickup device can be subjected to inhibition processing, so that the influence on other sounds when the sound with poor quality is played is reduced.
For example, if the sound information indicates that the sound content picked up by the target sound pickup apparatus in the sound pickup system is irrelevant to the content of the current multi-person interaction scene, the sound picked up by the target sound pickup apparatus may be subjected to the suppression processing, thereby reducing the influence on other sounds when the sound talking about the irrelevant content is played.
When the control device suppresses the sound picked up by the target sound pickup device, the sound picked up by the target sound pickup device can be removed or weakened from the mixed sound picked up by the other sound pickup devices, thereby achieving the purpose of sound suppression.
When the control device performs enhancement processing on the sound picked up by the target sound pickup device, the control device may perform loudness enhancement processing, objectification processing, sound effect processing, and the like on the sound.
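Of the enhancement operations just listed, loudness enhancement is the simplest to sketch; the 6 dB default boost below is an illustrative assumption, not a value from the patent:

```python
def enhance_loudness(samples, boost_db: float = 6.0):
    """Scale a mono track by the given decibel boost (amplitude gain)."""
    gain = 10.0 ** (boost_db / 20.0)   # dB-to-linear amplitude conversion
    return [gain * s for s in samples]
```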
The control device can also store the mixed sound after the inhibition processing or the enhancement processing as a summary or a sound recording file and the like, so that a user can conveniently listen to the summary or the sound recording file again at other times to review the content in the multi-user interaction scene. Alternatively, in the multi-person interaction scenario, the control device may also send the mixed sound of the offline user after the suppression processing or the enhancement processing to the online user, so that the online user cannot hear the sound of the offline user who wants to mute, or the online user can hear the sound of the offline user who wants to speak.
In some embodiments, when the user has a mute requirement, the control device may also perform a mute operation on the sound pickup device corresponding to the user or switch the sound pickup device corresponding to the user to a mute state. The control apparatus may still perform the sound suppressing operation, the enhancing operation, and the like described above while the sound pickup apparatus is in a mute state.
The mute state of a pickup device in the pickup system described above differs from the mute state of a standalone pickup device as follows: in the mute state of a standalone pickup device, the device cannot pick up sound at all, which is how muting is achieved; in the mute state in the embodiments of the present application, the pickup device still picks up the sound of the corresponding user and sends it to the control device, and the control device then removes or weakens that sound from the other sounds, thereby achieving the purpose of muting.
In some embodiments, the control device may also be a device that is carried by a user and has an audio processing function and a data processing function, such as a notebook computer. In this way, any one of the sound pickup apparatuses in the sound pickup system can be used as a control apparatus, that is, the sound pickup apparatus can also be used as a center apparatus or a networking apparatus.
In some embodiments, the sound pickup system may further use a cloud device or a server to directly or indirectly receive the sound picked up by the sound pickup device, perform audio processing and/or data processing on the sound, and send the processed sound to a user using the sound pickup device. That is, the sound pickup system may not be provided with a control device.
In some embodiments, the mixed sound is a sound played by each sound pickup apparatus, control apparatus, speaker, or the like, and for example, the mixed sound may be a sound subjected to objectification processing, a sound subjected to suppression processing, a sound subjected to enhancement processing, or the like.
Taking the interaction scenario of a multi-person conference as an example, in the sound processing method provided in the embodiment of the present application, each sound pickup device may actively pick up the sound of a user during the conference, and the control device may automatically identify the pickup demand mode of the user corresponding to each sound pickup device according to the sound picked up by that device. The control device then processes the sound picked up by the sound pickup device corresponding to the user according to the user's pickup demand mode. This sound processing method can automatically identify a user's pickup demand mode regardless of where the user is, and can process the user's picked-up sound in real time when the user's pickup demand mode changes. It thereby meets the pickup demand modes of different users at different positions in the conference, reduces the operations of manually changing the pickup state of a sound pickup device, ensures user experience, and improves the accuracy and quality of sound processing.
The sound processing method provided by the embodiment of the present application can be applied to electronic devices such as the sound pickup device and the control device. Taking the example that the electronic device is a mobile phone, fig. 3 shows a schematic hardware structure of the electronic device 100. The electronic device 100 shown in fig. 3 may be a sound pickup device or a control device, for example. As shown in fig. 3, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identity module (SIM) card interface 195, and the like.
It is to be understood that the structure illustrated in the present embodiment does not constitute a specific limitation on the electronic apparatus 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, where the different processing units may be separate devices or may be integrated in one or more processors. The controller may be the neural center and command center of the electronic device 100. The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the processor 110 may include one or more interfaces.
It should be understood that the connection relationship between the modules illustrated in this embodiment is only illustrative, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The charge management module 140 is configured to receive a charge input from a charger. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142. The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques.
The electronic device 100 implements display functions through a GPU, the display screen 194, an application processor, and the like. The display screen 194 is used to display images, videos, and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1. The electronic device 100 may implement photographing functions through an ISP, the camera 193, a video codec, a GPU, the display screen 194, an application processor, and the like. The ISP is used to process the data fed back by the camera 193. In some embodiments, the ISP may be provided in the camera 193. The camera 193 is used to capture still images or video.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy, or the like. Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer executable program code including instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121.
Illustratively, when the electronic apparatus 100 is the sound pickup apparatus in the above-described embodiment, the sound pickup apparatus is connected to the control apparatus to constitute a sound pickup system. The processor 110 of the sound pickup apparatus may pick up sound outside the sound pickup apparatus by executing instructions stored in the internal memory 121, and transmit the picked-up sound to the control apparatus.
Illustratively, when the present electronic device 100 is the control device in the above-described embodiment, the processor 110 of the control device may perform the suppression process, the enhancement process, and the like on the received sound by executing the instructions stored in the internal memory 121, and save the processed sound or transmit the processed sound to the user for playing.
The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 may play music or carry on a hands-free call through the speaker 170A.
For example, when the electronic apparatus 100 is the sound pickup apparatus and/or the control apparatus in the above-described embodiments, the electronic apparatus 100 may play the sound of the user participating in the multi-person interaction through the speaker 170A. For example, when the multi-person interaction scenario is a multi-person conference in the above-described embodiment, the speaker 170A may play the speaking content, speaking voice, etc. of the conference participants.
The receiver 170B, also referred to as an "earpiece", is used to convert the audio electrical signal into a sound signal. When the electronic device 100 is answering a telephone call or a voice message, the voice may be heard by placing the receiver 170B close to the human ear.
Illustratively, when the electronic apparatus 100 is the sound pickup apparatus and/or the control apparatus in the above-described embodiments, the receiver 170B may pick up sound outside the electronic apparatus 100.
The microphone 170C, also referred to as a "mic" or "sound transmitter", is used to convert sound signals into electrical signals. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to implement sound signal collection, noise reduction, sound source identification, directional recording functions, and the like. In some embodiments, some of the microphones may be omni-directional microphones (an omni-directional microphone can pick up sound from all directions) and some may be directional microphones (a directional microphone mainly picks up sound from a specific direction). In some embodiments, all of the microphones may be directional microphones. In some embodiments, one or more microphones of the electronic device 100 may be configured so that the electronic device 100 can implement omni-directional pickup or directional pickup.
For example, when the electronic apparatus 100 is the sound pickup apparatus and/or the control apparatus in the above-described embodiment, the microphone 170C may pick up sound outside the electronic apparatus 100. For example, when the multi-person interaction scenario is a multi-person conference in the above-described embodiment, the microphone 170C may pick up sounds of the conference site.
The earphone interface 170D is used to connect a wired earphone.
For example, when the electronic apparatus 100 is the sound pickup apparatus and/or the control apparatus in the above-described embodiment, the headphone interface 170D may be connected to a portable sound pickup device such as a headphone, so that the electronic apparatus 100 picks up sound outside the electronic apparatus 100 through the sound pickup device such as a headphone.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc. The SIM card interface 195 is used to connect a SIM card.
The sound processing method in the embodiment of the present application may be implemented based on the electronic device 100 shown in fig. 3. Taking the above-mentioned multi-person conference scene as an example, the pickup device may be connected to the control device by a preset access manner, and constitute a pickup system. The preset access mode comprises two-dimension code access, conference link access and the like.
Each sound pickup device in the sound pickup system may pick up the sound of the conference site through a microphone 170C, a receiver 170B, an earphone, or the like, and send the picked-up sound to the control device. By executing the instructions stored in the internal memory 121, the processor 110 of the control device may identify, based on the received sound, the pickup demand mode of the user corresponding to each sound pickup device in the conference, and then perform suppression processing or enhancement processing on the sound picked up by the sound pickup device corresponding to the user based on the user's pickup demand mode. For example, when a user's pickup demand mode is mute, the sound picked up by the corresponding sound pickup device may be suppressed; when a user's pickup demand mode is speaking, the sound picked up by the corresponding sound pickup device may be enhanced. When the mixed sound after the suppression processing is played, listening users cannot hear, or can barely hear, the sound of the user who wants to mute; when the mixed sound after the enhancement processing is played, listening users can clearly hear the sound of the user who wants to speak. This sound processing method can automatically identify a user's pickup demand mode regardless of where the user is at the conference site, and can process the user's picked-up sound in real time when that mode changes. It thereby meets the pickup demand modes of different users at different positions in the conference, reduces the operations of manually changing the pickup state of a sound pickup device, ensures user experience, and improves the accuracy and quality of sound processing.
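The per-device step of that loop (suppress for mute, enhance for speaking, pass through otherwise) can be sketched as a simple gain dispatch. The mode names and gain values are illustrative assumptions; the patent does not specify concrete factors.

```python
# Illustrative sketch of the control device's per-device processing:
# sound tagged with a recognized demand mode is suppressed, enhanced,
# or passed through before mixing. Gains are assumed values.

SUPPRESS_GAIN = 0.1   # attenuate sound from a user who wants to mute
ENHANCE_GAIN = 1.5    # boost sound from a user who wants to speak

def process_by_mode(samples, mode):
    """Apply the gain that corresponds to a pickup demand mode."""
    if mode == "mute":
        gain = SUPPRESS_GAIN
    elif mode == "speak":
        gain = ENHANCE_GAIN
    else:  # "normal" pickup: leave the sound unchanged
        gain = 1.0
    return [s * gain for s in samples]
```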
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In this embodiment, taking an Android system with a layered architecture as an example, a software structure of the electronic device 100 is illustrated.
Fig. 4 is a software configuration block diagram of the electronic apparatus 100 of the present embodiment.
The layered architecture divides the software into several layers, each with distinct roles and divisions of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.
The application layer may include a series of application packages. As shown in fig. 4, the application package may include applications for cameras, gallery, calendar, phone calls, maps, navigation, WLAN, bluetooth, music, video, short messages, etc.
For example, when the electronic device 100 is a sound pickup device and/or a control device in the above-described embodiments, the application layer of the electronic device 100 may further include a conference application, a live application, a classroom application, or the like, which provides a multi-person interactive function.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in fig. 4, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
The window manager is used for managing window programs. The content provider is used to store and retrieve data and make such data accessible to applications. The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The telephony manager is used to provide communication functions of the electronic device 100, such as management of call status (including on, off, etc.). The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows the application to display notification information in a status bar, can be used to communicate notification type messages, can automatically disappear after a short dwell, and does not require user interaction.
The Android runtime includes a core library and virtual machines. The Android runtime is responsible for the scheduling and management of the Android system.
The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., openGL ES), 2D graphics engines (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications. Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
For example, when the electronic device 100 is the sound pickup device and/or the control device in the above-described embodiments, the system library of the electronic device 100 may further include an audio processing module, a data processing module, and the like. The electronic device 100 can thus use the audio processing module, the data processing module, or the like to perform suppression processing, objectification processing, loudness enhancement processing, sound effect processing, and the like on sound.
The kernel layer is a layer between hardware and software. The kernel layer at least includes a display driver, a camera driver, an audio driver, and a sensor driver.
Illustratively, when the electronic device 100 is the sound pickup device and/or the control device in the above-described embodiments, the kernel layer of the electronic device 100 may also include a microphone 170C driver, a receiver 170B driver, a speaker 170A driver, and the like.
The sound processing method in the embodiment of the present application may be implemented based on the electronic device 100 shown in fig. 4. Taking the above-mentioned multi-person conference scene as an example, the pickup device may be connected to the control device by a preset access manner, thereby forming a pickup system. The preset access mode comprises two-dimension code access, conference link access and the like.
Each sound pickup device in the sound pickup system may pick up the sound of the conference site by driving the microphone 170C, the receiver 170B, an earphone, or the like through the kernel layer, and send the picked-up sound to the control device.
The control device identifies the pickup demand mode based on the audio processing module or the data processing module in the system library, and then suppresses or enhances the sound picked up by the sound pickup device corresponding to the user according to the user's pickup demand mode. For example, when a user's pickup demand mode is mute, the sound picked up by the corresponding sound pickup device may be suppressed; when a user's pickup demand mode is speaking, the sound picked up by the corresponding sound pickup device may be enhanced. When the mixed sound after the suppression processing is played, listening users cannot hear, or can barely hear, the sound of the user who wants to mute; when the mixed sound after the enhancement processing is played, listening users can clearly hear the sound of the user who wants to speak. This sound processing method can automatically identify a user's pickup demand mode regardless of where the user is at the conference site, and can process the user's picked-up sound in real time when that mode changes. It thereby meets the pickup demand modes of different users at different positions in the conference, reduces the operations of manually changing the pickup state of a sound pickup device, ensures user experience, and improves the accuracy and quality of sound processing.
The following describes the sound processing method in the embodiment of the present application, taking the multi-person interaction scenario as a multi-person conference scenario as an example. The multi-person conference scenario includes two scenes. Scene one: multiple users are at the same conference site participating in the same conference. Scene two: multiple users are at different conference sites participating in the same conference, and each conference site includes at least one user.
Scene one
When a plurality of users are at the same conference site participating in the same conference, the sound pickup devices at the conference site form a sound pickup network or sound pickup system. One sound pickup device may correspond to one user or to a plurality of users; that is, one sound pickup device may pick up the sound of one user or the sounds of a plurality of users.
For example, referring to fig. 5, the pick-up device at the conference site includes a microphone and a user's mobile phone, where microphone 1 corresponds to user 1, microphone 2 corresponds to user 2, mobile phone 1 corresponds to user 3, mobile phone 2 corresponds to user 4, and mobile phone 3 corresponds to user 5. Each sound pickup apparatus can pick up not only the sound of the corresponding user but also the sound of other users around.
In the conference scene shown in fig. 5, after each sound pickup device is connected to the sound pickup system, the sound pickup device starts picking up sound and sends the picked-up sound to the control device. The control device determines the pickup demand mode of the user corresponding to each sound pickup device according to the received sound. The pickup demand modes of a user include mute, speaking, normal pickup, and the like; speaking and normal pickup are non-mute pickup demand modes, and mute and normal pickup are non-speaking pickup demand modes. The control device may determine, according to the received sound, the number of sound sources corresponding to each sound pickup device in the sound pickup system, the sound signal-to-noise ratio of each sound pickup device, and the semantics of the sound content picked up by each sound pickup device, and then determine the pickup demand mode of the user corresponding to each sound pickup device according to these. Then, the control device suppresses or enhances the sound picked up by the sound pickup device corresponding to the user according to the user's pickup demand mode.
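One way the three cues named above (sound-source count, signal-to-noise ratio, and semantic relevance to the conference) might be combined into a pickup demand mode is sketched below. The thresholds, rule order, and mode names are illustrative assumptions, not the patent's actual decision logic.

```python
# Hedged sketch: classify a user's pickup demand mode from the three
# cues the control device derives from the received sound. All
# thresholds and rules here are assumptions for illustration.

def demand_mode(source_count, snr_db, relevance,
                snr_threshold=15.0, relevance_threshold=0.6):
    if relevance < relevance_threshold:
        return "mute"    # content unrelated to the conference
    if snr_db < snr_threshold:
        return "mute"    # noisy environment, suppress this pickup
    if source_count == 1:
        return "speak"   # one clear speaker, enhance this pickup
    return "normal"      # multiple sources: pick up without processing
```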
In this embodiment, a user's behavior, speaking content, the sound environment in which the user is located, and the like can all indicate the user's pickup demand mode. For example, when a user wants to speak, the user may move closer to a nearby sound pickup device, so that the sound quality and the sound signal-to-noise ratio picked up by that device are relatively high, or that device corresponds to only one sound source; after receiving the sound, the control device then determines from the number of sound sources, the relatively high sound quality, the sound signal-to-noise ratio, and the like that the user wants to speak. For another example, after receiving the sound, the control device may recognize that the content spoken by the user is irrelevant to the conference content, and may determine that the user needs to be muted.
In some embodiments, the pickup demand pattern of the user may be represented or characterized by sound information of sound picked up by the pickup device. The sound information may include sound source information, quality information, content information, and the like. The sound source information may represent the number of sound sources corresponding to each sound pickup apparatus, the quality information may represent the sound signal-to-noise ratio of each sound pickup apparatus, the content information may represent the semantics of the sound content picked up by each sound pickup apparatus, and the like.
In some embodiments, the control device may determine, according to the number of sound sources corresponding to each sound pickup device, whether one of the users corresponding to that sound pickup device is speaking, and whether the sound picked up by that sound pickup device needs to be processed.
For example, if the number of sound sources corresponding to microphone 1 is only 1, this indicates that only one of the users corresponding to microphone 1 is speaking. The control device may determine that this user is speaking and that the user's pickup demand mode is speaking, and may then perform enhancement processing on the sound picked up by microphone 1. Since the sound picked up by microphone 1 includes the sound of the speaking user, this ensures that the speaking user's sound is played clearly and that other users can clearly hear the user's speaking content.
If the number of sound sources corresponding to microphone 1 is greater than 1, this indicates that several of the users corresponding to microphone 1 are currently speaking, and the control device cannot determine which user is speaking. In this case, the control device may determine that the pickup demand mode of the user is normal pickup, and perform neither suppression nor enhancement processing on the sound picked up by microphone 1.
In some embodiments, the control device may determine, according to the sound signal-to-noise ratio of each sound pickup device, a sound environment where a user corresponding to each sound pickup device is located, so as to determine whether to process sound picked up by each sound pickup device.
In some embodiments, the control device may compare the sound signal-to-noise ratio of each sound pickup device with a preset signal-to-noise ratio, thereby determining whether the sound signal-to-noise ratio of each sound pickup device is low or high. When the sound signal-to-noise ratio of the target sound pickup device is smaller than a first preset signal-to-noise ratio, the control device may suppress the sound picked up by the target sound pickup device; when the sound signal-to-noise ratio of the target sound pickup device is greater than or equal to the first preset signal-to-noise ratio, the sound signal-to-noise ratio of the target sound pickup device is considered high, and the control device may leave the sound picked up by the target sound pickup device unprocessed, or enhance the sound picked up by the target sound pickup device.
For example, if the control device determines that the sound signal-to-noise ratio of microphone 1 in the sound pickup system is low, it may be considered that the sound quality picked up by microphone 1 is poor, that the environment of the users corresponding to microphone 1 is noisy, or that microphone 1 corresponds to a plurality of users who are speaking at the same time. In this case, the sound picked up by microphone 1 cannot be heard clearly when played, and it also interferes with the sound picked up by other sound pickup devices so that their sound cannot be heard clearly either. The control device can therefore perform suppression processing on the sound picked up by microphone 1, thereby reducing the interference of the sound picked up by microphone 1 with the sound picked up by the other sound pickup devices.
If the control device determines that the sound signal-to-noise ratio of the microphone 1 in the sound pickup system is relatively high, it can be considered that the sound picked up by the microphone 1 is of good quality: the environment in which the user corresponding to the microphone 1 is located is relatively quiet, and the microphone 1 may correspond to one user who is speaking, and so on. In this case, the control device may enhance the sound picked up by the microphone 1. Since that sound includes the voice of the speaking user, listeners can hear it clearly when it is played, and thus hear clearly what the user corresponding to the microphone 1 is saying.
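The signal-to-noise-ratio decision described above can be sketched as follows. The threshold value, SNR estimates, and device names are illustrative assumptions, not values taken from this application:

```python
# Hedged sketch of the SNR-based suppress/enhance decision.
# The first preset signal-to-noise ratio is an assumed value.
FIRST_PRESET_SNR_DB = 15.0

def decide_action(snr_db: float, threshold_db: float = FIRST_PRESET_SNR_DB) -> str:
    """Return the processing action for one pickup device's sound."""
    if snr_db < threshold_db:
        return "suppress"   # low SNR: noisy environment, suppress this device's sound
    return "enhance"        # high SNR: leave unprocessed or enhance

# Example: three devices report (assumed) SNR estimates in dB.
devices = {"microphone 1": 8.2, "microphone 2": 21.5, "mobile phone 1": 15.0}
actions = {name: decide_action(snr) for name, snr in devices.items()}
```

A device whose SNR equals the threshold is treated as high-SNR, matching the "greater than or equal to" comparison in the text.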
In some embodiments, the control device may determine, according to the semantics of the sound content picked up by each sound pickup device, whether the user corresponding to each sound pickup device is speaking, and further determine whether the sound picked up by each sound pickup device needs to be processed.
In some embodiments, the control device may be preconfigured with the current meeting content, such as the meeting topic, meeting agenda, meeting summary, and the like. The control device can perform semantic recognition on the sound picked up by each pickup device and match the recognized semantic content against the preset content of the conference. If the similarity between the semantic content corresponding to the sound picked up by the target sound pickup device and the preset content of the conference is greater than or equal to the preset similarity, the user corresponding to the target sound pickup device is considered to be speaking about content related to the current conference; at this time, the control device may leave the sound picked up by the target sound pickup device unprocessed, or enhance it.
If the similarity between the semantic content corresponding to the sound picked up by the target sound pickup device and the preset content of the conference is smaller than the preset similarity, the user corresponding to the target sound pickup device is considered to be speaking about content unrelated to the current conference, and at this time the control device can suppress the sound picked up by the target sound pickup device.
For example, if the control device determines that the semantic content of the sound picked up by the microphone 1 in the sound pickup system is unrelated to the current conference content, it may be considered that the user corresponding to the microphone 1 is talking about something else rather than discussing the conference. The sound picked up by the microphone 1 would then interfere, at playback, with the sound picked up by the other sound pickup apparatuses. In this case, the control device can suppress the sound picked up by the microphone 1, thereby reducing its interference with the sound picked up by the other sound pickup devices.
Conversely, if the control device determines that the semantic content of the sound picked up by the microphone 1 in the sound pickup system is related to the current conference content, the user corresponding to the microphone 1 may be considered to be discussing or speaking about the conference. In this case, the control device may enhance the sound picked up by the microphone 1. Since that sound includes the voice of the speaking user, listeners can hear it clearly when it is played, and thus hear clearly the discussion of the user corresponding to the microphone 1.
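The semantic matching above can be sketched with a simple word-overlap similarity; this stands in for whatever semantic model the system would actually use, and the preset similarity threshold, meeting content, and utterances are illustrative assumptions:

```python
# Hedged sketch: Jaccard word overlap as a stand-in for semantic similarity.
def jaccard_similarity(a: str, b: str) -> float:
    """Similarity of two texts as word-set overlap, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

PRESET_SIMILARITY = 0.2  # assumed preset similarity threshold
meeting_content = "quarterly revenue forecast and regional sales targets"

def decide_by_semantics(recognized_text: str) -> str:
    """Suppress speech whose similarity to the preset meeting content is too low."""
    sim = jaccard_similarity(recognized_text, meeting_content)
    return "enhance" if sim >= PRESET_SIMILARITY else "suppress"
```

An utterance sharing several words with the preset content clears the threshold, while off-topic chatter does not.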
In some embodiments, the pickup system may be a preset distributed pickup system including a conference-site microphone or microphone array; when a user arrives at the conference site carrying an electronic device to participate in the conference, the user can conveniently access the electronic device to the preset pickup system.
For example, referring to fig. 6, a microphone 1, a microphone 2, a microphone 3, etc. disposed at the conference site may each be communicatively connected with the control device, thereby constituting the preset distributed sound pickup system. In this distributed pickup system, the position and pickup direction of each microphone may differ, so that sounds can be picked up in all directions at the conference site. When a user participating in the conference wants to join using the mobile phone 1 they carry, the user can access the mobile phone 1 to the preset distributed pickup system through a preset access mode, so that the mobile phone 1 can pick up the sound of the conference site. The microphones and the mobile phone then send the picked-up sound to the control device, which determines the pickup demand mode of the user corresponding to each device from that sound, and processes the sound picked up by each device accordingly.
In some embodiments, when the control device suppresses the target sound picked up by the target sound pickup device, it may remove or weaken the target sound from the mixed sound picked up by the other sound pickup devices, thereby suppressing the target sound and effectively muting the user corresponding to the target sound pickup device.
The control device may acquire the sound characteristics of the target sound or perform objectification processing on the target sound, so as to determine, from the sound characteristics or the objectification result, how many users' sounds the target sound includes. In the case where the target sound includes only one user's sound, the control device removes or attenuates that user's sound from the mix picked up by the other sound pickup devices. In the case where the target sound includes a plurality of users' sounds, the control device may remove or attenuate all of those users' sounds from the mix; alternatively, the control device may determine the target user whose sound quality is higher among the plurality of users' sounds, and then remove or attenuate only the sound of that target user from the mix picked up by the other sound pickup devices.
For example, if the control device determines, according to the received sound, that the user corresponding to the mobile phone 1 in the conference site may have a need for silence, the control device may remove or attenuate the target sound picked up by the mobile phone 1 from the mix picked up by the other sound pickup devices. The target sound may include only the sound of the user 3 corresponding to the mobile phone 1, or it may also include the sounds of the nearby users 2 and 4. The control device may remove or attenuate the sound of the user 3 from the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the suppressed mix is played it no longer includes the sound of the user 3, or, as shown in (a) of fig. 7, the sound of the user 3 in the mix is small; specifically, the sound loudness (in dB) of the user 3 is smaller than that of the other users, where the horizontal axis in fig. 7 represents time (in seconds, s). Alternatively, the control device may remove or attenuate the sounds of the users 2, 3 and 4 from the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the suppressed mix is played it no longer includes the sounds of the users 2, 3 and 4, or, as shown in (b) of fig. 7, their sounds in the mix are small; specifically, the sound loudness of the users 2, 3 and 4 is smaller than that of the other users.
Still alternatively, after performing sound feature recognition or objectification processing, the control device determines that the sound of the user 3 has the highest sound quality in the target sound, and may then remove or attenuate the sound of the user 3 from the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that the suppressed mix, when played, does not include the sound of the user 3, or includes only a faint sound of the user 3.
In some embodiments, the control apparatus may determine whether the sound quality of each of the plurality of users obtained after the objectification processing is high or low by judging whether that sound quality satisfies a first preset sound quality condition. The first preset sound quality condition indicates, for example, whether the signal-to-noise ratio of the user's sound is greater than or equal to a second preset signal-to-noise ratio, whether its loudness is greater than or equal to a preset loudness, and the like. When the signal-to-noise ratio of the user's sound is greater than or equal to the second preset signal-to-noise ratio, the user's sound quality may be determined to be high; when it is smaller than the second preset signal-to-noise ratio, the sound quality may be determined to be low. Likewise, when the loudness of the user's sound is greater than or equal to the preset loudness, the sound quality may be determined to be high, and when it is smaller than the preset loudness, the sound quality may be determined to be low.
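A minimal sketch of such a preset sound quality condition follows. The text lists SNR and loudness checks without fixing how they combine, so treating either criterion as sufficient is an assumption here, as are both threshold values:

```python
# Hedged sketch of the first preset sound quality condition.
SECOND_PRESET_SNR_DB = 10.0  # assumed second preset signal-to-noise ratio
PRESET_LOUDNESS_DB = -30.0   # assumed preset loudness

def sound_quality_is_high(snr_db: float, loudness_db: float) -> bool:
    """A user's sound counts as high quality when either its SNR or its
    loudness clears the corresponding preset threshold (an assumed
    combination of the criteria named in the text)."""
    return snr_db >= SECOND_PRESET_SNR_DB or loudness_db >= PRESET_LOUDNESS_DB
```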
In some embodiments, the control device may remove the target sound from the above-described mixed sound based on a deep neural network. Specifically, the control device obtains a time-frequency masking value for the target sound through the deep neural network, and multiplies the mixed sound by that masking value, thereby removing the target sound from the mix. Alternatively, in other embodiments, the target sound may be used as a reference signal: an adaptive filter is constructed based on existing multichannel signal processing techniques, and the components of the mixed sound signal correlated with the reference signal are filtered out by the adaptive filter, thereby suppressing the target sound.
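The mask-multiplication step can be sketched on a toy spectrogram. The mask convention assumed here (mask is 1 where the target sound dominates, so the mix is multiplied by its complement) and the array values are illustrative; the real mask would come from the deep neural network:

```python
import numpy as np

def remove_target(mix_spec: np.ndarray, target_mask: np.ndarray) -> np.ndarray:
    """Attenuate the time-frequency bins dominated by the target sound.

    mix_spec    : spectrogram of the mixed sound, shape (freq, time)
    target_mask : values in [0, 1], 1 where the target sound dominates
    """
    # Keeping the complement of the target mask leaves the mix with the
    # target sound removed (or attenuated, for fractional mask values).
    return mix_spec * (1.0 - target_mask)

# Toy example: 2 frequency bins x 3 frames; bin 0 is entirely the target.
mix = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
mask = np.array([[1.0, 1.0, 1.0],
                 [0.0, 0.0, 0.0]])
cleaned = remove_target(mix, mask)
```

Bin 0 is zeroed out while bin 1 passes through unchanged, which is the suppression behavior the paragraph describes.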
In some embodiments, when the control device enhances the target sound picked up by the target sound pickup device, it may strengthen the target sound, or remove the other sounds in the mixed sound picked up by the sound pickup devices, so that the user corresponding to the target sound pickup device can speak and be heard. The enhancement may include loudness enhancement, objectification processing, sound effect processing, and the like.
The control device may acquire the sound characteristics of the target sound or perform objectification processing on the target sound, so as to determine, from the sound characteristics or the objectification result, how many users' sounds the target sound includes. In the case where the target sound includes only one user's sound, the control device enhances that user's sound in the mix picked up by the other sound pickup devices. In the case where the target sound includes a plurality of users' sounds, the control device may enhance all of those users' sounds in the mix; alternatively, the control device may determine the target user whose sound quality is higher among the plurality of users' sounds, and then enhance only the sound of that target user in the mix picked up by the other sound pickup devices.
For example, if the control device determines, according to the received sound, that the user corresponding to the mobile phone 1 in the conference site may have a need to speak, the control device may enhance the target sound picked up by the mobile phone 1 in the mix picked up by the other sound pickup devices. The target sound may include only the sound of the user 3 corresponding to the mobile phone 1, or it may also include the sounds of the nearby users 2 and 4. The control device may enhance the sound of the user 3 in the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the enhanced mix is played, as shown in (a) of fig. 8, it mainly includes the sound of the user 3, or the sound of the user 3 is very loud; specifically, the sound loudness of the user 3 is greater than that of the other users. Alternatively, the control device may enhance the sounds of the users 2, 3 and 4 in the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the enhanced mix is played, as shown in (b) of fig. 8, it mainly includes the sounds of the users 2, 3 and 4, or their sounds are very loud; specifically, the sound loudness of the users 2, 3 and 4 is greater than that of the other users.
Still alternatively, after performing sound feature recognition or objectification processing, the control device determines that the sound of the user 3 has the highest sound quality in the target sound, and may then enhance the sound of the user 3 in the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the enhanced mix is played it mainly includes the sound of the user 3, or the sound of the user 3 is very loud.
In some embodiments, the control device may determine whether the sound quality of each of the plurality of users obtained after the objectification processing is high or low by judging whether that sound quality satisfies a second preset sound quality condition. The second preset sound quality condition may likewise indicate whether the signal-to-noise ratio of the user's sound is greater than or equal to the second preset signal-to-noise ratio, whether its loudness is greater than or equal to the preset loudness, and so on, with high or low quality determined in the same way as for the first preset sound quality condition.
The objective of the above-described objectification processing of the target sound is to separate at least one user's sound from the target sound, while the objective of the loudness enhancement processing is to make the target sound stand out when the mix is played, so that the user to whom the target sound belongs can be heard.
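Loudness enhancement after objectification can be sketched as remixing the separated per-user tracks with a gain on the target users. The gain value, user names, and sample data are illustrative assumptions:

```python
import numpy as np

def enhance_in_mix(tracks: dict, target_users: set, gain_db: float = 6.0) -> np.ndarray:
    """Remix separated per-user tracks, boosting the target users' tracks.

    tracks       : {user name: audio samples} from objectification
    target_users : users whose sound should stand out in the mix
    gain_db      : assumed boost; +6 dB is roughly a 2x amplitude gain
    """
    gain = 10 ** (gain_db / 20)
    out = np.zeros_like(next(iter(tracks.values())), dtype=float)
    for user, sig in tracks.items():
        out = out + np.asarray(sig, dtype=float) * (gain if user in target_users else 1.0)
    return out

tracks = {"user 2": [0.1, 0.1], "user 3": [0.2, 0.2], "user 4": [0.1, 0.1]}
mix = enhance_in_mix(tracks, {"user 3"})  # user 3's sound dominates the mix
```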
In some embodiments, after the control device performs the objectification processing on the target sound picked up by the target sound pickup device, it may also apply corresponding sound effect processing, per user, to the sounds obtained from the objectification. Specifically, for a certain target user at the conference site, the control device determines the azimuth relation of the other users relative to that target user and uses it as the azimuth information of the target user. The control device then performs sound effect processing on the other users' sounds according to this azimuth information, so that the playing effect of their sounds matches their azimuth relation to the target user. Finally, the control device sends the processed sound to the target user, so that the target user hears the other users' speech coming from their respective directions; this restores the real acoustics of the conference scene and enhances the user's interaction experience in a multi-person conference.
After the control device performs the suppression or enhancement processing on the sound, it can save the sound; it can also send the sound to an audio device at the conference site or to an electronic device in the sound pickup system, where the audio device or electronic device plays it.
In some embodiments, if the conference site is small and the users can hear one another's voices directly, the process of picking up sound by the sound pickup devices can be regarded as a recording process, and the picked-up sound does not need to be played back. The control device may store the processed sound as an audio file, such as a meeting summary or a meeting recording, or store the conference video of the scene together with the processed sound as a video file.
When a user needs to review the conference content, the user can play the stored audio or video file. If the sound picked up by the target pickup device was suppressed during recording, the sound of the target user corresponding to that device cannot be heard, or can barely be heard, during playback, so a listener hears the other users clearly while the speech of the user who was to be muted is not heard. If the sound picked up by the target sound pickup apparatus was enhanced during recording, the listener may hear only, or mainly, the sound of the target user corresponding to that device, and can thus clearly hear the speech of the user who was meant to be heard.
In some embodiments, if the conference site is relatively large and some users are too far apart to hear one another, the sound picked up by the sound pickup apparatuses needs to be played on site, so that distant users can follow the conference and hear the speaking users. The control device can send the suppressed or enhanced sound to the audio devices of the conference site or to the electronic devices carried by the users, which then play the sound during the conference. If the sound picked up by the target pickup device is suppressed, users at the site may not hear, or may barely hear, the sound of the target user corresponding to that device, so the speech of the user who is to be muted is not heard while the other users remain audible. If the sound picked up by the target sound pickup apparatus is enhanced, users at the site may hear only, or mainly, the sound of the target user, so the user who wants to speak can be heard clearly.
The audio device may be a device that can play sound such as a speaker, a sound box, or the like provided at a conference site, and the electronic device may be a device having a sound pickup function in the foregoing embodiment.
In some embodiments, after the control device performs objectification on the mixed sound picked up by the sound pickup devices, it may also assign a corresponding audio track to each user's sound, that is, one track per user. When the mix is played, a listener can then operate the audio tracks according to their own needs, actively choosing whose speech to listen to.
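The per-user tracks above can be sketched as a small mixer that lets a listener mute or unmute individual users before the tracks are summed for playback. The class name, user names, and list-based audio samples are illustrative assumptions:

```python
# Hedged sketch of listener-controlled per-user audio tracks.
class TrackMixer:
    def __init__(self, tracks):
        # One track per user, as produced by the objectification step.
        self.tracks = {user: list(sig) for user, sig in tracks.items()}
        self.muted = set()

    def mute(self, user):
        self.muted.add(user)

    def unmute(self, user):
        self.muted.discard(user)

    def mix(self):
        """Sum every unmuted track sample-by-sample for playback."""
        n = len(next(iter(self.tracks.values())))
        out = [0.0] * n
        for user, sig in self.tracks.items():
            if user not in self.muted:
                for i, s in enumerate(sig):
                    out[i] += s
        return out

mixer = TrackMixer({"user 1": [0.1, 0.2], "user 2": [0.3, 0.4]})
mixer.mute("user 2")  # the listener chooses not to hear user 2
```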
In some embodiments, the objectification processing and the loudness enhancement processing may be performed by a cloud device, or by one of the sound pickup devices in the sound pickup system.
As can be seen from the foregoing, in the sound processing method of this embodiment of the present application, each sound pickup device may actively pick up a user's sound during the conference, and the control device may automatically identify, from that sound, the user's pickup demand mode, such as silence or speaking. The control device then suppresses or enhances the sound picked up by the user's sound pickup device according to that pickup demand mode. Wherever the user is, the method can automatically identify the user's pickup demand mode and, when that mode changes, process the user's picked-up sound accordingly in real time. This meets the demands of different users at different positions in the conference, reduces the manual operations needed to change the pickup state of the pickup devices, ensures the user experience, and improves the accuracy and quality of the sound processing.
Scene two
Multiple users in different conference sites participate in the same conference, and each conference site includes at least one user. The sound pickup devices in all the conference sites may together form a sound pickup network or system. One sound pickup apparatus may correspond to one user or to a plurality of users; that is, one sound pickup apparatus may pick up the sound of one user or of several users.
For example, referring to fig. 9, users at conference site 1 and conference site 2 jointly participate in a conference; relative to the users at conference site 1, the users at conference site 2 may be considered online users, and the users at conference site 1 offline users. The sound pickup apparatuses in conference site 2 may be regarded as far-end apparatuses relative to those in conference site 1, which may in turn be regarded as near-end apparatuses.
As shown in fig. 9, the sound pickup apparatus of the conference site 1 includes a microphone and a user's mobile phone, wherein the microphone 1 corresponds to the user 1, the microphone 2 corresponds to the user 2, the mobile phone 1 corresponds to the user 3, the mobile phone 2 corresponds to the user 4, and the mobile phone 3 corresponds to the user 5. Each sound pickup apparatus can pick up not only the sound of the corresponding user but also the sound of other users around. The sound pickup apparatus of the conference site 2 may also include a microphone, a user's mobile phone, and the like.
In the conference scenario shown in fig. 9, after the sound pickup apparatuses at the different conference sites are connected to the sound pickup system, they start picking up sound and transmit the picked-up sound to the control apparatus. The control apparatus determines, from the received sound, the pickup demand mode of the user corresponding to each sound pickup apparatus at each conference site; the pickup demand modes include silence, speaking, normal pickup, and the like. The control device may then suppress or enhance the sound picked up by each sound pickup device according to the user's pickup demand mode. The ways in which the control device recognizes the pickup demand pattern and performs the suppression and enhancement processing are as described in the foregoing embodiments and are not repeated here.
After the control device performs the suppression or enhancement processing on the sound, it can save the sound; it can also send the sound to an audio device or electronic device in the pickup system at a conference site, where the audio device or electronic device plays it. The conference site here may be a single site including all the offline users, or different sites including both offline and online users.
When the control device sends the mixed sound of the first conference site to the second conference site for playing, if the user corresponding to the target pickup device in the first conference site has a mute requirement, the users in the second conference site cannot hear the sound picked up by that target pickup device, or hear it only faintly, which both suppresses the target sound and effectively mutes the user corresponding to the target pickup device. If the user corresponding to the target sound pickup device in the first conference site has a requirement to speak, the users in the second conference site mainly hear, or hear very clearly, the sound picked up by that target pickup device, so that the user corresponding to it can speak and be heard.
For example, if the control device determines that the user corresponding to the handset 1 in the conference site 1 may have a need for silence, the control device may remove or attenuate the target sound picked up by the handset 1 from the sound mix picked up by the other sound pickup devices in the conference site 1. The target sound may include the sound of the user 3 corresponding to the mobile phone 1 in the conference site 1, or the target sound may also include the sound of the user 3 corresponding to the mobile phone 1 in the conference site 1 and the sound of the other users 2 and 4.
The control device may remove or attenuate the sound of the user 3 from the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the suppressed mix is played at the conference site 2 it does not include the sound of the user 3, or, as shown in (a) of fig. 10, the sound of the user 3 is small; the users at the conference site 2 thus do not hear, or can barely hear, the voice and content of the user 3's speech.
Alternatively, the control device may remove or attenuate the sounds of the users 2, 3 and 4 from the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the suppressed mix is played at the conference site 2 it does not include their sounds, or, as shown in (b) of fig. 10, their sounds are small; the users at the conference site 2 thus do not hear, or can barely hear, the voices and content of the speech of the users 2, 3 and 4.
Still alternatively, after performing sound feature recognition or objectification processing, the control device determines that the sound of the user 3 has the highest sound quality in the target sound, and may then remove or weaken the sound of the user 3 from the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the suppressed mix is played at the conference site 2 it does not include the sound of the user 3, or the sound of the user 3 is small; the users at the conference site 2 thus do not hear, or can barely hear, the voice and content of the user 3's speech.
As another example, if the control device determines that the user corresponding to the mobile phone 1 in the conference site 1 may have a requirement for speaking, the control device may enhance the target sound picked up by the mobile phone 1 in the sound mixing picked up by the other sound pickup devices in the conference site 1. The target sound may include the sound of the user 3 corresponding to the mobile phone 1 in the conference site 1, or the target sound may also include the sound of the user 3 corresponding to the mobile phone 1 in the conference site 1 and the sound of the other users 2 and 4.
The control device may enhance the sound of the user 3 in the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the enhanced mix is played at the conference site 2, as shown in (a) of fig. 11, it mainly includes the sound of the user 3, or the sound of the user 3 is very loud; the users at the conference site 2 can thus hear only, or hear more clearly, the voice and content of the user 3's speech.
Alternatively, the control device may enhance the sounds of the users 2, 3 and 4 in the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the enhanced mix is played at the conference site 2, as shown in (b) of fig. 11, it mainly includes the sounds of the users 2, 3 and 4, or their sounds are very loud; the users at the conference site 2 can thus hear only, or hear more clearly, the voices and content of the speech of the users 2, 3 and 4.
Or, after performing sound feature recognition or objectification processing, the control device determines that the sound of the user 3 has the highest sound quality in the target sound, and may then enhance the sound of the user 3 in the mixed sound picked up by the microphone 1, the microphone 2, the mobile phone 2 and the mobile phone 3, so that when the enhanced mix is played at the conference site 2 it mainly includes the sound of the user 3, or the sound of the user 3 is very loud; the users at the conference site 2 can thus hear only, or hear more clearly, the voice and content of the user 3's speech.
The manner in which the control device determines the sound quality may refer to the manner in which the sound quality is determined in the foregoing embodiment, which is not described herein.
When the conference site 2 plays the sound of the conference site 1, the sound may be played through an audio device at the conference site 2. The audio device may be a loudspeaker, a sound box, or another device provided at the conference site 2, or may be an electronic device carried by a user.
In some embodiments, after the control device performs objectification processing on the mixed sound picked up by the sound pickup devices, it may also set a corresponding audio track for the obtained sound of each user, that is, one track per user. Thereafter, the control device transmits the mix, including the track information, to the other conference site. If the mix is played by an electronic device carried by a user at the other conference site, that user can select the audio tracks of different users on the electronic device according to his or her own needs, thereby flexibly choosing which users' speeches to listen to.
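The per-user-track idea can be sketched as follows. This is an illustrative sketch only, assuming the separation ("objectification") step has already produced one sample stream per user; the names `Track`, `build_tracks`, and `render_mix` are ours, not from the embodiments:

```python
# Sketch: wrap each separated user's sound in its own track so the
# listening user can choose which speakers' tracks are mixed for playback.

class Track:
    def __init__(self, user_id, samples):
        self.user_id = user_id
        self.samples = samples  # list of float audio samples for this user

def build_tracks(separated):
    """Map {user_id: samples} from the separation step into one Track per user."""
    return [Track(uid, s) for uid, s in separated.items()]

def render_mix(tracks, selected_ids):
    """Sum only the tracks the listening user selected on their device."""
    length = max(len(t.samples) for t in tracks)
    mix = [0.0] * length
    for t in tracks:
        if t.user_id in selected_ids:
            for i, v in enumerate(t.samples):
                mix[i] += v
    return mix
```

A listener who selects only user 3's track would then hear only user 3's speech in the rendered mix.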
As can be seen from the foregoing, the sound processing method according to the embodiments of the present application may be applied to different conference scenes. Each sound pickup device located at a different conference site may actively pick up the sound of a user during the conference, and the control device may automatically identify, according to the sound picked up by the sound pickup device, the pickup demand mode, such as silence or speaking, of the user corresponding to that device. The control device then suppresses or enhances the sound picked up by the sound pickup device corresponding to the user according to the user's pickup demand mode. No matter where the user is, the sound processing method can automatically identify the user's pickup demand mode and, when that mode changes, process the picked-up sound accordingly in real time. This meets the demands of different users at different positions in a conference, reduces the operations of manually changing the pickup state of a sound pickup device, ensures the user experience, and improves the accuracy and quality of sound processing.
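The flow above can be illustrated with a minimal sketch, assuming an SNR-based mode decision (one of the recognition criteria described in the embodiments and recited in claim 4); the threshold, gain values, and function names are assumptions for illustration, not taken from the embodiments:

```python
# Sketch of the two-step flow: (1) infer each user's pickup demand mode
# from the sound picked up by their device, (2) suppress or enhance that
# user's component in the mix accordingly.

def pickup_demand_mode(snr_db, snr_threshold_db=10.0):
    """A low signal-to-noise ratio suggests the user is not speaking
    (mute); the 10 dB threshold is an assumed example value."""
    return "mute" if snr_db < snr_threshold_db else "non-mute"

def process_mix(components, modes, suppress_gain=0.0, enhance_gain=2.0):
    """components: {user: list of samples}; modes: {user: mode string}.
    Applies a per-user gain before the components are summed into a mix."""
    out = {}
    for user, samples in components.items():
        mode = modes.get(user, "non-mute")
        if mode == "mute":
            gain = suppress_gain   # remove or weaken this user's sound
        elif mode == "speak":
            gain = enhance_gain    # make this user's speech louder
        else:
            gain = 1.0             # leave the sound unchanged
        out[user] = [gain * s for s in samples]
    return out
```
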
The preset access manner in the foregoing embodiments may be access by scanning a two-dimensional code. For example, the user 3 shown in fig. 12 scans a conference two-dimensional code using the mobile phone 1 to join the conference, and the mobile phone 1 and the other sound pickup devices form a pickup system or pickup network; after that, the mobile phone 1 starts picking up the sound of the conference site. In other embodiments, the preset access manner may be access through a link. For example, the user 3 clicks a conference link on the mobile phone 1 to join the conference, after which the mobile phone 1 starts picking up the sound of the conference site. A sound pickup device can thus quickly join the conference system through the preset access manner, quickly forming the pickup system or pickup network and improving the efficiency of holding a conference.
In some embodiments, the control device may also determine the number of sound sources in the sound pickup system by combining the sounds picked up by the respective sound pickup devices. When a single sound source exists in the sound pickup system, the control device can enhance the sound corresponding to that source in the mix picked up by all the sound pickup devices, so that when the mix is played, the speech of the user corresponding to that source is louder, achieving the purpose of highlighting that user's speech. When multiple sound sources exist, the control device cannot determine which sound source corresponds to a user who is speaking, and therefore does not need to process the sound picked up by the sound pickup devices.
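The decision rule can be sketched as below. Counting sound sources is a hard estimation problem in practice (for example via direction-of-arrival or voiceprint clustering); the crude stand-in here, which treats each device whose picked-up level exceeds a threshold as hearing one active source, and both function names, are illustrative assumptions:

```python
# Sketch: enhance only when exactly one sound source is detected across
# the pickup system; otherwise leave the mix unprocessed.

def count_active_sources(device_levels, threshold=0.1):
    """device_levels: {device_id: picked-up signal level}. Toy stand-in
    for real source counting; threshold is an assumed example value."""
    return sum(1 for level in device_levels.values() if level > threshold)

def decide_processing(num_sources):
    if num_sources == 1:
        return "enhance"   # single speaker: boost that source in the mix
    return "no-op"         # ambiguous who is speaking: leave the mix alone
```
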
In the foregoing embodiments, the control device may determine the sound of a user in the sound pickup system from the sound picked up by a sound pickup device by means of voiceprint recognition, voice recognition, or the like.
When the sound of a conference site is played back in real time at that same site, the sound played by a sound pickup device may include the sound of its own user and thereby produce an echo or similar interference for that user. To avoid this, in some embodiments, the control device may remove the sound of the user picked up by the target sound pickup device from the mixed sound before sending the mix to that target sound pickup device. In this way, even while the user speaks through the target sound pickup device, the device plays only the voices and content of the other users' speeches, so the user corresponding to the target sound pickup device is not disturbed.
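This is essentially the "mix-minus" convention familiar from conferencing audio: each device receives the full mix minus its own user's contribution. A minimal sketch (the function name and fixed-length sample lists are ours, for illustration):

```python
# Sketch: build the return mix for a device by summing every user's
# separated sound except the user belonging to that device.

def mix_minus(per_user_sounds, exclude_user):
    """per_user_sounds: {user_id: list of samples}. Returns the mix sent
    back to exclude_user's device, with their own sound removed."""
    length = max(len(s) for s in per_user_sounds.values())
    mix = [0.0] * length
    for user, samples in per_user_sounds.items():
        if user == exclude_user:
            continue  # drop the target device's own user from its return mix
        for i, v in enumerate(samples):
            mix[i] += v
    return mix
```
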
In the foregoing embodiments, the audio processing method in the embodiments of the present application is described by taking a multi-person conference scenario as an example. In practical applications, the sound processing method in the embodiment of the present application may also be applied to other multi-user interaction scenarios. In any multi-person interaction scene, the control device can determine the pickup demand mode of the user corresponding to the pickup device according to the sound picked up by the pickup device, and then inhibit or enhance the sound picked up by the pickup device according to the pickup demand mode, so that the purpose of automatically processing the sound in the multi-person interaction scene is achieved.
In a current common conference scene, if a target user does not want to speak, the target user needs to manually mute the nearby microphone, or the host of the conference mutes it, so that the microphone cannot pick up the sound of the target user, thereby suppressing that sound. However, the sound of the target user propagates in all directions within the conference site: although the microphone corresponding to the target user cannot pick up the sound, other microphones may still pick it up. Even if the microphone near a target user who does not want to speak is muted, it is difficult to prevent other microphones from picking up the target user's sound, which greatly reduces the suppression effect; when the picked-up sound is played, a listening user may still hear the target user's voice and speech content. By contrast, the sound processing method of the embodiments of the present application can automatically identify the pickup demand mode of the user corresponding to a sound pickup device according to the sound picked up by that device, and automatically suppress or enhance the picked-up sound according to that mode. This reduces the operations in which a user actively changes the pickup state of a sound pickup device, and improves the participation experience of the user.
When the target user needs to be muted, the pickup state of the sound pickup device corresponding to the target user does not need to be changed. Instead, the sound of the target user is removed from, or weakened in, the mix of the sounds picked up by the other sound pickup devices. Thus, even if multiple sound pickup devices pick up the sound of the target user, a user listening to the mix cannot hear the target user's voice during playback, which improves the sound suppression effect.
In an existing sound suppression method, the space of the conference site is partitioned according to the topological structure of the multiple microphones at the site; if a target user does not want to speak, the target partition where the user is located is determined, and the sound picked up in that partition is eliminated, thereby suppressing the sound of the target user. However, because the partitions are fixed, when the target user moves or stands at the boundary of a partition, the target user's sound may still be picked up, which greatly reduces the suppression effect; when the picked-up sound is played, a listening user may intermittently hear the target user's speech. The sound processing method of the embodiments of the present application does not need to partition the conference site. Whether the target user moves around the site or stands at any position, the control device can identify the pickup demand mode from the sound picked up by the sound pickup device corresponding to the target user, and thereby determine whether the target user needs to be muted, is speaking, and so on. When the target user needs to be muted, the control device can remove or weaken the sound picked up by the corresponding sound pickup device from the mix picked up by the other sound pickup devices, so that the listening user cannot hear the target user's voice during playback, and no intermittent leakage of the target user's sound occurs.
In addition, in the foregoing embodiments, the sound pickup devices forming the pickup system or pickup network may be mobile electronic devices such as mobile phones, notebook computers, and tablet computers. When participating in a multi-person interaction scene such as a conference, picking up the user's voice is therefore not affected by the user's movement or position, and the user can participate more flexibly. Moreover, such an electronic device is typically closer to the user than other preset sound pickup devices such as microphones, is easy to carry, and is unaffected by the user's movement. The near-field human voice component of the user in the sound picked up by the electronic device is therefore relatively strong, and a clear near-field voice of the user can be extracted using an existing voice enhancement algorithm, such as spectral subtraction noise reduction in signal processing or a deep-learning-based noise reduction network. As a result, the sound suppression or enhancement processing is more accurate and the processing effect is better.
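Spectral subtraction, mentioned above as one applicable voice enhancement algorithm, can be sketched for a single frame as follows. This is a minimal illustration, not the embodiments' implementation: framing, windowing, and noise tracking are omitted, and the noise magnitude spectrum is assumed to have been estimated from a speech-free segment:

```python
# Minimal magnitude spectral subtraction for one audio frame: subtract an
# estimated noise magnitude spectrum, keep the noisy phase, and resynthesize.

import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.01):
    """frame: 1-D array of samples; noise_mag: estimated noise magnitude
    spectrum (same length as rfft of the frame); floor: spectral floor
    fraction that prevents negative magnitudes ("musical noise" guard)."""
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    phase = np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # subtract with floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

Because each bin's magnitude can only shrink, the processed frame never has more energy than the noisy input, which is the intended suppression behavior.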
It will be appreciated that in order to achieve the above-described functionality, the electronic device comprises corresponding hardware and/or software modules that perform the respective functionality. The steps of an algorithm for each example described in connection with the embodiments disclosed herein may be embodied in hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation is not to be considered as outside the scope of this application.
The present embodiment may divide the electronic device into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is merely a logical function division; there may be other division manners in actual implementation.
Embodiments of the present application also provide an electronic device, as shown in fig. 13, which may include one or more processors 1001, memory 1002, and a communication interface 1003.
Wherein a memory 1002, a communication interface 1003, and a processor 1001 are coupled. For example, the memory 1002, the communication interface 1003, and the processor 1001 may be coupled together by a bus 1004.
Wherein the communication interface 1003 is used for data transmission with other devices. The memory 1002 has stored therein computer program code. The computer program code comprises computer instructions which, when executed by the processor 1001, cause the electronic device to perform the sound processing method in the embodiments of the present application.
The processor 1001 may be a processor or a controller, for example, a central processing unit (Central Processing Unit, CPU), a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an Application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that performs the function of a computation, e.g., a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
The bus 1004 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The bus 1004 may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or only one type of bus.
Embodiments of the present application also provide a computer-readable storage medium having stored therein computer program code which, when executed by the above-mentioned processor, causes an electronic device to perform the relevant method steps in the above-mentioned method embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the relevant method steps of the method embodiments described above.
The electronic device, the computer storage medium or the computer program product provided in the present application are configured to perform the corresponding methods provided above, and therefore, the advantages achieved by the electronic device, the computer storage medium or the computer program product may refer to the advantages of the corresponding methods provided above, which are not described herein.
It will be apparent to those skilled in the art from this description that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A sound processing method, characterized in that it is applied to a control device in a sound pickup system, the sound pickup system further comprising a first device and a second device capable of picking up sound at the position of the first device; the sound processing method comprises the following steps:
according to the first sound picked up by the first equipment, determining a pickup demand mode of a first user corresponding to the first equipment, and according to the second sound picked up by the second equipment, determining a pickup demand mode of a second user corresponding to the second equipment;
removing or weakening the sound of the first user in the mixed sound of the first sound and the second sound when the sound pickup demand mode of the first user is mute and the sound pickup demand mode of the second user is non-mute; the audio mix after removing or attenuating the sound of the first user is for playing by the first device and/or the second device.
2. The method according to claim 1, wherein the method further comprises:
when the sound pickup demand mode of the first user is speaking and the sound pickup demand mode of the second user is non-speaking, enhancing the sound of the first user in the mixed sound of the first sound and the second sound; the audio mix after enhancing the first user's voice is for playing by the first device and/or the second device.
3. The method according to any one of claims 1-2, wherein determining, from the first sound picked up by the first device, a pickup demand pattern of a first user corresponding to the first device includes:
determining the number of sound sources corresponding to the first equipment according to the first sound;
if the number of sound sources corresponding to the first device is 1, determining that the pickup demand mode of the first user is speaking;
and if the number of sound sources corresponding to the first device is greater than 1, determining that the pickup demand mode of the first user is non-speaking.
4. The method according to any one of claims 1-2, wherein determining, from the first sound picked up by the first device, a pickup demand pattern of a first user corresponding to the first device includes:
determining a sound signal-to-noise ratio corresponding to the first equipment according to the first sound;
if the sound signal-to-noise ratio corresponding to the first equipment is smaller than a first preset signal-to-noise ratio, determining that the pickup demand mode of the first user is mute;
and if the sound signal-to-noise ratio corresponding to the first equipment is greater than or equal to the first preset signal-to-noise ratio, determining that the pickup demand mode of the first user is non-mute.
5. The method according to any one of claims 1-2, wherein determining, from the first sound picked up by the first device, a pickup demand pattern of a first user corresponding to the first device includes:
acquiring semantic content of the first sound;
if the similarity between the semantic content of the first sound and preset content is smaller than a preset similarity, determining that the pickup demand mode of the first user is mute;
and if the similarity between the semantic content of the first sound and the preset content is greater than or equal to the preset similarity, determining that the pickup demand mode of the first user is non-mute.
6. The method of any of claims 1-5, wherein the control device is a first device or a second device in the pickup system; or the control device is cloud device.
7. The method according to any one of claims 4-6, wherein said removing or attenuating the sound of the first user in the mixed sound of the first sound and the second sound comprises:
performing objectification processing on the first sound to obtain the sound of the first user;
in the mixing of the first sound and the second sound, the sound of the first user is removed or attenuated.
8. The method of claim 2, wherein the enhancing the first user's voice in the mixed sound of the first and second voices comprises:
performing objectification processing on the first sound to obtain the sound of the first user;
in the mixing of the first sound and the second sound, the sound of the first user is enhanced or the sound of other users except the first user is removed.
9. The method according to any one of claims 4-6, wherein said removing or attenuating the sound of the first user in the mixed sound of the first sound and the second sound comprises:
performing objectification processing on the first sound, and determining the sound of a first user meeting a first preset tone quality condition in the first sound according to a result obtained by the objectification processing; the first user comprises one or more users;
and removing or weakening, in the mixed sound of the first sound and the second sound, the sound of the first user meeting the first preset tone quality condition.
10. The method of claim 2, wherein the enhancing the first user's voice in the mixed sound of the first and second voices comprises:
performing objectification processing on the first sound, and determining, according to a result obtained by the objectification processing, the sound of the first user meeting a second preset tone quality condition in the first sound; the first user comprises one or more users;
in the mixed sound of the first sound and the second sound, enhancing the sound of the first user meeting the second preset tone quality condition.
11. A sound processing method, characterized in that it is applied to a sound pickup system, the sound pickup system comprising a control device, a first device, and a second device capable of picking up sound at the position of the first device; the sound processing method comprises the following steps:
the first device picking up a first sound;
the second device picking up a second sound;
the control device determines a pickup demand mode of a first user corresponding to the first device according to the first sound, and determines a pickup demand mode of a second user corresponding to the second device according to the second sound;
removing or weakening the sound of the first user in the mixed sound of the first sound and the second sound when the sound pickup demand mode of the first user is mute and the sound pickup demand mode of the second user is non-mute; the audio mix after removing or attenuating the sound of the first user is for playing by the first device and/or the second device.
12. A pickup system comprising a control device, a first device, and a second device operable to pick up sound from a location where the first device is located;
the first device is used for picking up a first sound;
the second device is used for picking up a second sound;
the control device is used for determining a pickup demand mode of a first user corresponding to the first device according to the first sound, and determining a pickup demand mode of a second user corresponding to the second device according to the second sound; removing or weakening the sound of the first user in the mixed sound of the first sound and the second sound when the sound pickup demand mode of the first user is mute and the sound pickup demand mode of the second user is non-mute;
the first device is further configured to play a mixed sound after removing or weakening the sound of the first user;
the second device is further configured to play a mix after removing or attenuating the sound of the first user.
13. An electronic device comprising a memory, one or more processors; the memory is coupled with the processor; wherein the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the sound processing method of any one of claims 1-10 or the sound processing method of claim 11.
14. A computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the sound processing method of any one of claims 1-10 or to perform the sound processing method of claim 11.
CN202210983429.1A 2022-08-16 2022-08-16 Sound processing method, sound pickup system and electronic equipment Pending CN117641191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210983429.1A CN117641191A (en) 2022-08-16 2022-08-16 Sound processing method, sound pickup system and electronic equipment


Publications (1)

Publication Number Publication Date
CN117641191A true CN117641191A (en) 2024-03-01



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination