CN114267365A - Audio processing method, device, system and storage medium - Google Patents

Audio processing method, device, system and storage medium

Info

Publication number: CN114267365A
Application number: CN202111553969.8A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Prior art keywords: space, voice signals, user, voice, audio processing
Other languages: Chinese (zh)
Inventors: 韩润强, 张晨, 邢文浩, 王文峰
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority: CN202111553969.8A

Abstract

The present disclosure provides an audio processing method, apparatus, system, and storage medium. The audio processing method may include the steps of: collecting a voice signal uttered by a user in a first space; and sending information including the voice signal to a server, where the information further includes a space identification of the first space, so that the server, based on the space identification, sends the voice signal to devices in spaces other than the first space, thereby enabling voice communication between the user in the first space and users in those other spaces.

Description

Audio processing method, device, system and storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio processing method, apparatus, system, and storage medium for audio enhancement in a video conference scenario.
Background
Modern enterprises are growing ever larger, and their personnel are increasingly widely and sparsely distributed, so the need for remote collaboration is becoming more urgent. Because video conferencing can transmit sound and image data in real time, it has become an effective communication tool for collaboration between remote offices.
For example, when several people in a conference room hold an audio/video call with remote participants, professional audio equipment can be used to collect near-end speech and play far-end speech in order to obtain a good communication effect. Many conference microphones are available on the market, but most audio devices that can provide a quality experience are expensive. In addition, with the development of the mobile internet, people can also use devices such as smartphones and tablet computers for video conferences. If several people in one room join a video conference through the computer or phone of only one participant, sound collection and playback are unbalanced during the conference: a participant close to that device obtains a good playback effect, while a participant far from it obtains a poor one. If instead each participant in the same room uses his or her own device for the video conference, loud howling can occur. This is because the sound collected by one device is transmitted over the network to another device in the same room, amplified, and played; that sound is then picked up again by the first device, forming a positive feedback loop that causes howling and seriously degrades the conference experience.
Disclosure of Invention
The present disclosure provides an audio processing method, an audio processing apparatus, an audio processing system, an electronic device, and a storage medium to solve at least the above-mentioned problems.
According to a first aspect of the embodiments of the present disclosure, there is provided an audio processing method applied to a first device, the first device being one of a plurality of devices in a first space. The audio processing method may include: collecting a voice signal uttered by a user in the first space; and sending information including the voice signal to a server, where the information further includes a space identification of the first space, so that the server, based on the space identification, sends the voice signal to devices in spaces other than the first space, thereby enabling voice communication between the user in the first space and users in those other spaces.
Optionally, the audio processing method may further include: receiving, from the server, a plurality of voice signals collected by a plurality of devices in another space different from the first space; and performing synchronization processing on the plurality of voice signals and playing the synchronized voice signals.
Optionally, performing synchronization processing on the plurality of voice signals may include: performing, at preset time intervals, synchronization processing on the plurality of voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal, where the target water level of a jitter buffer represents a buffered amount of frame data.
Optionally, the target water level of the jitter buffer may be determined based on the transmission delay of the plurality of voice signals in the current network and the network jitter.
Optionally, performing synchronization processing on the plurality of voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal may include: inserting blank frames in front of each voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the plurality of voice signals are aligned; and updating the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames.
According to a second aspect of the embodiments of the present disclosure, there is provided a first device, the first device being one of a plurality of devices in a first space. The first device may include: a collection module configured to collect a voice signal uttered by a user in the first space; and a transmitting module configured to transmit information including the voice signal to a server, where the information further includes a space identification of the first space, so that the server, based on the space identification, transmits the voice signal to devices in spaces other than the first space, thereby enabling voice communication between the user in the first space and users in those other spaces.
Optionally, the first device may further include: a receiving module configured to receive, from the server, a plurality of voice signals collected by a plurality of devices in another space different from the first space; and a synchronization module configured to perform synchronization processing on the plurality of voice signals and play the synchronized voice signals.
Optionally, the synchronization module may be configured to: perform, at preset time intervals, synchronization processing on the plurality of voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal, where the target water level of a jitter buffer represents a buffered amount of frame data.
Optionally, the target water level of the jitter buffer may be determined based on the transmission delay of the plurality of voice signals in the current network and the network jitter.
Optionally, the synchronization module may be configured to: insert blank frames in front of each voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the plurality of voice signals are aligned; and update the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames.
According to a third aspect of the embodiments of the present disclosure, there is provided an audio processing method applied to a server, including: receiving pieces of information from a plurality of first devices in a first space, respectively, where each piece of information includes a voice signal collected by the corresponding first device and a space identification of the first space; and transmitting, based on the space identification, the voice signals included in the pieces of information to at least one second device in a second space different from the first space, thereby enabling voice communication between the user in the first space and the user in the second space.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a server, which may include: a receiving module configured to receive pieces of information from a plurality of first devices in a first space, respectively, where each piece of information includes a voice signal collected by the corresponding first device and a space identification of the first space; and a transmitting module configured to transmit, based on the space identification, the voice signals included in the pieces of information to at least one second device in a second space different from the first space, thereby enabling voice communication between the user in the first space and the user in the second space.
According to a fifth aspect of the embodiments of the present disclosure, there is provided an audio processing method, which may include: collecting, by a plurality of first devices in a first space, voice signals uttered by users in the first space; sending, by the plurality of first devices, pieces of information to a server, respectively, where each piece of information includes the voice signal collected by the corresponding first device and a space identification of the first space; transmitting, by the server and based on the space identification, the voice signals included in the pieces of information to at least one second device in a second space different from the first space; and playing, by the at least one second device, the voice signals, thereby enabling voice communication between the users in the first space and the user in the second space.
Optionally, playing the voice signals by the at least one second device may include: performing synchronization processing on the voice signals and playing the synchronized voice signals.
Optionally, performing synchronization processing on the voice signals may include: performing, at preset time intervals, synchronization processing on the voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal, where the target water level of a jitter buffer represents a buffered amount of frame data.
Optionally, the target water level of the jitter buffer may be determined based on the transmission delay of the voice signals in the current network and the network jitter.
Optionally, performing synchronization processing on the voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal may include: inserting blank frames in front of each data stream whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the voice signals are aligned; and updating the target water level of the jitter buffer of each data stream into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames.
Optionally, the audio processing method may further include: receiving, by the plurality of first devices and from the server, voice signals collected by the at least one second device; and performing, by each of the plurality of first devices, synchronization processing on the voice signals collected by the at least one second device and playing the synchronized voice signals.
According to a sixth aspect of the embodiments of the present disclosure, there is provided an audio processing system, which may include: a plurality of first devices located in a first space, each configured to collect a voice signal uttered by a user in the first space and to transmit information including the voice signal, where the information further includes a space identification of the first space; at least one second device located in a second space different from the first space; and a server configured to transmit, based on the space identification, the voice signals collected by the plurality of first devices to the at least one second device, thereby enabling voice communication between the user in the first space and the user in the second space.
Optionally, the at least one second device may be configured to: perform synchronization processing on the voice signals and play the synchronized voice signals.
Optionally, the at least one second device may be configured to: perform, at preset time intervals, synchronization processing on the voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal, where the target water level of a jitter buffer represents a buffered amount of frame data.
Optionally, the target water level of the jitter buffer may be determined based on the transmission delay of the voice signals in the current network and the network jitter.
Optionally, the at least one second device may be configured to: insert blank frames in front of each voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the voice signals are aligned; and update the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames.
Optionally, each of the plurality of first devices may be configured to: receive, from the server, the voice signals collected by the at least one second device, perform synchronization processing on those voice signals, and play the synchronized voice signals.
According to a seventh aspect of the embodiments of the present disclosure, there is provided an electronic device, which may include: at least one processor; and at least one memory storing computer-executable instructions, where the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio processing method described above.
According to an eighth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the audio processing method as described above.
According to a ninth aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the audio processing method as described above.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
It is ensured that multiple devices in the same room collect and play voice signals normally; that is, every device in the room sends its voice signal outward while simultaneously playing all voice signals that do not originate in that room. This avoids the howling phenomenon and yields a balanced playback effect. In addition, by performing a synchronization operation on the multiple voice signals related to the same speech, taking into account the timestamp of the currently played frame and the target water level of the jitter buffer, the user's listening experience is improved. Furthermore, the audio processing system provided by the present disclosure removes the dependence on a professional conference microphone that existing video conferences require for a good listening experience, thereby reducing equipment cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram of an application environment for video conferencing according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of an audio processing method according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a synchronization processing method according to an embodiment of the present disclosure;
Fig. 4 is a flowchart of an audio processing method for a first device according to an embodiment of the present disclosure;
Fig. 5 is a flowchart of an audio processing method for a server according to an embodiment of the present disclosure;
Fig. 6 is a block diagram of an audio processing system according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of a first device according to an embodiment of the present disclosure;
Fig. 8 is a block diagram of a server according to an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of an audio processing device according to an embodiment of the present disclosure;
Fig. 10 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Existing video conferencing schemes cannot predict whether the signal being enhanced is speech, improve the signal-to-noise ratio only to a limited extent, and cannot effectively improve voice quality in scenarios where noise arrives from the same direction as the speech or where the noise is scattered. To address these problems, the present disclosure devises a video conferencing system.
Hereinafter, according to various embodiments of the present disclosure, a method, an apparatus, and a system of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a diagram of an application environment for video conferencing, according to an embodiment of the present disclosure.
Referring to fig. 1, there are four devices in room A, one device in room B, and one device in room C. All devices access the audio conference via, for example, TCP/IP, and every device in rooms A, B, and C may be in headphone mode or in playback mode.
The devices in each room may be the terminals of the users, for example, the terminal each user employs when joining a video conference. A terminal may be at least one of a smartphone, a tablet, a laptop computer, a desktop computer, and the like, and may have a target application installed, for example, application software for video conferencing. Those skilled in the art will appreciate that the numbers of rooms and of terminals per room shown in fig. 1 are merely exemplary; the embodiments of the present disclosure place no limitation on the number of rooms, the number of terminals, or the types of devices.
The devices of each room may be communicatively connected via TCP/IP to a server, such as a multipoint control unit (MCU), so that real-time communication is possible between devices in the same room as well as between devices in different rooms. The MCU may receive data streams (including, for example, speech uttered by a user) from the devices participating in the video conference and forward the received data streams to them in real time. In this way, a user participating in the conference obtains the corresponding conference information, such as the other participants' voices, in real time.
As shown in fig. 1, before entering the video conference, all participants in room A use their own devices to select the same room to join. In this way, the room ID number of each participant in room A is recorded. The microphone of each device in room A may capture audio (such as user speech), and each device transmits the captured audio to the MCU. The data stream sent to the MCU may contain the room ID number of the room in which the device is located. Based on the room ID numbers, the MCU may decide whether to transmit each data stream to each device of each room participating in the video conference: if the room ID number of the device that sent a data stream matches the room ID number of a device that would receive it, the MCU does not send that data stream to the receiving device. This avoids the following problems: participants in the same room can already hear one another directly, so transmitting that speech again would cause it to be heard twice (once through the air and once again over the network); and if the speech of participants in a room were played by other devices in the same room and picked up again by them, a positive feedback loop would form and cause howling, seriously degrading the conference experience.
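As a non-limiting sketch of the forwarding rule just described, the following Python fragment routes a data stream by comparing room IDs (the names DataStream, forward_targets, and the device map are hypothetical illustrations; the patent does not prescribe an implementation):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DataStream:
    sender_id: str
    room_id: str       # space identification carried with the voice signal
    audio_frame: bytes

def forward_targets(stream: DataStream, devices: Dict[str, str]) -> List[str]:
    """Return the IDs of devices that should receive this stream.

    `devices` maps device_id -> room_id for every conference participant.
    """
    return [device_id
            for device_id, room_id in devices.items()
            if room_id != stream.room_id]  # never send back into the source room

# Example mirroring fig. 1: four devices in room A, one in B, one in C.
devices = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B", "c1": "C"}
stream = DataStream("a1", "A", b"\x00" * 320)
assert forward_targets(stream, devices) == ["b1", "c1"]
```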
In addition, participants in a remote room will hear multiple copies of the voices from any one room: for example, a participant in room B hears the voice signal collected by every device in room A. These copies arrive with different delays and are superimposed. If the mutual delay is relatively small, for example less than 50 ms, the effect on the listening experience is limited; but if it exceeds, for example, 100 ms, or even 1 s, the listener clearly hears an echo in which multiple copies of the sound overlap, like shouting in a valley. This makes for a poor listening experience, so the multiple voice signals from the same room must be synchronized/aligned so that they are played at the same time.
As an example, the device in each room may perform the alignment taking into account the timestamp of the current frame played by each voice signal related to the same speech and the target water level of the jitter buffer of each voice signal. How the synchronization processing is performed on the voice signals related to the same speech is described in detail below with reference to fig. 3.
According to an embodiment of the present disclosure, to ensure that multiple devices in the same room collect and play normally, each device in the room sends its data stream outward (for example, to a server), and the data streams of all rooms other than that room are played simultaneously. For the synchronization of the voice signals from the same room, the timestamps and the target water levels of the jitter buffers are considered, so that the sound collection and playback effects of each participant's device are more balanced for all participants, the voice quality is effectively improved, and the user experience is enhanced.
Fig. 2 is a flowchart of an audio processing method according to an embodiment of the present disclosure. The audio processing method described in fig. 2 may be applied to a video conferencing system, such as the overall system described in fig. 1.
Referring to fig. 2, in step S201, voice signals uttered by users in a first space are collected by a plurality of first devices in the first space, and information including the voice signals is transmitted to a server, where the information may additionally include a space identification of the first space. In the present disclosure, a voice signal captured by a device together with the corresponding space identification may be regarded as the data stream transmitted by that device.
For example, when a participant in the first space speaks, the multiple devices in the first space each collect the participant's speech and each send the collected data to a server, such as the MCU of fig. 1. According to an embodiment of the present disclosure, when a device transmits the corresponding voice signal, the space identification (such as a room ID number) of the space in which the device is located is transmitted along with it.
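A minimal sketch of what such an uplink data stream might look like is given below (the framing and field names are assumptions for illustration; the patent only requires that the voice signal and the space identification travel together):

```python
import json

def make_uplink_packet(room_id: str, device_id: str, pcm_frame: bytes) -> bytes:
    """Bundle one audio frame with the space identification (room ID)."""
    header = json.dumps({"room_id": room_id, "device_id": device_id}).encode()
    # 2-byte big-endian header length, then the header, then the raw audio frame
    return len(header).to_bytes(2, "big") + header + pcm_frame

# 10 ms of 16 kHz / 16-bit mono PCM is 320 bytes
packet = make_uplink_packet(room_id="A", device_id="a1", pcm_frame=b"\x00" * 320)
```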
In step S202, based on the space identification in the received information, the plurality of voice signals collected by the plurality of devices in the first space are transmitted by the server to at least one second device in a second space different from the first space.
For example, when the MCU determines whether to transmit a downstream code stream to each device in each room, it may decide according to the room ID number: if the room number of the sending end that produced the code stream matches the room number of the receiving end, the code stream is not sent to that receiving end.
In step S203, a plurality of voice signals collected by a plurality of devices in the first space are played by at least one second device in the second space.
As an example, each of the at least one second device performs synchronization processing on the received voice signals and plays the synchronized voice signals. For example, the second device may perform, at preset time intervals, synchronization processing on the voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal. The target water level of a jitter buffer represents a buffered amount of frame data.
The second device may determine the current target water level of a jitter buffer based on the transmission delay of the voice signals in the current network and the network jitter.
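One plausible way to derive that target is sketched below purely as an assumption (the patent states only that transmission delay and network jitter are considered, not how they are combined):

```python
def target_water_level_ms(transit_delay_ms: float, jitter_ms: float,
                          jitter_margin: float = 2.0) -> float:
    """Buffer enough audio to cover the typical delay plus a jitter margin."""
    return transit_delay_ms + jitter_margin * jitter_ms

print(target_water_level_ms(30.0, 15.0))  # 60.0 ms
```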
The second device may insert blank frames in front of each voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the voice signals are aligned, and then update the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames. How the synchronization operation is performed based on the timestamp of the currently played frame and the jitter buffer is described in detail below with reference to fig. 3.
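A compact sketch of this alignment step follows (StreamState and align_streams are hypothetical names; durations are treated in milliseconds, and leads are measured against the slowest stream):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StreamState:
    current_ts_ms: int          # timestamp of the frame currently being played
    target_water_level_ms: int  # jitter-buffer target water level

def align_streams(streams: List[StreamState], threshold_ms: int = 0) -> None:
    """Delay leading streams with blank frames and raise their buffer targets."""
    slowest_ts = min(s.current_ts_ms for s in streams)
    for s in streams:
        lead_ms = s.current_ts_ms - slowest_ts
        if lead_ms > threshold_ms:
            # Inserting lead_ms of blank frames re-aligns this stream with the
            # slowest one; raising the target keeps normal jitter-buffer control
            # from accelerating playback and undoing the alignment.
            s.target_water_level_ms += lead_ms
            s.current_ts_ms = slowest_ts
```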
Fig. 3 is a schematic diagram of a synchronization processing method according to an embodiment of the present disclosure. It shows the synchronization operation performed when the voice signals of the four devices in room A are played by the device in room B.
Referring to fig. 3, the voice signals (1), (2), (3), and (4) differ in the timestamp of the frame to which each has been played. To align them, blank frames may be inserted in front of voice signals (2) and (3), whose timestamps exceed a preset threshold, while voice signals (1) and (4), whose timestamps lag behind, remain unchanged. For example, a 40 ms blank frame may be padded in front of voice signal (2) and a 60 ms blank frame in front of voice signal (3), so that the current-frame timestamps of voice signals (1), (2), (3), and (4) become aligned. Here, the preset threshold may be set differently according to design requirements.
At the same time, the target water level of the jitter buffer of each voice signal is adjusted so that the device in room B does not play the delayed voice signals in an accelerated manner, which would introduce new misalignment. For example, the target water level of voice signal (2) may be updated from the current 40 ms to a new target of 80 ms, the target water level of voice signal (3) from the current 60 ms to a new target of 120 ms, and the target water levels of voice signals (1) and (4) kept unchanged. The example shown in fig. 3 is merely illustrative, and the present disclosure is not limited thereto.
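Using the alignment sketch above, the fig. 3 numbers play out as follows (the absolute timestamps are invented for illustration; only the 40 ms and 60 ms leads matter):

```python
streams = [
    StreamState(current_ts_ms=1000, target_water_level_ms=40),  # signal (1)
    StreamState(current_ts_ms=1040, target_water_level_ms=40),  # signal (2)
    StreamState(current_ts_ms=1060, target_water_level_ms=60),  # signal (3)
    StreamState(current_ts_ms=1000, target_water_level_ms=60),  # signal (4)
]
align_streams(streams)
# Signals (2) and (3) receive 40 ms and 60 ms of blank frames; their targets
# become 80 ms and 120 ms, while signals (1) and (4) are left unchanged.
assert [s.target_water_level_ms for s in streams] == [40, 80, 120, 60]
```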
In the present disclosure, the entire synchronization/alignment operation may be performed at predetermined intervals, for example every few seconds. Over a short period, the transmission and playback of a voice signal are relatively stable, but for a long video conference the alignment operation must be repeated at regular intervals to keep the voice signals aligned.
Fig. 4 is a flowchart of an audio processing method for a first device according to an embodiment of the present disclosure. The first device may belong to one of a plurality of devices in the first space. The method shown in fig. 4 may be implemented by a terminal carried by a participant participating in a video conference.
Referring to fig. 4, in step S401, a voice signal uttered by a user in a first space is collected. When a participant in the first space speaks, the first device may capture a corresponding voice.
In step S402, information including the collected voice signal is transmitted to the server, where the information may further include the space identification of the first space, so that the server can determine, based on the space identification, to transmit the voice signal to devices in spaces other than the first space.
In step S403, a plurality of voice signals related to other audio, captured by a plurality of devices in another space different from the first space, are received from the server. For example, the other audio may be speech uttered by a user in that other space and captured by the plurality of devices there.
In step S404, synchronization processing is performed on the plurality of voice signals received from the server, and the synchronized voice signals are played. For example, during the video conference the first device may perform, at preset time intervals, synchronization processing on the voice signals (such as the voice signals respectively collected by the devices in room A) based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal. The first device inserts blank frames in front of each voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the voice signals are aligned, and updates the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames. The current target water level may be determined from the transmission delay and the network jitter of the voice signal in the current network.
The steps shown in fig. 4 are merely exemplary, and the present disclosure is not limited to the above-described order of steps.
Fig. 5 is a flowchart of an audio processing method for a server according to an embodiment of the present disclosure. The method illustrated in fig. 5 may be implemented by a server of a video conferencing system.
Referring to fig. 5, in step S501, a plurality of pieces of information are respectively received from a plurality of first devices in a first space, wherein each piece of information may include a voice signal collected by the first device and a spatial identification of the first space.
In step S502, based on the received space identification, the voice signals included in the pieces of information are transmitted to at least one second device in a second space different from the first space, to enable voice communication between the user in the first space and the user in the second space.
The method shown in fig. 5 avoids the following problems: people in the same room can already hear one another directly, so retransmitting that sound would cause it to be heard twice (once through the air and once again over the network); and if the speech of people in a room were played by other devices in the same room and then picked up again by those devices, positive feedback would form and cause howling, seriously degrading the conference experience.
Fig. 6 is a block diagram of an audio processing system according to an embodiment of the present disclosure.
Referring to fig. 6, an audio processing system 600 includes a plurality of first devices in a first space, at least one second device in a second space, and a server.
The plurality of first devices in the first space may each collect a voice signal uttered by a user in the first space and send information including the voice signal, where the information may also include the space identification of the first space.
Based on the received space identification, the server may send the voice signals transmitted by the plurality of first devices in the first space to at least one second device in a second space different from the first space.
The at least one second device may perform synchronization processing on the voice signals transmitted by the plurality of first devices in the first space and play the synchronized voice signals.
The at least one second device may perform, at preset time intervals, synchronization processing on the received voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal, where the target water level of a jitter buffer represents a buffered amount of frame data.
The target water level of the jitter buffer may be determined based on the transmission delay of the voice signals in the current network and the network jitter.
The at least one second device may insert blank frames in front of each received voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the voice signals are aligned, and then update the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames.
Further, each of the plurality of first devices in the first space may receive, from the server, a voice signal related to other audio (such as speech uttered by a user in the second space) captured by the at least one second device, perform synchronization processing on that voice signal, and play the synchronized voice signal.
The number of spaces and the number of devices in the system shown in fig. 6 are merely exemplary, and the present disclosure is not limited thereto.
Fig. 7 is a block diagram of a first device according to an embodiment of the present disclosure. The first device belongs to one of a plurality of devices in a first space. The first device shown in fig. 7 may be used for video conferencing as part of the electronic device or as a separate device.
Referring to fig. 7, the first device 700 may include an acquisition module 701, a transmission module 702, a reception module 703, and a synchronization module 704. Each module in the first device 700 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in the first device 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The collection module 701 may collect a voice signal uttered by a user in the first space.
The transmitting module 702 may transmit information including the voice signal to the server. The information may also include the space identification of the first space, so that the server, based on the space identification, transmits the voice signal to devices in spaces other than the first space to enable voice communication between the user in the first space and users in those other spaces.
The receiving module 703 may receive, from the server, a plurality of voice signals collected by a plurality of devices in another space different from the first space.
The synchronization module 704 may perform a synchronization process on the plurality of voice signals and play the plurality of voice signals after the synchronization process.
The synchronization module 704 may perform, at preset time intervals, synchronization processing on the voice signals based on at least one of the timestamp of the current frame to which each voice signal has been played and the target water level of the jitter buffer of each voice signal, where the target water level of a jitter buffer represents a buffered amount of frame data.
The target water level of the jitter buffer may be determined based on the transmission delay of the voice signals in the current network and the network jitter.
The synchronization module 704 may insert blank frames in front of each voice signal whose current-frame timestamp exceeds a preset threshold, so that the current-frame timestamps of the voice signals are aligned, and update the target water level of the jitter buffer of each voice signal into which blank frames were inserted to the sum of the current target water level and the duration of the inserted blank frames.
Fig. 8 is a block diagram of a server according to an embodiment of the present disclosure.
Referring to fig. 8, a server 800 may include a transmitting module 801 and a receiving module 802. Each module in the server 800 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in server 800 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The receiving module 802 may receive pieces of information from a plurality of first devices in a first space, respectively, where each piece of information may include the voice signal collected by the corresponding first device and the space identification of the first space.
The transmitting module 801 may transmit, based on the space identification, the voice signals included in the pieces of information to at least one second device in a second space different from the first space, to enable voice communication between the user in the first space and the user in the second space.
Fig. 9 is a schematic structural diagram of an audio processing device of a hardware operating environment according to an embodiment of the present disclosure.
As shown in fig. 9, the audio processing device 900 may include: a processing component 901, a communication bus 902, a network interface 903, an input/output interface 904, a memory 905, and a power component 906. The communication bus 902 enables connection and communication among these components. The input/output interface 904 may include a video display (such as a liquid crystal display), a microphone and speakers, and a user interaction interface (such as a keyboard, mouse, or touch input device); optionally, it may also include standard wired and wireless interfaces. The network interface 903 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 905 may be a high-speed random access memory or a stable non-volatile memory; optionally, it may be a storage device separate from the processing component 901 described above.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of the audio processing device 900, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
As shown in fig. 9, the memory 905, as one type of storage medium, may include an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.
In the audio processing device 900 shown in fig. 9, the network interface 903 is mainly used for data communication with external electronic devices/terminals, and the input/output interface 904 is mainly used for data interaction with the user. The processing component 901 and the memory 905 may be provided in the audio processing device 900, and the audio processing device 900 executes the audio processing method provided by the embodiments of the present disclosure by the processing component 901 calling the audio processing program stored in the memory 905 and the various APIs provided by the operating system.
The processing component 901 may include at least one processor, and the memory 905 has stored therein a set of computer-executable instructions that, when executed by the at least one processor, perform an audio processing method according to an embodiment of the disclosure. Further, the processing component 901 may perform encoding operations and decoding operations, and the like. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
Taking the audio processing device 900 as a terminal, for example, the audio processing device 900 may collect a voice signal uttered by a user and send information related to the collected voice to the server, where the information further includes the space identification of the space in which the device is located, so that the server, based on the space identification, sends the voice signal to devices in other spaces. Further, the audio processing device 900 may receive, from the server, a plurality of voice signals related to other speech, collected by a plurality of devices in a space different from that of the audio processing device 900, perform synchronization processing on those voice signals, and play the synchronized voice signals.
Taking the audio processing apparatus 900 as an example of a server, the audio processing apparatus 900 can determine whether to transmit a voice signal to each terminal according to a spatial identification (such as a room ID number) included in the received information.
Acting as either a terminal or a server, the audio processing device 900 may perform the corresponding audio processing method. The audio processing device 900 may receive data from an external electronic device through the input/output interface 904 and may transmit/forward data.
By way of example, the audio processing device 900 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The audio processing device 900 need not be a single electronic device; it can be any collection of devices or circuits capable of executing the above instructions (or instruction sets), individually or jointly. The audio processing device 900 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device interfacing locally or remotely (e.g., via wireless transmission).
In the audio processing device 900, the processing component 901 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, processing component 901 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 901 may execute instructions or code stored in a memory, where the memory 905 may also store data. Instructions and data may also be sent and received over a network via the network interface 903, where the network interface 903 may employ any known transmission protocol.
The memory 905 may be integrated with the processing component 901, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the memory 905 may comprise a standalone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 905 and the processing component 901 may be operatively coupled, or may communicate with each other, for example, through I/O ports, network connections, etc., so that the processing component 901 can read data stored in the memory 905.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 10 is a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device 1000 may include at least one memory 1002 and at least one processor 1001, the at least one memory 1002 storing a set of computer-executable instructions that, when executed by the at least one processor 1001, perform an audio processing method according to an embodiment of the disclosure. For example, the electronic device 1000 may serve as a terminal or as a server.
The processor 1001 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 1001 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The memory 1002, which is one type of storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing method program, and a database.
The memory 1002 may be integrated with the processor 1001, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 1002 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1002 and the processor 1001 may be operatively coupled, or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 1001 can read files stored in the memory 1002.
Further, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, or touch input device). All components of the electronic device 1000 may be connected to each other via a bus and/or a network.
Those skilled in the art will appreciate that the configuration shown in fig. 10 is not intended to be limiting and may include more or fewer components than those shown, combine some components, or arrange the components differently.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform an audio processing method according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment, such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures may be distributed across networked computer systems so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product, in which instructions are executable by a processor of a computer device to perform the above-mentioned audio processing method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An audio processing method applied to a first device, the first device being one of a plurality of devices in a first space, the audio processing method comprising:
collecting a voice signal uttered by a user in the first space;
sending information including the voice signal to a server,
wherein the information further comprises a space identification of the first space, such that the server, based on the space identification, sends the voice signal to devices in spaces other than the first space to enable voice communication between the user in the first space and users in those other spaces.
2. The audio processing method of claim 1, further comprising:
receiving, from the server, a plurality of voice signals collected by a plurality of devices in another space different from the first space;
performing synchronization processing on the plurality of voice signals and playing the synchronized voice signals.
3. A first device that is one of a plurality of devices in a first space, the first device comprising:
a collection module configured to collect voice signals uttered by a user in the first space;
a transmitting module configured to transmit information including the voice signal to a server,
wherein the information further comprises a space identification of the first space, such that the server, based on the space identification, transmits the voice signal to a device in a space different from that indicated by the space identification, to enable voice communication between the user in the first space and the user in the different space.
4. An audio processing method applied to a server, the audio processing method comprising:
receiving a plurality of pieces of information from a plurality of first devices in a first space, respectively, wherein each piece of information includes a voice signal collected by the corresponding first device and a space identification of the first space;
transmitting, based on the space identification, a plurality of voice signals included in the plurality of pieces of information to at least one second device in a second space different from that indicated by the space identification, to enable voice communication between the user in the first space and the user in the second space.
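A minimal sketch of the routing in claim 4: the server keeps a registry from space identification to delivery callbacks and forwards each incoming voice signal only to devices registered under a different space identification (forwarding a signal back into its own space would recreate the howling feedback loop described in the background). The register/on_message interface below is illustrative only.

    from collections import defaultdict
    from typing import Callable, Dict, List

    class RoutingServer:
        def __init__(self) -> None:
            # Maps a space identification to the delivery callbacks of
            # the devices currently registered in that space.
            self._devices: Dict[str, List[Callable[[bytes], None]]] = defaultdict(list)

        def register(self, space_id: str, deliver: Callable[[bytes], None]) -> None:
            self._devices[space_id].append(deliver)

        def on_message(self, space_id: str, voice_signal: bytes) -> None:
            # Forward only to spaces whose identification differs from
            # the one carried in the received information.
            for other_space, sinks in self._devices.items():
                if other_space != space_id:
                    for deliver in sinks:
                        deliver(voice_signal)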
5. A server, comprising:
a receiving module configured to receive a plurality of pieces of information from a plurality of first devices in a first space, respectively, wherein each piece of information includes a voice signal collected by the corresponding first device and a space identification of the first space;
a transmitting module configured to transmit, based on the space identification, a plurality of voice signals included in the plurality of pieces of information to at least one second device in a second space different from that indicated by the space identification, to enable voice communication between the user in the first space and the user in the second space.
6. An audio processing method, comprising:
collecting, by a plurality of first devices in a first space, voice signals uttered by users in the first space, respectively;
sending, by the plurality of first devices, a plurality of pieces of information to a server, respectively, wherein each piece of information comprises the voice signal collected by the corresponding first device and a space identification of the first space;
transmitting, by the server, based on the space identification, a plurality of voice signals included in the plurality of pieces of information to at least one second device in a second space different from that indicated by the space identification;
playing, by the at least one second device, the plurality of voice signals to enable voice communication between the user in the first space and the user in the second space.
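Reusing the hypothetical RoutingServer sketched above, a toy end-to-end run of the flow in claim 6 might look like this: two first devices in one space each send a frame, and only the second device in the other space receives them.

    server = RoutingServer()
    played = []  # frames the second device in room-B would play
    server.register("room-B", played.append)
    server.register("room-A", lambda frame: None)  # a first device in room-A

    # Two first devices in room-A each send a collected voice signal.
    server.on_message("room-A", b"frame-from-device-1")
    server.on_message("room-A", b"frame-from-device-2")

    # Only the room-B device receives both signals; nothing is echoed
    # back into room-A, so no howling feedback loop can form.
    assert played == [b"frame-from-device-1", b"frame-from-device-2"]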
7. An audio processing system comprising:
a plurality of first devices located in a first space, each configured to collect a voice signal uttered by a user in the first space and to transmit information including the voice signal, wherein the information further includes a space identification of the first space;
at least one second device located in a second space different from the first space; and
a server configured to transmit, based on the space identification, a plurality of voice signals collected by the plurality of first devices to the at least one second device, the second device being located in a space different from that indicated by the space identification, to enable voice communication between the user in the first space and the user in the second space.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio processing method of any one of claims 1, 2, 4, or 6.
9. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the audio processing method of any one of claims 1, 2, 4, or 6.
10. A computer program product comprising instructions that, when executed by at least one processor of an electronic device, cause the at least one processor to perform the audio processing method of any one of claims 1, 2, 4, or 6.
CN202111553969.8A 2021-12-17 2021-12-17 Audio processing method, device, system and storage medium Pending CN114267365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111553969.8A CN114267365A (en) 2021-12-17 2021-12-17 Audio processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN114267365A (en) 2022-04-01

Family

ID=80827783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111553969.8A Pending CN114267365A (en) 2021-12-17 2021-12-17 Audio processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114267365A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination