CN117896469A - Audio sharing method, device, computer equipment and storage medium

Audio sharing method, device, computer equipment and storage medium

Info

Publication number
CN117896469A
Authority
CN
China
Prior art keywords
audio
data
shared
audio data
local
Prior art date
Legal status
Pending
Application number
CN202410298526.6A
Other languages
Chinese (zh)
Inventor
高毅
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410298526.6A
Publication of CN117896469A


Abstract

The application relates to an audio sharing method, an audio sharing device, a computer device, a storage medium, and a computer program product, and involves artificial intelligence techniques. The method includes: during a voice call, determining audio data to be shared of a first terminal, where the audio data to be shared and far-end call audio data sent to the first terminal by a second terminal are played at the local end of the first terminal through different local playback devices; acquiring local audio data collected at the local end of the first terminal, the local audio data including shared audio echo data and near-end call audio data; performing delay alignment on the audio data to be shared based on the local audio data, to obtain delayed audio data aligned with the shared audio echo data; and mixing the local audio data with the delayed audio data, and sending the mixed audio data to the second terminal for playback over the voice call. By adopting the method, the echo impact on the shared audio can be reduced and the quality of the shared audio improved.

Description

Audio sharing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to an audio sharing method, an audio sharing apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of computer technology, users' needs for sharing during voice calls keep growing. For example, during a call a user may want the other party to see the screen of the local terminal and hear the sound it is playing, which requires sharing the audio played at the local terminal with the call counterpart.
However, at present, during a voice call the counterpart's voice and the audio to be shared may be played through different playback devices at the local end, and when the locally played audio is shared with the counterpart it is easily picked up again by the microphone and forms an echo, degrading the quality of the audio shared with the call counterpart.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio sharing method, apparatus, computer device, computer readable storage medium, and computer program product that can reduce the echo effect of the shared audio and improve the quality of the shared audio.
In a first aspect, the present application provides an audio sharing method. The method comprises the following steps:
during a voice call between a first terminal and a second terminal, determining audio data to be shared of the first terminal, where the audio data to be shared and far-end call audio data sent to the first terminal by the second terminal are played at the local end of the first terminal through different local playback devices associated with the first terminal;
acquiring local audio data collected at the local end of the first terminal, where the local audio data includes shared audio echo data of the audio data to be shared and near-end call audio data produced in the voice call in response to the far-end call audio data;
performing delay alignment on the audio data to be shared based on the local audio data, to obtain delayed audio data aligned with the shared audio echo data in the time dimension; and
mixing the local audio data with the delayed audio data, and sending the mixed audio data obtained by the mixing to the second terminal for playback over the voice call.
In a second aspect, the application further provides an audio sharing device. The device comprises:
a to-be-shared audio determining module, configured to determine audio data to be shared of the first terminal during a voice call between the first terminal and a second terminal, where the audio data to be shared and far-end call audio data sent to the first terminal by the second terminal are played at the local end of the first terminal through different local playback devices associated with the first terminal;
a local audio acquisition module, configured to acquire local audio data collected at the local end of the first terminal, where the local audio data includes shared audio echo data of the audio data to be shared and near-end call audio data produced in the voice call in response to the far-end call audio data;
a delay alignment module, configured to perform delay alignment on the audio data to be shared based on the local audio data, to obtain delayed audio data aligned with the shared audio echo data in the time dimension; and
an audio mixing module, configured to mix the local audio data with the delayed audio data and send the mixed audio data obtained by the mixing to the second terminal for playback over the voice call.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, where the memory stores a computer program and the processor implements the steps of the above audio sharing method when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the above audio sharing method.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above audio sharing method.
According to the audio sharing method, device, computer equipment, storage medium, and computer program product, during a voice call between the first terminal and the second terminal, the audio data to be shared of the first terminal is determined; the audio data to be shared and the far-end call audio data sent to the first terminal by the second terminal are played at the local end of the first terminal through different local playback devices associated with the first terminal; delay alignment is performed on the audio data to be shared based on the local audio data collected at the local end of the first terminal; the delayed audio data aligned with the shared audio echo data in the time dimension is mixed with the local audio data; and the mixed audio data obtained by the mixing is sent to the second terminal for playback over the voice call, so that the audio data to be shared is shared with the second terminal through the voice call. When the far-end call audio data and the audio data to be shared are played at the local end of the first terminal through different local playback devices, delay-aligning the audio data to be shared based on the locally collected audio data and mixing the resulting delayed audio data with the local audio data before sending to the second terminal reduces the echo impact of the shared audio echo data on the delayed audio data, ensures the quality of the mixed audio data sent to the second terminal, and improves the quality of the shared audio.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required by the embodiments or the related-art descriptions are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be derived from them by those skilled in the art without inventive effort.
FIG. 1 is an application environment diagram of an audio sharing method in one embodiment;
FIG. 2 is a flow chart of an audio sharing method in one embodiment;
FIG. 3 is a schematic diagram of an architecture for implementing audio sharing based on a call terminal in one embodiment;
FIG. 4 is a schematic diagram of an architecture for implementing audio sharing based on a non-telephony device in one embodiment;
FIG. 5 is a flow chart illustrating far-end speech echo cancellation in one embodiment;
FIG. 6 is a schematic diagram of an architecture including far-end speech echo cancellation in one embodiment;
FIG. 7 is a schematic diagram of a device selection interface in one embodiment;
FIG. 8 is a schematic block diagram of a far-end speech echo cancellation process for a call application in one embodiment;
FIG. 9 is a schematic block diagram of delay alignment processing for a call application in one embodiment;
FIG. 10 is a schematic block diagram of delay alignment processing in one embodiment;
FIG. 11 is a block diagram of an audio sharing device in one embodiment;
FIG. 12 is an internal structure diagram of a computer device in one embodiment;
FIG. 13 is an internal structure diagram of a computer device in another embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology are Automatic Speech Recognition (ASR), Text-To-Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is becoming one of the most promising modes of human-computer interaction. Large model technology has brought reform to the development of speech technology: pre-training models using the Transformer architecture, such as WavLM and UniSpeech, have strong generalization and universality and can excellently complete speech processing tasks in all directions.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. Pre-training models are the latest development of deep learning and integrate these techniques.
With the research and advancement of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, drones, digital twins, virtual humans, robots, AI-Generated Content (AIGC), conversational interaction, smart medical care, smart customer service, and game AI. It is believed that with the development of technology, artificial intelligence will be applied in more fields and show increasing value.
The solution provided by the embodiments of the present application relates to artificial intelligence technologies such as speech technology and machine learning, and in particular may perform processing such as delay alignment and mixing on audio data based on speech technology and machine learning, as described in the following embodiments.
The audio sharing method provided by the embodiments of the present application can be applied to the application environment shown in FIG. 1. The first terminal 102 and the second terminal 104 conduct a voice call through a network, and the first terminal 102 may also communicate with the server 106 through the network. A data storage system may store the data that the server 106 needs to process; it may be provided separately, integrated with the server 106, or located on a cloud or other server.
The first terminal 102 and the second terminal 104 may conduct a voice call, for example through a call application. During the voice call between the first terminal 102 and the second terminal 104, the first terminal 102 may receive far-end call audio data sent by the second terminal 104 and play it at the local end through an associated first local playback device. Meanwhile, the first terminal 102 may play the audio data to be shared at the local end through an associated second local playback device; specifically, the audio data to be shared may be played through the second local playback device by an audio playback application installed on the first terminal 102. The first terminal 102 may collect audio data at the local end, for example through its microphone, and send it to the second terminal 104 for playback.
During the voice call between the first terminal 102 and the second terminal 104, the server 106 may determine the audio data to be shared of the first terminal 102. For the local audio data collected at the local end of the first terminal 102, the server 106 performs delay alignment on the audio data to be shared based on the local audio data to obtain delayed audio data aligned with the shared audio echo data in the time dimension, mixes the delayed audio data with the local audio data, and sends the mixed audio data to the second terminal 104 for playback over the voice call, so that the audio data to be shared is shared with the second terminal 104. In addition, in some embodiments, the audio sharing method may be implemented by the first terminal 102 alone; that is, the first terminal 102 directly performs delay alignment on the audio data to be shared based on the local audio data collected at its local end, mixes the delayed audio data aligned with the shared audio echo data in the time dimension with the local audio data, and then sends the result to the second terminal 104.
The first terminal 102 and the second terminal 104 may be, but are not limited to, desktop computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices. The Internet of Things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices, and the like. The portable wearable devices may be smart watches, smart bracelets, head-mounted devices, and the like. The server 106 may be implemented as a stand-alone server or as a server cluster.
In an exemplary embodiment, as shown in FIG. 2, an audio sharing method is provided. The method is performed by a computer device; specifically, it may be performed by a terminal or a server alone, or by the terminal and the server together. In the embodiments of the present application, the method is described using its application to the first terminal in FIG. 1 as an example, where the first terminal includes, but is not limited to, a mobile phone, a computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like. The method includes the following steps 202 to 208.
Step 202: during a voice call between the first terminal and a second terminal, determine audio data to be shared of the first terminal, where the audio data to be shared and the far-end call audio data sent to the first terminal by the second terminal are played at the local end of the first terminal through different local playback devices associated with the first terminal.
Voice calls may be conducted between different users through different terminals, for example based on a call application installed on each terminal. The call application may be a VoIP (Voice over Internet Protocol) application; each terminal may have a VoIP application installed, through which a user can make voice calls over a network protocol. The audio data to be shared is the audio data that the first terminal intends to share with the second terminal, and may specifically include local audio data of the first terminal or network audio data obtained from a network, for example various audio data such as recorded audio and music. The far-end call audio data is the call audio data transmitted from the second terminal to the first terminal during the voice call, and may be collected by the second terminal at its own local end. When the second terminal collects audio data, the user of the second terminal may be speaking, so the collected far-end call audio data may include voice data of the second terminal's user; further, if other stray sounds exist in the environment while the user of the second terminal speaks, the far-end call audio data collected by the second terminal may also include environmental noise data of the second terminal.
The audio data to be shared and the far-end call audio data are played through different local playback devices associated with the first terminal. A local playback device is a playback device at the local end of the first terminal, and may include various playback devices such as a loudspeaker, a speaker box, or an earphone. For example, the first terminal may be associated with a first local playback device and a second local playback device, so that during the voice call the far-end call audio data is played through the first local playback device while the audio data to be shared is played through the second local playback device. Playing the audio data to be shared and the far-end call audio data through different local playback devices allows the user to configure different output devices for different audio data, improving the playback effect. For example, during a voice call, if the far-end call audio data and the audio data to be shared were both played by the same local playback device of the first terminal, they would be superimposed, their volume would be adjusted uniformly by that single device, and the audio data to be shared could easily interfere with the user listening to the far-end call audio data. Playing the far-end call audio data through the earphone of the first terminal and the audio data to be shared through a Bluetooth speaker connected to the first terminal allows the volume and sound quality of each to be adjusted independently, reduces the noise superposition caused by playing different audio through a single playback device, and reduces the interference of the audio data to be shared with the far-end call audio data, thereby improving the playback effect.
Specifically, different users may conduct a voice call through the terminals they respectively hold. During the voice call between the first terminal and the second terminal, the first terminal may acquire the far-end call audio data sent by the second terminal. For example, while the first terminal and the second terminal conduct a voice call through a call application, the first terminal may receive, based on the call application, the far-end call audio data sent by the second terminal over the network. In some embodiments, there may be at least one second terminal; that is, the first terminal may conduct a voice call with one or more terminals simultaneously, and each terminal on the call with the first terminal may act as the second terminal for the audio sharing process. The first terminal may play the obtained far-end call audio data at the local end, specifically through the associated first local playback device.
The first terminal may determine the audio data to be shared. The audio data to be shared may be selected by a user on the first terminal, and may specifically be audio data played at the local end of the first terminal through a local playback device associated with the first terminal; that is, the user may select audio data played at the local end of the first terminal as the audio data to be shared. In a specific application, the user may play the audio data to be shared while conducting a voice call with the second terminal through the first terminal, and the far-end call audio data received during the voice call and the audio data to be shared may be played at the local end through different local playback devices.
In a specific implementation, the first terminal may be associated with multiple local playback devices and configure different audio outputs for different local playback devices. For example, a first terminal A may be associated with three local playback devices. The first terminal A may play video 1 and music 2 at the same time, and the audio of video 1 and of music 2 may be played by the same local playback device, for example a speaker built into the first terminal A. Further, the user may configure the output of the audio of video 1 and of music 2 separately, for example configuring the audio of video 1 to be played through the built-in speaker and music 2 to be played through a Bluetooth speaker connected to the first terminal A, so that the audio of video 1 and of music 2 is played at the local end through different local playback devices. In a specific application, the user may flexibly configure, in the terminal interface of the first terminal, the local playback devices for the far-end call audio data and the audio data to be shared. For example, the far-end call audio data may be played through the first local playback device associated with the first terminal, and the audio data to be shared may be played through the second local playback device associated with the first terminal. Since the first local playback device and the second local playback device are different devices, the far-end call audio data and the audio data to be shared can be played simultaneously at the local end of the first terminal through different local playback devices.
Step 204: acquire local audio data collected at the local end of the first terminal, where the local audio data includes shared audio echo data of the audio data to be shared and near-end call audio data produced in the voice call in response to the far-end call audio data.
During the voice call between the first terminal and the second terminal, the user at the local end of the first terminal needs to send audio data to the second terminal to carry on the voice call. The first terminal may perform audio collection at the local end to capture the local user's call voice. The local audio data is the audio data collected at the local end of the first terminal, for example through a microphone at the local end. Because the local end of the first terminal plays the audio data to be shared through the associated local playback device, when the first terminal collects audio at the local end it picks up, in addition to the voice of the first terminal's user, the audio data to be shared that the local playback device has emitted into the air, thereby forming shared audio echo data. The shared audio echo data is the audio that the microphone of the first terminal re-collects after the audio data to be shared, played by the local playback device, has propagated through the air. The near-end call audio data is the voice that the first terminal's user produces at the local end, specifically the voice uttered in response to the far-end call audio data during the voice call.
In an exemplary embodiment, the first terminal may acquire the local audio data collected at the local end. The local end of the first terminal may be provided with at least one sound pickup device, for example at least one microphone, so that audio is collected at the local end through the at least one microphone. The collected local audio data includes at least the shared audio echo data and the near-end call audio data.
Step 206: perform delay alignment on the audio data to be shared based on the local audio data, to obtain delayed audio data aligned with the shared audio echo data in the time dimension.
To share the audio data to be shared of the first terminal with the second terminal, one could simply mix the audio data to be shared with the near-end call audio data collected at the local end of the first terminal and send the result to the second terminal. However, when the local audio data collected by the first terminal also includes the shared audio echo data, the shared audio echo data lags behind the audio data to be shared in the time dimension, because it is re-collected only after the audio data to be shared has propagated through the air; that is, the shared audio echo data has a certain time delay relative to the audio data to be shared. If the local audio data and the audio data to be shared were directly mixed and sent to the second terminal, the audio data to be shared and the shared audio echo data in the audio received by the second terminal would form successive echoes, degrading the quality of the shared audio. Delay alignment therefore aligns the audio data to be shared with the local audio data in the time dimension, yielding delayed audio data that is aligned, in the time dimension, with the shared audio echo data in the local audio data, so that the mixed audio data sent to the second terminal suffers less echo impact.
The first terminal may delay-align the audio data to be shared in the time dimension based on the local audio data; specifically, it may delay the audio data to be shared by a certain duration to obtain the delayed audio data, which is aligned with the shared audio echo data in the time dimension. Thus, when the delayed audio data is mixed with the local audio data, it overlaps the shared audio echo data in the local audio data and the echo impact is reduced. In a specific implementation, the first terminal may perform delay detection on the local audio data and the audio data to be shared to determine the delay duration between the shared audio echo data in the local audio data and the audio data to be shared, and then delay the audio data to be shared in the time dimension by that duration to obtain the delayed audio data aligned with the shared audio echo data (a sketch of this shift follows).
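To make the alignment concrete, the following Python sketch simply prepends silence equal to the detected delay. It is a minimal illustration under assumed parameters (float samples, a 48 kHz sample rate, a 300 ms delay); the function name and the numeric values are illustrative, not part of the original disclosure.
```python
import numpy as np

def delay_align(shared_audio: np.ndarray, delay_ms: float, sample_rate: int = 48000) -> np.ndarray:
    """Delay the to-be-shared audio by delay_ms so it lines up, in the time
    dimension, with the shared-audio echo captured by the microphone."""
    delay_samples = int(round(delay_ms * sample_rate / 1000.0))
    # Prepend silence so every sample of the shared audio is shifted later in time.
    return np.concatenate([np.zeros(delay_samples, dtype=shared_audio.dtype), shared_audio])

# Example: a 300 ms delay at 48 kHz shifts the signal by 14400 samples.
shared = np.random.randn(48000).astype(np.float32)  # 1 s of placeholder audio
delayed = delay_align(shared, delay_ms=300.0)
```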
Step 208: mix the local audio data with the delayed audio data, and send the mixed audio data obtained by the mixing to the second terminal for playback over the voice call.
The mixed audio data is obtained by mixing the local audio data with the delayed audio data. Through mixing, the local audio data and the delayed audio data can be combined into a stereo track or a single track, thereby synthesizing the two into the mixed audio data.
Specifically, the first terminal may mix the local audio data with the delayed audio data to obtain the mixed audio data. The first terminal may send the mixed audio data to the second terminal, specifically over the voice call, and the second terminal may play the received mixed audio data. Because the mixed audio data includes the near-end call audio data responding to the far-end call audio data, together with the delayed audio data and the shared audio echo data aligned in the time dimension, the first terminal can share the locally played audio data with the second terminal through the voice call. A sketch of the mixing step follows.
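A minimal sketch of the mixing step, assuming both signals are float arrays on a common sample grid. Summing the signals and peak-normalizing to avoid clipping is a design choice of this sketch; the application does not prescribe a particular mixing formula.
```python
import numpy as np

def mix(local_audio: np.ndarray, delayed_audio: np.ndarray) -> np.ndarray:
    """Mix microphone audio with the delay-aligned shared audio into one track."""
    n = max(len(local_audio), len(delayed_audio))
    mixed = np.zeros(n, dtype=np.float32)
    mixed[:len(local_audio)] += local_audio
    mixed[:len(delayed_audio)] += delayed_audio
    # Scale down to keep the sum inside [-1, 1] and avoid clipping.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```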
In a specific application, as shown in FIG. 3, the first terminal may conduct a voice call with the second terminal: the second terminal sends far-end call audio data to the first terminal, and the first terminal passes the received far-end call audio data to the associated first local playback device for playback; meanwhile, the first terminal passes the audio data to be shared to the associated second local playback device for playback. The microphone associated with the first terminal collects audio at the local end of the first terminal; the collected local audio data includes the shared audio echo data of the audio data to be shared and the near-end call audio data responding to the far-end call audio data, and the microphone passes the collected local audio data to the first terminal. The first terminal may perform delay alignment on the audio data to be shared based on the local audio data to obtain delayed audio data aligned, in the time dimension, with the shared audio echo data in the local audio data. The first terminal may then mix the local audio data with the delayed audio data to obtain mixed audio data, and send the mixed audio data to the second terminal for playback.
In some embodiments, the audio sharing method may also be implemented by a computer device other than the first terminal, for example by another terminal or a server. In a specific application, as shown in FIG. 4, the first terminal may conduct a voice call with the second terminal: the second terminal sends far-end call audio data to the first terminal, and the first terminal passes the received far-end call audio data to the associated first local playback device for playback; meanwhile, the first terminal passes the audio data to be shared to the associated second local playback device for playback. The microphone associated with the first terminal collects audio at the local end of the first terminal and passes the collected local audio data to the first terminal. The computer device may obtain the audio data to be shared and the local audio data from the first terminal, and perform delay alignment on the audio data to be shared based on the local audio data to obtain delayed audio data aligned, in the time dimension, with the shared audio echo data in the local audio data. The computer device may mix the local audio data with the delayed audio data to obtain mixed audio data and send it to the first terminal, and the first terminal sends the mixed audio data to the second terminal for playback over the voice call.
In the above audio sharing method, during a voice call between the first terminal and the second terminal, the audio data to be shared of the first terminal is determined; the audio data to be shared and the far-end call audio data sent to the first terminal by the second terminal are played at the local end of the first terminal through different local playback devices associated with the first terminal; delay alignment is performed on the audio data to be shared based on the local audio data collected at the local end of the first terminal; the delayed audio data aligned with the shared audio echo data in the time dimension is mixed with the local audio data; and the mixed audio data is sent to the second terminal for playback over the voice call, so that the audio data to be shared is shared with the second terminal. When the far-end call audio data and the audio data to be shared are played at the local end of the first terminal through different local playback devices, delay-aligning the audio data to be shared based on the locally collected audio data and mixing the resulting delayed audio data with the local audio data before sending to the second terminal reduces the echo impact of the shared audio echo data on the delayed audio data, ensures the quality of the mixed audio data sent to the second terminal, improves the quality of the shared audio, and improves the user experience of audio sharing during a voice call.
In an exemplary embodiment, performing delay alignment on the audio data to be shared based on the local audio data to obtain delayed audio data aligned with the shared audio echo data in the time dimension includes: determining a delay duration of the shared audio echo data relative to the audio data to be shared based on the local audio data and the audio data to be shared; and aligning the audio data to be shared with the shared audio echo data in the time dimension according to the delay duration, to obtain the delayed audio data aligned with the shared audio echo data.
The delay duration is the amount by which the shared audio echo data, collected at the local end of the first terminal after propagating through the air, lags the audio data to be shared. The delay duration may be determined by performing delay detection based on the local audio data and the audio data to be shared.
Optionally, the first terminal may determine the delay duration of the shared audio echo data relative to the audio data to be shared, that is, how long the shared audio echo data in the local audio data lags the audio data to be shared. In a specific application, the first terminal may perform delay detection based on the local audio data and the audio data to be shared; for example, the delay duration may be determined from the correlation between the local audio data and the audio data to be shared. The first terminal may align the audio data to be shared with the shared audio echo data in the time dimension according to the determined delay duration; specifically, it may delay the audio data to be shared by the delay duration to obtain the delayed audio data. For example, if the first terminal determines that the delay duration t of the shared audio echo data relative to the audio data to be shared is 300 ms (milliseconds), the first terminal may delay the audio data to be shared by 300 ms to obtain delayed audio data aligned with the shared audio echo data in the time dimension.
In this embodiment, the delay duration of the shared audio echo data relative to the audio data to be shared is determined based on the local audio data and the audio data to be shared, and the audio data to be shared is aligned with the shared audio echo data in the time dimension according to the delay duration to obtain the delayed audio data. This reduces the echo impact of the shared audio echo data on the delayed audio data and ensures the quality of the mixed audio data sent to the second terminal, thereby improving the quality of the shared audio.
In an exemplary embodiment, determining the delay duration of the shared audio echo data relative to the audio data to be shared based on the local audio data and the audio data to be shared includes: obtaining local audio detection data corresponding to the current moment from the local audio data, and obtaining to-be-shared audio detection data corresponding to the current moment from the audio data to be shared; determining a correlation parameter between the local audio detection data and the to-be-shared audio detection data; and determining the delay duration of the shared audio echo data relative to the audio data to be shared according to the correlation parameter.
The current moment is the moment at which the delay detection processing is performed. During the voice call between the first terminal and the second terminal, delay detection may be performed multiple times at different moments to dynamically determine the delay durations corresponding to those moments for the delay alignment processing. In a specific application, delay detection may be performed periodically during the voice call, so that delay detection is carried out repeatedly for different moments and the delay duration corresponding to each moment is determined. The local audio detection data is the audio data taken from the local audio data for the delay detection processing, and the to-be-shared audio detection data is the audio data taken from the audio data to be shared for the delay detection processing. In some embodiments, the local audio data and the audio data to be shared may each be filtered to obtain the local audio detection data and the to-be-shared audio detection data used for delay detection; for example, audio data may be extracted from each according to a preset duration range before the current moment.
The correlation parameter characterizes the degree of correlation between the local audio detection data and the to-be-shared audio detection data; based on this degree of correlation, the lag of the shared audio echo data in the local audio detection data relative to the to-be-shared audio detection data can be determined, yielding the delay duration. In a specific application, treating the local audio data and the audio data to be shared as audio signals, the correlation parameter may include a cross-correlation function; by constructing the cross-correlation function between the local audio data and the audio data to be shared, the similarity and time-delay relationship between the two different signals can be described, and the time delay between them determined. In a specific implementation, the correlation parameter between the local audio detection data and the to-be-shared audio detection data may be obtained based on at least one of a Cross Power Spectral Density (CPSD) algorithm, a Kernel Density Estimation (KDE) algorithm, a correlation analysis algorithm, a Fourier transform algorithm, or a Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithm.
The first terminal may extract data from the local audio data and the audio data to be shared respectively, to obtain the local audio detection data and the to-be-shared audio detection data corresponding to the current moment. In a specific application, the local audio data and the audio data to be shared are continuous audio signals, and the data points of each may be sampled at the current moment to obtain local audio detection data and to-be-shared audio detection data that each comprise multiple data points. The first terminal may perform correlation analysis on the local audio detection data and the to-be-shared audio detection data; for example, a correlation parameter may be constructed from the data points of both using the GCC-PHAT algorithm or a cross power spectral density algorithm. The first terminal may then analyze the obtained correlation parameter, for example searching over the data points of the local audio detection data and the to-be-shared audio detection data based on the correlation parameter, to determine the delay duration of the shared audio echo data relative to the audio data to be shared.
In this embodiment, the delay duration of the shared audio echo data relative to the audio data to be shared is determined from the correlation parameter between the local audio detection data corresponding to the current moment in the local audio data and the to-be-shared audio detection data corresponding to the current moment in the audio data to be shared. Because only the data points corresponding to the current moment are used to determine the delay duration, the amount of data processed during delay detection is reduced and the efficiency of delay detection is improved.
In one exemplary embodiment, determining the correlation parameter between the local audio detection data and the to-be-shared audio detection data includes: performing frequency-domain conversion on the local audio detection data and the to-be-shared audio detection data respectively, to obtain local audio frequency-domain data corresponding to the local audio detection data and to-be-shared audio frequency-domain data corresponding to the to-be-shared audio detection data; and obtaining, based on a cross-correlation algorithm, the correlation parameter between the local audio frequency-domain data and the to-be-shared audio frequency-domain data from the data points included in each.
Frequency-domain conversion is an analysis method that converts a signal from the time domain to the frequency domain; its mathematical basis is the Fourier transform. Frequency-domain conversion reveals the frequency components of a signal, enabling a better understanding of its characteristics and more efficient, accurate signal processing in practical applications. Through frequency-domain conversion, the local audio detection data and the to-be-shared audio detection data can each be converted from the time domain to the frequency domain to obtain the corresponding frequency-domain data. The local audio frequency-domain data is the frequency-domain data corresponding to the local audio detection data, and the to-be-shared audio frequency-domain data is the frequency-domain data corresponding to the to-be-shared audio detection data. The cross-correlation algorithm computes the correlation between the local audio frequency-domain data and the to-be-shared audio frequency-domain data, and may specifically include the GCC-PHAT algorithm, so that the correlation parameter is obtained from the data points of both.
Specifically, the first terminal may determine the correlation parameter after converting the local audio detection data and the to-be-shared audio detection data into the frequency domain. The first terminal may perform frequency-domain conversion on each, specifically converting them into the local audio frequency-domain data and the to-be-shared audio frequency-domain data based on a Fourier transform algorithm, where each may include multiple data points. Based on a cross-correlation algorithm, the first terminal may determine the correlation parameter using the data points included in the local audio frequency-domain data and the to-be-shared audio frequency-domain data. In a specific implementation, the cross-correlation algorithm may include the GCC-PHAT algorithm; that is, the first terminal may construct a correlation expression from the data points of the local audio frequency-domain data and the to-be-shared audio frequency-domain data according to the GCC-PHAT algorithm, to obtain the correlation parameter between them.
Further, determining the delay duration of the shared audio echo data relative to the audio data to be shared according to the correlation parameter includes: determining the number of delay data points of the local audio frequency-domain data relative to the to-be-shared audio frequency-domain data based on the correlation parameter; and determining the delay duration of the shared audio echo data relative to the audio data to be shared according to the number of delay data points.
The number of delay data points is the number of data points by which a data point in the local audio frequency-domain data lags the corresponding data point in the to-be-shared audio frequency-domain data. For example, suppose the local audio frequency-domain data and the to-be-shared audio frequency-domain data each include 500 data points, and the 320th data point A in the local audio frequency-domain data corresponds to the 20th data point B in the to-be-shared audio frequency-domain data, i.e., data point A and data point B carry the same audio information; it can then be determined that data point A lags data point B by 300 data points, so the number of delay data points is 300. The delay duration may be determined from the number of delay data points and the interval between data points, where the interval is determined by the sampling rate of the local audio detection data and the to-be-shared audio detection data.
For example, the first terminal may analyze the data points of the local audio frequency-domain data and the to-be-shared audio frequency-domain data based on the correlation parameter to determine the number of delay data points: the first terminal searches for the data-point offset that maximizes the value of the correlation parameter and takes that offset as the number of delay data points. The first terminal may then obtain the delay duration of the shared audio echo data relative to the audio data to be shared from the number of delay data points; specifically, the delay duration may be computed as the product of the number of delay data points and the interval between data points, as illustrated in the sketch below.
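The following Python sketch illustrates one way such a frequency-domain search could work, using the GCC-PHAT algorithm named above: the cross power spectrum is phase-normalized, transformed back, and the lag with the maximum correlation value is converted from data points to milliseconds. The FFT sizing, the 1 kHz detection rate, and the synthetic test signal are assumptions for illustration only.
```python
import numpy as np

def gcc_phat_delay(mic: np.ndarray, shared: np.ndarray, sample_rate: int) -> float:
    """Estimate, in milliseconds, how far the echo in `mic` lags `shared`
    using generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(mic) + len(shared)            # FFT size large enough to avoid circular aliasing
    mic_f = np.fft.rfft(mic, n=n)
    shared_f = np.fft.rfft(shared, n=n)
    cross = mic_f * np.conj(shared_f)     # cross power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase information only
    corr = np.fft.irfft(cross, n=n)
    lag = int(np.argmax(corr[: n // 2]))  # search non-negative lags only
    return 1000.0 * lag / sample_rate     # number of delay data points -> milliseconds

# Example: a synthetic echo delayed by 300 samples at 1 kHz should yield ~300 ms.
fs = 1000                                 # e.g. audio already downsampled to 1 kHz
shared = np.random.randn(fs)
mic = np.concatenate([np.zeros(300), 0.5 * shared])  # attenuated echo, 300-sample lag
print(gcc_phat_delay(mic, shared, fs))    # approximately 300.0
```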
In this embodiment, after the local audio detection data and the to-be-shared audio detection data are converted into the frequency domain, the correlation parameter is obtained from the data points of the respective frequency-domain data based on a cross-correlation algorithm; the number of delay data points is obtained by analyzing the correlation parameter, and the delay duration is determined from the number of delay data points. Correlation analysis of the local audio detection data and the to-be-shared audio detection data can thus be performed in the frequency domain, yielding an accurate delay duration. This ensures the delay alignment effect and the quality of the mixed audio data sent to the second terminal, thereby helping improve the quality of the shared audio.
In an exemplary embodiment, determining the delay duration of the shared audio echo data relative to the audio data to be shared according to the correlation parameter includes: determining a current-moment delay duration of the shared audio echo data relative to the audio data to be shared according to the correlation parameter; acquiring a historical delay duration of the shared audio echo data relative to the audio data to be shared; and determining the delay duration of the shared audio echo data relative to the audio data to be shared according to the current-moment delay duration and the historical delay duration.
The current-moment delay duration is the delay duration determined from the audio data at the current moment, which may include the local audio detection data and the to-be-shared audio detection data corresponding to the current moment. The historical delay duration is the delay duration determined from the audio data at historical moments, which may include the local audio detection data and the to-be-shared audio detection data corresponding to those historical moments. In a specific application, the historical delay duration may directly comprise the delay durations corresponding to one or more historical moments, or may be derived from them, for example as a weighted average of the delay durations corresponding to the historical moments.
In an exemplary embodiment, the first terminal may determine the current-moment delay duration based on the correlation parameter corresponding to the current moment; the current-moment delay duration represents the delay, obtained from the audio data at the current moment, of the shared audio echo data relative to the audio data to be shared. The first terminal may also acquire the historical delay duration, which is obtained from the audio data at historical moments; for example, it may be derived from the audio data at each of the N moments before the current moment, with each historical moment having a corresponding delay duration. The first terminal may combine the current-moment delay duration and the historical delay duration to obtain the delay duration of the shared audio echo data relative to the audio data to be shared; for example, the first terminal may weight the current-moment delay duration and the historical delay duration to obtain the final delay duration, as in the sketch following this paragraph.
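A possible form of such weighting is sketched below, under the assumption that the historical delay durations are summarized by their mean and that the current estimate receives a fixed weight; neither choice is specified in the application.
```python
import numpy as np

def combine_delay(current_ms: float, history_ms: list[float],
                  current_weight: float = 0.5) -> float:
    """Combine the current-moment estimate with a weighted view of history.
    History is summarized by its mean here; the weights are illustrative."""
    if not history_ms:
        return current_ms
    historical = float(np.mean(history_ms))
    return current_weight * current_ms + (1.0 - current_weight) * historical

# Example: the last N=3 detections temper a jumpy current reading.
print(combine_delay(340.0, [300.0, 302.0, 298.0]))  # 320.0
```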
In this embodiment, the final delay duration is determined by combining the current-moment delay duration and the historical delay duration, so the current-moment estimate can be adjusted with reference to history. This ensures the stability and accuracy of the delay duration, and thereby the quality of the mixed audio data sent to the second terminal, improving the quality of the shared audio.
In an exemplary embodiment, the historical delay duration includes the previous-moment delay duration corresponding to the moment immediately before the current moment, and determining the delay duration of the shared audio echo data relative to the audio data to be shared according to the current-moment delay duration and the historical delay duration includes: smoothing the current-moment delay duration and the previous-moment delay duration according to a smoothing coefficient, to obtain the delay duration of the shared audio echo data relative to the audio data to be shared.
The previous-moment delay duration is the delay duration corresponding to the moment immediately before the current moment, and may be obtained by performing delay detection on the audio data corresponding to that moment. The smoothing coefficient adjusts the weight of the previous-moment delay duration; specifically, the current-moment delay duration is smoothed based on the smoothing coefficient and the previous-moment delay duration to obtain the final delay duration. The smoothing coefficient may be set according to actual needs, for example to 0.98 or 0.9. Optionally, when the historical delay duration acquired by the first terminal includes the previous-moment delay duration, the first terminal may obtain a preset smoothing coefficient and smooth the current-moment delay duration and the previous-moment delay duration accordingly; specifically, it may compute a weighted combination of the two according to the smoothing coefficient, to obtain the delay duration of the shared audio echo data relative to the audio data to be shared, as sketched below.
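A minimal sketch of this smoothing, assuming the common exponential form delay = α·previous + (1 − α)·current with α as the smoothing coefficient; the exact weighting formula is not spelled out here, so this form is an assumption.
```python
def smooth_delay(current_ms: float, previous_ms: float, alpha: float = 0.98) -> float:
    """Exponentially smooth the delay estimate across successive detections.
    alpha is the smoothing coefficient; larger values change the estimate more slowly."""
    return alpha * previous_ms + (1.0 - alpha) * current_ms

# Example: a noisy 320 ms reading barely moves a stable 300 ms estimate.
print(smooth_delay(current_ms=320.0, previous_ms=300.0))  # 300.4
```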
In this embodiment, smoothing the current-time delay duration with the previous-time delay duration according to the smoothing coefficient stabilizes the delay duration and keeps it accurate, which safeguards the quality of the mixed audio data sent to the second terminal and improves the quality of the shared audio.
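For illustration, this smoothing step might be sketched as follows (a minimal sketch; the function name and the default coefficient of 0.95 are illustrative choices in line with the values mentioned above):

```python
def smooth_delay(current_delay: float, previous_delay: float, alpha: float = 0.95) -> float:
    """Exponential smoothing of the delay estimate: alpha weights the
    previous estimate, (1 - alpha) the fresh per-frame detection, so a
    larger alpha damps jumps in the detected delay."""
    return alpha * previous_delay + (1 - alpha) * current_delay
```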
In an exemplary embodiment, obtaining the local audio detection data corresponding to the current time from the local audio data, and obtaining the audio detection data to be shared corresponding to the current time from the audio data to be shared, includes: when a delay detection trigger condition is met, downsampling the local audio data and the audio data to be shared respectively, obtaining downsampled local audio data and downsampled audio data to be shared; screening the local audio detection data corresponding to the current time out of the downsampled local audio data according to a data screening condition; and screening the audio detection data to be shared corresponding to the current time out of the downsampled audio data to be shared according to the same data screening condition.
The delay detection trigger condition determines whether delay detection should be triggered, and can be set as needed; for example, it may include reaching a delay detection period, receiving a delay detection instruction, or other trigger conditions. Downsampling is the process of reducing the sampling rate of a signal; the new sampling rate must still satisfy the Nyquist sampling theorem to avoid aliasing and distortion. In a specific application, a digital low-pass filter may be applied to the signal before decimating sample points, thereby reducing the sampling rate. Downsampling effectively reduces the data volume of the signal, saving storage space and computing resources. By downsampling the local audio data and the audio data to be shared, for example from a 48 kHz sampling rate down to a 1 kHz sampling rate, the amount of audio data over which delay detection runs can be reduced.
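Such a downsampling step might be sketched as follows (a minimal sketch assuming SciPy is available; the function name and the 48 kHz to 1 kHz rates echo the example above, and `resample_poly` applies an anti-aliasing FIR filter before decimation, as the Nyquist constraint requires):

```python
from scipy.signal import resample_poly

def downsample_for_detection(x, orig_rate: int = 48000, target_rate: int = 1000):
    """Reduce the sampling rate of signal x for delay detection.
    resample_poly low-pass filters before decimating, avoiding aliasing."""
    assert orig_rate % target_rate == 0, "rates assumed to be integer-related"
    return resample_poly(x, up=1, down=orig_rate // target_rate)
```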
The data screening condition further filters the data points in the downsampled audio data to obtain the audio data used for delay detection at the current time. It may be defined relative to the current time, for example as a preset time span or a preset number of data points before the current time. For example, if the current time is 1000 ms and the data screening condition is the 300 ms before the current time, the audio data from 700 ms to 1000 ms is retained, specifically the local audio detection data and the audio detection data to be shared corresponding to 700 ms to 1000 ms. Alternatively, the condition may be a preset N data points before the current time, in which case the N data points before the current time are retained as the audio data for delay detection.
The first terminal may monitor the preset delay detection trigger condition; when the condition is detected to be met, delay detection is deemed necessary, and the first terminal downsamples the local audio data and the audio data to be shared respectively, for example according to a preset downsampling rate, obtaining the downsampled local audio data and the downsampled audio data to be shared. The first terminal then takes the preset data screening condition and screens both downsampled streams accordingly, obtaining the local audio detection data and the audio detection data to be shared that correspond to the current time. For example, the first terminal may retain, from the downsampled local audio data and the downsampled audio data to be shared respectively, the data within a time span T before the current time.
In this embodiment, downsampling the audio data and screening it according to the data screening condition when the trigger condition is met allows delay detection to run dynamically and repeatedly on the local audio data and the audio data to be shared, which keeps the delay duration accurate; meanwhile, downsampling and the screening condition reduce the data volume of each detection, improving the processing efficiency of delay detection.
In an exemplary embodiment, aligning the audio data to be shared with the shared audio echo data in the time dimension according to the delay duration, to obtain delayed audio data aligned with the shared audio echo data, includes: buffering the audio data to be shared and determining the buffering duration of the audio data to be shared; and when the buffering duration reaches the delay duration, obtaining delayed audio data aligned with the shared audio echo data in the time dimension.
The buffering duration is the time elapsed since the audio data to be shared was buffered; timing can start as soon as the data is buffered. The first terminal may buffer the audio data to be shared, for example by storing it in a buffer queue, and start the buffering timer to track the buffering duration. The first terminal compares the buffering duration with the delay duration; once the buffering duration reaches the delay duration, the first terminal obtains, from the buffered audio data to be shared, delayed audio data that is aligned with the shared audio echo data in the time dimension. In some embodiments, after storing the audio data to be shared in the buffer queue, the first terminal fetches the delayed audio data from the queue when the buffering duration reaches the delay duration.
In this embodiment, the audio data to be shared is buffered and released as delayed audio data once the buffering duration reaches the delay duration; this reduces the echo impact of the shared audio echo data on the delayed audio data, safeguards the quality of the mixed audio data sent to the second terminal, and improves the quality of the shared audio.
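A buffer of this kind might be sketched as follows (a minimal sketch; the class name, the 20 ms frame size, and the frame-count bookkeeping are illustrative assumptions rather than the embodiment's exact mechanism):

```python
from collections import deque

class DelayBuffer:
    """Hold frames of the audio to be shared in a FIFO queue and release
    each frame only once its buffered duration reaches the delay duration."""
    def __init__(self, delay_ms: float, frame_ms: float = 20.0):
        self.frames = deque()
        # number of frames that must sit in the queue before release
        self.capacity = max(1, round(delay_ms / frame_ms))

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) > self.capacity:
            return self.frames.popleft()  # buffered long enough: aligned frame
        return None                       # still filling; no aligned output yet
```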
In an exemplary embodiment, the local audio data further includes call audio echo data of the far-end call audio data. As shown in fig. 5, the audio sharing method then further includes a far-end voice echo cancellation process, specifically including:
Step 502, echo cancellation processing is performed on call audio echo data included in the local audio data through the far-end call audio data, so as to obtain the local audio data after the call audio echo cancellation.
The far-end call audio data is played at the local end of the first terminal through a local playing device associated with the first terminal. When the first terminal collects audio at the local end, it picks up both the voice of the first terminal's user and the far-end call audio data radiated into the air by the local playing device; the latter forms the call audio echo data, namely the audio re-collected by the first terminal's microphone after the played far-end call audio data has propagated through the air. Echo cancellation is a technique for removing the echo received by a microphone; its basic principle is to estimate and cancel the echo path with an adaptive filter, and it can be implemented with various echo cancellation algorithms, such as the least mean square (LMS) algorithm or the normalized least mean square (NLMS) algorithm.
Optionally, the first terminal may run echo detection on the collected local audio data to determine whether call audio echo data of the far-end call audio data was picked up. When the local audio data is found to include such call audio echo data, the first terminal may perform echo cancellation on it, specifically by using the far-end call audio data as the reference signal and applying any of various echo cancellation algorithms to the local audio data, obtaining local audio data after call audio echo cancellation in which the call audio echo data is at least partially removed, thereby ensuring the audio quality of the local audio data.
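For illustration, an echo cancellation step of the NLMS family mentioned above could be sketched as follows (a minimal sketch; the function name, filter length, and step size are assumptions, and production systems use considerably more elaborate adaptive filters):

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     filter_len: int = 256, mu: float = 0.5,
                     eps: float = 1e-6) -> np.ndarray:
    """Cancel the echo of the reference signal ref (far-end call audio)
    from the microphone capture mic with a normalized LMS filter.
    Assumes ref is at least as long as mic and time-aligned with it."""
    w = np.zeros(filter_len)                     # adaptive echo-path estimate
    e = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        x = ref[n - filter_len + 1:n + 1][::-1]  # ref[n] ... ref[n-L+1]
        y = w @ x                                # estimated echo at time n
        e[n] = mic[n] - y                        # residual = mic minus estimate
        w += (mu / (x @ x + eps)) * e[n] * x     # normalized gradient update
    return e
```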
Step 504, delay alignment is performed on the audio data to be shared based on the local audio data after the call audio echo cancellation, so as to obtain delay audio data aligned with the shared audio echo data in the time dimension.
Specifically, the first terminal may perform delay alignment on the audio data to be shared in the time dimension based on the local audio data after call audio echo cancellation, delaying the audio data to be shared for a certain period to obtain the delayed audio data.
Step 506, mixing the local audio data after the call audio echo cancellation with the delayed audio data, and sending the mixed audio data obtained by mixing to the second terminal for playing based on the voice call.
Specifically, the first terminal may mix the local audio data after call audio echo cancellation with the delayed audio data to obtain mixed audio data, and send the mixed audio data to the second terminal, thereby sharing the audio data played at the local end of the first terminal with the second terminal over the voice call.
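The mixing step itself can be a clipped sample-wise sum, sketched below (a minimal sketch assuming float audio in [-1, 1]; the function name is illustrative, and real mixers may apply gain control instead of hard clipping):

```python
import numpy as np

def mix_audio(local_audio: np.ndarray, delayed_audio: np.ndarray) -> np.ndarray:
    """Sum the echo-cancelled local audio with the delayed shared audio,
    clipping to [-1, 1] so the mix stays in range before encoding."""
    n = min(len(local_audio), len(delayed_audio))
    mixed = local_audio[:n].astype(np.float32) + delayed_audio[:n].astype(np.float32)
    return np.clip(mixed, -1.0, 1.0)
```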
In a specific application, as shown in fig. 6, a first terminal conducts a voice call with a second terminal: the second terminal sends far-end call audio data to the first terminal, which passes the received far-end call audio data to an associated first local playing device for playback; meanwhile, the first terminal passes the audio data to be shared to an associated second local playing device for playback. A microphone associated with the first terminal collects audio at the local end of the first terminal; the collected local audio data includes the shared audio echo data of the audio data to be shared, the near-end call audio data spoken in the voice call in response to the far-end call audio data, and the call audio echo data of the far-end call audio data, and the microphone passes this local audio data to the first terminal. The first terminal performs echo cancellation on the local audio data using the far-end call audio data, obtaining local audio data after call audio echo cancellation in which the call audio echo data is at least partially removed. The first terminal then performs delay alignment on the audio data to be shared based on the local audio data after call audio echo cancellation, obtaining delayed audio data aligned in the time dimension with the shared audio echo data in the local audio data. Finally, the first terminal mixes the local audio data after call audio echo cancellation with the delayed audio data to obtain mixed audio data, and sends the mixed audio data to the second terminal for playback.
In this embodiment, when the local audio data further includes call audio echo data of the far-end call audio data, echo cancellation may be performed on that echo data using the far-end call audio data, and the audio sharing processing then proceeds on the local audio data after call audio echo cancellation; this reduces the influence of the call audio echo data, safeguards the quality of the mixed audio data sent to the second terminal, and thereby improves the quality of the shared audio.
In an exemplary embodiment, the audio sharing method further includes: and acquiring far-end call audio data sent to the first terminal by the second terminal through the call application in the process of carrying out voice call between the first terminal and the second terminal through the call application.
The call application may be a VOIP application; with a VOIP application installed on each terminal, users can hold voice calls over a network protocol. For example, the first terminal and the second terminal may hold a voice call through the call application, and the first terminal may receive, through the call application, the far-end call audio data sent by the second terminal.
Further, mixing the local audio data with the delayed audio data, and transmitting the mixed audio data obtained by mixing to the second terminal for playing based on the voice call, including: mixing the local audio data with the delay audio data to obtain mixed audio data; and sending the mixed audio data to the second terminal for playing through the call application.
Optionally, the first terminal may mix the local audio data with the delayed audio data to obtain the mixed audio data, and send the mixed audio data to the second terminal through the call application for playback by the second terminal.
In this embodiment, the first terminal receives the far-end call audio data sent by the second terminal through the call application and returns the mixed audio data through the same application, so the audio played at the local end is shared while the voice call proceeds; and because the mixed audio data suppresses the echo impact of the shared audio echo data on the delayed audio data, the quality of the mixed audio data sent to the second terminal is safeguarded and the quality of the shared audio improves.
In an exemplary embodiment, the audio sharing method further includes: performing echo cancellation processing on the shared audio echo data contained in the local audio data through the audio data to be shared to obtain the local audio data after the shared audio echo cancellation; and mixing the local audio data after the shared audio echo cancellation with the audio data to be shared, and sending the obtained mixed audio data to a second terminal for playing based on voice communication.
Specifically, when the local audio data includes shared audio echo data of the audio data to be shared, the first terminal may perform echo cancellation on that echo data, using the audio data to be shared as the reference signal and applying any of various echo cancellation algorithms to the shared audio echo data in the local audio data, obtaining local audio data after shared audio echo cancellation in which the shared audio echo data is at least partially removed, thereby ensuring the audio quality of the local audio data.
In this embodiment, echo cancellation is performed on the shared audio echo data in the local audio data using the audio data to be shared; the local audio data after shared audio echo cancellation is then mixed with the audio data to be shared, and the mixed audio data is sent to the second terminal for playback. This reduces the influence of the shared audio echo data, safeguards the quality of the mixed audio data sent to the second terminal, and improves the quality of the shared audio.
The application also provides an application scene, which applies the audio sharing method. Specifically, the application of the audio sharing method in the application scene is as follows:
In the course of VOIP voice calls, there is often a need to share local audio. For example, during an audio-video conference call, a participant may want the counterpart to see their computer desktop (the so-called shared screen) and to hear the sound the computer is playing, such as a song played in a music player for the counterpart to hear, or the sound of a web video playing in a browser. The audio sharing function allows audio played by other applications to be sent in real time to the remote user during a call held in an audio-video application, while the normal call with the remote user continues. More specifically, for the scenario in which two different speakers play out loud during audio sharing, one for the far-end speaker's voice and one for the local audio, the audio sharing method of this embodiment lets the user select, through the VOIP application's UI (user interface), a speaker different from the system default playing device as the call playing device; the audio played by the system default device is captured through an operating system API (Application Programming Interface) and mixed into the stream sent to the far end, and to eliminate the echo formed when the default device's sound propagates into the microphone, the method masks that echo by delay alignment.
Specifically, the user may select a playing device and a recording device on the local VOIP application UI. A terminal such as a PC (Personal Computer) may be connected to several audio peripherals at once, for example a built-in speaker together with a 3.5 mm wired earphone, a USB (Universal Serial Bus) speaker, and a Bluetooth playing device; the system designates one speaker as the system default device, and when an ordinary application plays music or video, the sound goes through that default device. Some applications, for example professional audio-video conference APPs (Applications), additionally provide a UI on which the user can select the playing and recording devices the APP should use. As shown in fig. 7, the user may keep the system "default" playing device for the sound arriving from the far end of the VOIP call, or select another speaker device, for example "speaker 1", as the VOIP application's playing device, so that while the user talks with the counterpart, the (far-end) counterpart's voice plays from the "speaker 1" device. The terminal may likewise be connected to several recording devices, with local audio collection running through the selected one; for example, choosing the "system default" option collects audio through the system's default recording device.
When the user selects "speaker 1" while the system default playing device is "speaker 2", the playing device of the VOIP application and the default playing device of the system are not the same device. In that case, the far-end speaker's voice plays from "speaker 1", while sound played by other APPs, such as a music application, goes through the system default device, namely "speaker 2". The sound radiated by "speaker 2" travels through the air into the microphone and forms an echo, which degrades the sound quality of the audio sharing function and reduces the user experience.
Specifically, in today's digital device calls, on PCs, mobile phones and the like, the far-end and near-end users each install a VOIP application on the operating system; the application collects the local speaker's voice from the hardware microphone through the operating system API and sends the received far-end audio to speaker 1 for playback. After the two parties initiate the VOIP audio-video call, the far-end speaker's voice data is transmitted to the near end, depacketized and decoded into a digital audio signal, and that signal is sent to speaker 1 for playback (assuming the user selected speaker 1 through the UI). As shown in fig. 8, in a two-person or multi-person VOIP call architecture, the near-end call application (VOIP application) receives data packets from the far end and decodes them into a digital voice signal x; this signal received from the far end is generally called the far-end voice signal and also serves as the reference signal for echo cancellation. The sound played by speaker 1 propagates through the air (echo path 1) and enters the near-end microphone, forming an acoustic echo x'. The sound d collected by the microphone thus contains the near-end speaker's voice s in addition to the acoustic echo x', i.e. d = s + x'. To prevent the echo entering the near-end microphone from being sent back to the far end, where the far-end speaker would hear their own voice (i.e. an echo), the near end generally performs echo cancellation, followed by further processing such as noise suppression and gain adjustment, before the near-end voice is encoded, packetized, and sent. Specifically, the echo cancellation module uses the far-end voice signal x as the reference signal to cancel the far-end voice echo x' in the microphone signal d, yielding an echo-cancelled signal e in which at least part of x' is removed; e thus contains the slightly damaged near-end voice s' or the lossless near-end voice s.
Further, as shown in fig. 8, if the audio sharing function is enabled so that sound played by other local APPs is sent to the counterpart while talking, the sound routed to speaker 2 (assumed to be the system default playing device) must be captured back through the operating system API, giving the signal m; the mixing module superimposes m on e, and the encoding-and-packetizing module sends the result to the far end. The mixed signal g ideally contains s' and the sound m played by the other applications. However, if speaker 2 is a high-power loudspeaker box (as opposed to an earphone), the sound it radiates also enters the near-end microphone after traveling through the air (echo path 2), so the d signal contains the near-end voice s, the far-end voice echo x', and an echo m' of the sound being played by the other applications. Since x' is cancelled, e after the echo cancellation module contains s' + m', and the mixed signal is g = s' + m + m'. Besides the near-end voice, g thus carries the audio played by the other media applications twice: the original signal m and the distorted copy m' that has passed through speaker playback, air propagation, and microphone capture. In general, if the volume of m' is small enough (for example, playback through earphones produces almost no echo), the impact is minor; likewise, if the delay of m' relative to m is small, say within a few tens of milliseconds, the two sounds almost overlap and the effect heard by the far-end user is acceptable. But if the delay of m' relative to m is large, the same content is clearly heard twice after mixing, i.e. there is an echo effect.
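Collecting the notation of this paragraph into formulas (a restatement only; the echo-estimate symbol $\hat{x}'$ is written out here merely for clarity):

$$d = s + x' + m', \qquad e = d - \hat{x}' \approx s' + m', \qquad g = e + m = s' + m + m'$$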
Based on this, the audio sharing method of this embodiment applies delay alignment to the shared audio. Specifically, as shown in fig. 9, a delay alignment step, carried out by a delay alignment module, is added. Since e contains an approximate copy of m, the delay t can be obtained by computing the correlation between e and m; delaying m by the duration t yields a signal m″ aligned in time with m', so that after mixing g = s' + m' + m″, and because m' and m″ overlap in time, the user perceives no echo from their superposition.
Further, the delay alignment step takes the two signals m and e as input and detects the delay t of the echo signal m' in e relative to m (since the signals are digital, t can be expressed as a number of delayed samples); buffering m for the delay t yields m″, aligned in time with m', so that no echo can be heard. Specifically, as shown in fig. 10, e is downsampled to a lower-rate signal ed, and m is downsampled to a lower-rate signal md; downsampling, for example from a 48 kHz to a 1 kHz sampling rate, greatly reduces the computation required for the correlation. Delay detection then runs on the lower-rate signals ed and md to obtain the delay duration t, and m is buffered and aligned according to t to obtain m″, which overlaps m' in time.
Further, the delay detection may run at a fixed interval, for example every 20 ms, computing the delay duration with the GCC-PHAT algorithm over the stretch of the ed and md signals closest to the current time (depending on the expected delay, between 100 ms and 1000 ms of signal may be taken; for example 500 ms). GCC-PHAT (generalized cross-correlation with phase transform) estimates the time difference between two signals from their cross-correlation after phase-transform weighting, and is commonly used for tasks such as sound source localization. For example, at 1000 ms the ed and md signals from 500 ms to 1000 ms undergo the GCC-PHAT computation to yield the kth delay value; at 1020 ms the ed and md signals from 520 ms to 1020 ms yield the (k+1)th delay value; in this way a delay sample count is obtained every 20 ms.
Specifically, to compute the kth delay value, take for example the most recent 500 ms of the ed signal at a 1000 Hz sampling rate, so the frame length is L = 1000 × 0.5 = 500 samples; the md frame has the same sampling rate and duration, so its length is also L = 500. The kth frames of ed and md are converted to the frequency domain by the discrete Fourier transform (DFT), specifically as follows:

$$E_k(f) = \sum_{n=0}^{L-1} ed_k(n)\, e^{-j 2\pi f n / L}, \qquad M_k(f) = \sum_{n=0}^{L-1} md_k(n)\, e^{-j 2\pi f n / L}$$

where $E_k(f)$ and $M_k(f)$ are the Fourier-transform results of the ed and md signals respectively, $f$ is the frequency bin, and $n$ indexes the samples within the frame.
A correlation parameter between the two signals is then calculated through the GCC-PHAT algorithm; it specifically comprises the correlation value shown in the following formula:

$$R_k(\tau) = \sum_{f} \frac{E_k(f)\, M_k^{*}(f)}{\left| E_k(f)\, M_k^{*}(f) \right|}\, e^{\,j 2\pi f \tau / L}$$

where $(\cdot)^{*}$ denotes the conjugate operation, $|\cdot|$ the absolute-value operation, and $\tau$ the index of each sample position. The number of sample points by which ed is delayed relative to md at the kth frame is obtained by searching for the sample position with the maximum correlation value $R_k(\tau)$, specifically:

$$\hat{n}_k = \arg\max_{\tau} R_k(\tau)$$

where $\hat{n}_k$ is the number of samples by which ed is delayed relative to md at the kth frame.
Then the previously calculated delay-sample estimate and the newly calculated delay-sample count $\hat{n}_k$ are smoothed to obtain a more stable estimate of the delay sample number, specifically:

$$\bar{n}_k = \beta\, \bar{n}_{k-1} + (1-\beta)\, \hat{n}_k$$

where $\beta$ is the smoothing factor and may, for example, take the value 0.95.
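Putting the three formulas above together, the per-frame delay detection might be sketched as follows (a minimal sketch; the FFT-based evaluation, the zero-padding choice, and the function name are assumptions made for compactness rather than the embodiment's exact procedure):

```python
import numpy as np

def gcc_phat_delay(ed: np.ndarray, md: np.ndarray,
                   prev_estimate=None, beta: float = 0.95) -> float:
    """Estimate how many samples ed lags md via GCC-PHAT, then smooth
    with the previous estimate using the factor beta (e.g. 0.95)."""
    L = min(len(ed), len(md))
    n_fft = 2 * L                              # zero-pad to avoid circular wrap
    E = np.fft.rfft(ed[:L], n_fft)
    M = np.fft.rfft(md[:L], n_fft)
    R = E * np.conj(M)
    R /= np.abs(R) + 1e-12                     # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n_fft)
    cc = np.concatenate((cc[-L:], cc[:L]))     # reorder lags to [-L, L)
    n_hat = int(np.argmax(cc)) - L             # positive lag: ed delayed vs md
    if prev_estimate is not None:
        return beta * prev_estimate + (1 - beta) * n_hat
    return float(n_hat)
```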
After the delay duration t of ed relative to md is calculated, t being representable by the delay sample number $\bar{n}_k$, alignment may be achieved by buffering the m signal, specifically in a FIFO (First In, First Out) fashion: delaying the m signal by the corresponding number of samples (converted back to the original sampling rate, since the detection ran on downsampled signals) yields the signal m″ aligned with m' in the time dimension.
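A fixed delay line of this kind could be sketched as follows (a minimal sketch; the class name is illustrative, and the delay is given in samples at the stream's own sampling rate, i.e. the detected delay already converted from the downsampled rate):

```python
import numpy as np

class SampleDelayLine:
    """FIFO delay line: each input sample reappears exactly
    delay_samples later, producing the aligned signal m''."""
    def __init__(self, delay_samples: int):
        self.buf = np.zeros(delay_samples, dtype=np.float32)

    def process(self, frame: np.ndarray) -> np.ndarray:
        joined = np.concatenate((self.buf, frame.astype(np.float32)))
        out = joined[:len(frame)]         # oldest samples: the delayed output
        self.buf = joined[len(frame):]    # retain the newest delay_samples
        return out
```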
In VOIP voice calls, there is often a need to share local audio: for example, during a call the counterpart wants to see one's computer desktop (the so-called shared screen) and to hear the sound one's computer is playing. The audio sharing method of this embodiment lets the computer play local audio through a loudspeaker during the call while sharing the played sound with the remote user in real time, with the normal call unaffected. The method applies in particular to two scenarios. In the first, an audio or video file is selected inside the VOIP application and its sound shared in real time with the far-end user: the user selects a local audio/video file in the application, which is decoded into a local digital audio stream; that stream is mixed with the far-end speaker's voice and sent to the loudspeaker for playback, and is also mixed with the echo-cancelled microphone stream before being encoded and sent to the far-end user. In the second, audio played by other APPs on the operating system, outside the VOIP application, is shared in real time with the far-end user: the loopback audio stream is captured through the operating system's loopback capture interface and used as the reference signal to cancel the near-end echo collected by the microphone, while another echo cancellation module removes the far-end voice from the loopback stream; the two resulting signals are mixed, encoded, and sent to the far-end user.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides an audio sharing device for realizing the audio sharing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of the embodiment of one or more audio sharing devices provided below may be referred to the limitation of the audio sharing method described above, and will not be repeated here.
In an exemplary embodiment, as shown in fig. 11, there is provided an audio sharing apparatus 1100, including: an audio to be shared determination module 1102, a local audio acquisition module 1104, a delay alignment module 1106, and an audio mixing module 1108, wherein:
The audio to be shared determining module 1102 is configured to determine audio data to be shared of the first terminal in a process of performing a voice call between the first terminal and the second terminal; the audio data to be shared and the remote call audio data sent to the first terminal by the second terminal are respectively played at the local end of the first terminal through different local playing devices associated with the first terminal;
A local audio acquisition module 1104, configured to acquire local audio data acquired at a local end of the first terminal; the local audio data comprise shared audio echo data of audio data to be shared and near-end call audio data for performing voice call aiming at the far-end call audio data;
The delay alignment module 1106 is configured to perform delay alignment on the audio data to be shared based on the local audio data, so as to obtain delay audio data aligned with the shared audio echo data in a time dimension;
the audio mixing module 1108 is configured to mix the local audio data with the delayed audio data, and send the mixed audio data obtained by mixing to the second terminal for playing based on the voice call.
In one embodiment, the delay alignment module 1106 is further configured to determine a delay duration of the shared audio echo data relative to the audio data to be shared based on the local audio data and the audio data to be shared; and aligning the audio data to be shared with the shared audio echo data in the time dimension according to the delay time length to obtain delay audio data aligned with the shared audio echo data.
In one embodiment, the delay alignment module 1106 is further configured to obtain local audio detection data corresponding to the current time from the local audio data, and obtain audio detection data to be shared corresponding to the current time from the audio data to be shared; determining a correlation parameter between the local audio detection data and the audio detection data to be shared; and determining the delay time length of the shared audio echo data relative to the audio data to be shared according to the correlation parameters.
In an embodiment, the delay alignment module 1106 is further configured to perform frequency domain conversion on the local audio detection data and the audio detection data to be shared, so as to obtain local audio frequency domain data corresponding to the local audio detection data and audio frequency domain data to be shared corresponding to the audio detection data to be shared; based on a cross-correlation algorithm, obtaining correlation parameters between the local audio frequency domain data and the audio frequency domain data to be shared according to data points included in the local audio frequency domain data and the audio frequency domain data to be shared; determining the number of delay data points of the local audio frequency domain data relative to the audio frequency domain data to be shared based on the correlation parameters; and determining the delay time length of the shared audio echo data relative to the audio data to be shared according to the number of the delay data points.
In one embodiment, the delay alignment module 1106 is further configured to determine a delay duration of the shared audio echo data relative to a current time of the audio data to be shared according to the correlation parameter; acquiring historical delay time of the shared audio echo data relative to the audio data to be shared; and determining the delay time of the shared audio echo data relative to the audio data to be shared according to the delay time of the current moment and the historical delay time.
In one embodiment, the historical delay time length includes a previous time delay time length corresponding to a previous time of the current time; the delay alignment module 1106 is further configured to perform smoothing processing on the delay duration at the current time and the delay duration at the previous time according to the smoothing coefficient, so as to obtain a delay duration of the shared audio echo data relative to the audio data to be shared.
In one embodiment, the delay alignment module 1106 is further configured to, when the delay detection trigger condition is met, respectively downsample the local audio data and the audio data to be shared, to obtain downsampled local audio data and downsampled audio data to be shared; according to the data screening conditions, local audio detection data corresponding to the current moment are screened out from the local audio data after downsampling; and screening the audio detection data to be shared corresponding to the current moment from the audio data to be shared after downsampling according to the data screening conditions.
In one embodiment, the delay alignment module 1106 is further configured to cache the audio data to be shared, and determine a cache duration of the audio data to be shared; and when the buffer time length reaches the delay time length, obtaining delay audio data aligned with the shared audio echo data in the time dimension.
In one embodiment, the local audio data further includes call audio echo data of the far-end call audio data, and the apparatus further includes a call audio echo cancellation module configured to: perform echo cancellation on the call audio echo data included in the local audio data using the far-end call audio data, obtaining local audio data after call audio echo cancellation; perform delay alignment on the audio data to be shared based on the local audio data after call audio echo cancellation, obtaining delayed audio data aligned with the shared audio echo data in the time dimension; and mix the local audio data after call audio echo cancellation with the delayed audio data, sending the mixed audio data obtained by mixing to the second terminal for playing based on the voice call.
In one embodiment, the apparatus further includes a far-end call audio acquisition module configured to acquire, during a voice call between the first terminal and the second terminal through a call application, the far-end call audio data sent by the second terminal to the first terminal through the call application. The audio mixing module 1108 is further configured to mix the local audio data with the delayed audio data to obtain the mixed audio data, and send the mixed audio data to the second terminal through the call application for playing.
In one embodiment, the apparatus further includes a shared audio echo cancellation module configured to perform echo cancellation on the shared audio echo data included in the local audio data using the audio data to be shared, obtaining local audio data after shared audio echo cancellation; and to mix the local audio data after shared audio echo cancellation with the audio data to be shared, sending the resulting mixed audio data to the second terminal for playing based on the voice call.
The modules in the audio sharing device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing audio data to be shared. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio sharing method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in fig. 13. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The input/output interface of the computer device exchanges information between the processor and external devices. The communication interface of the computer device performs wired or wireless communication with external terminals; the wireless mode may be realized through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program is executed by a processor to implement an audio sharing method. The display unit of the computer device forms a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, a track ball, or a touch pad arranged on the shell of the computer device, or an external keyboard, touch pad, or mouse.
It will be appreciated by those skilled in the art that the structures shown in fig. 12 and 13 are block diagrams of only portions of structures associated with the present inventive arrangements and are not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM may take various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided herein may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (15)

1. An audio sharing method, the method comprising:
during a voice call between a first terminal and a second terminal, determining audio data to be shared of the first terminal; the audio data to be shared and the remote call audio data sent to the first terminal by the second terminal are respectively played at the local end of the first terminal through different local playing devices associated with the first terminal;
acquiring local audio data acquired at a local end of the first terminal; the local audio data comprise shared audio echo data of the audio data to be shared and near-end call audio data for carrying out voice call on the far-end call audio data;
Performing delay alignment on the audio data to be shared based on the local audio data to obtain delay audio data aligned with the shared audio echo data in a time dimension;
And mixing the local audio data with the delayed audio data, and transmitting the mixed audio data obtained by mixing to the second terminal for playing based on the voice call.
2. The method of claim 1, wherein the delay-aligning the audio data to be shared based on the local audio data to obtain delay audio data aligned with the shared audio echo data in a time dimension comprises:
determining delay time of the shared audio echo data relative to the audio data to be shared based on the local audio data and the audio data to be shared;
And aligning the audio data to be shared with the shared audio echo data in the time dimension according to the delay time length to obtain delay audio data aligned with the shared audio echo data.
3. The method of claim 2, wherein the determining a delay period of the shared audio echo data relative to the audio data to be shared based on the local audio data and the audio data to be shared comprises:
obtaining local audio detection data corresponding to the current time from the local audio data, and obtaining audio detection data to be shared corresponding to the current time from the audio data to be shared;
Determining a correlation parameter between the local audio detection data and the audio detection data to be shared;
and determining the delay time length of the shared audio echo data relative to the audio data to be shared according to the correlation parameter.
4. The method of claim 3, wherein the determining a correlation parameter between the local audio detection data and the audio detection data to be shared comprises:
Frequency domain conversion is respectively carried out on the local audio detection data and the audio detection data to be shared, so that local audio frequency domain data corresponding to the local audio detection data and audio frequency domain data to be shared corresponding to the audio detection data to be shared are obtained;
Based on a cross-correlation algorithm, obtaining correlation parameters between the local audio frequency domain data and the audio frequency domain data to be shared according to data points included in the local audio frequency domain data and the audio frequency domain data to be shared;
The determining, according to the correlation parameter, a delay duration of the shared audio echo data relative to the audio data to be shared includes:
Determining the number of delay data points of the local audio frequency domain data relative to the audio frequency domain data to be shared based on the correlation parameter;
and determining the delay time length of the shared audio echo data relative to the audio data to be shared according to the delay data point number.
5. The method of claim 3, wherein determining a delay time of the shared audio echo data relative to the audio data to be shared according to the correlation parameter comprises:
determining delay time length of the shared audio echo data relative to the current time of the audio data to be shared according to the correlation parameters;
acquiring historical delay time length of the shared audio echo data relative to the audio data to be shared;
and determining the delay time of the shared audio echo data relative to the audio data to be shared according to the delay time of the current moment and the historical delay time.
6. The method of claim 5, wherein the historical delay period comprises a previous time delay period corresponding to a time previous to the current time;
The determining the delay time of the shared audio echo data relative to the audio data to be shared according to the current time delay time and the historical delay time includes:
And smoothing the delay time of the current moment and the delay time of the previous moment according to a smoothing coefficient to obtain the delay time of the shared audio echo data relative to the audio data to be shared.
7. The method of claim 3, wherein the obtaining the local audio detection data corresponding to the current time from the local audio data and obtaining the audio detection data to be shared corresponding to the current time from the audio data to be shared includes:
When the delay detection triggering condition is met, downsampling is respectively carried out on the local audio data and the audio data to be shared, so as to obtain downsampled local audio data and downsampled audio data to be shared;
According to the data screening condition, screening out local audio detection data corresponding to the current moment from the down-sampled local audio data;
And screening the audio detection data to be shared corresponding to the current moment from the audio data to be shared after downsampling according to the data screening conditions.
8. The method of claim 2, wherein aligning the audio data to be shared with the shared audio echo data in the time dimension according to the delay duration to obtain the delayed audio data aligned with the shared audio echo data comprises:
caching the audio data to be shared, and determining the caching duration of the audio data to be shared;
and when the buffer time length reaches the delay time length, obtaining delay audio data aligned with the shared audio echo data in the time dimension.
9. The method of claim 1, wherein said local audio data further comprises call audio echo data for said remote call audio data; the method further comprises the steps of:
Performing echo cancellation processing on the call audio echo data contained in the local audio data through the far-end call audio data to obtain local audio data after call audio echo cancellation;
Performing delay alignment on the audio data to be shared based on the local audio data after the call audio echo cancellation to obtain delay audio data aligned with the shared audio echo data in a time dimension;
And mixing the local audio data after the call audio echo cancellation with the delay audio data, and sending the mixed audio data obtained by mixing to the second terminal for playing based on the voice call.
10. The method according to any one of claims 1 to 9, further comprising:
acquiring far-end call audio data sent to a first terminal by a second terminal through a call application in the process of carrying out voice call between the first terminal and the second terminal through the call application;
the step of mixing the local audio data with the delayed audio data and transmitting the mixed audio data obtained by mixing to the second terminal for playing based on the voice call includes:
Mixing the local audio data with the delay audio data to obtain mixed audio data;
and sending the mixed audio data to the second terminal for playing through the call application.
11. The method according to any one of claims 1 to 9, further comprising:
performing echo cancellation processing on the shared audio echo data included in the local audio data through the audio data to be shared to obtain the local audio data after the shared audio echo cancellation;
And mixing the local audio data after the sharing audio echo cancellation with the audio data to be shared, and sending the obtained mixed audio data to the second terminal for playing based on the voice call.
12. An audio sharing device, the device comprising:
The audio to be shared determining module is used for determining audio data to be shared of the first terminal in the process of voice communication between the first terminal and the second terminal; the audio data to be shared and the remote call audio data sent to the first terminal by the second terminal are respectively played at the local end of the first terminal through different local playing devices associated with the first terminal;
The local audio acquisition module is used for acquiring local audio data acquired at a local end of the first terminal; the local audio data comprise shared audio echo data of the audio data to be shared and near-end call audio data for carrying out voice call on the far-end call audio data;
The delay alignment module is used for carrying out delay alignment on the audio data to be shared based on the local audio data to obtain delay audio data aligned with the shared audio echo data in a time dimension;
And the audio mixing module is used for mixing the local audio data with the delayed audio data and sending the mixed audio data obtained by mixing to the second terminal for playing based on the voice call.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 11.
14. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202410298526.6A 2024-03-15 2024-03-15 Audio sharing method, device, computer equipment and storage medium Pending CN117896469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410298526.6A CN117896469A (en) 2024-03-15 2024-03-15 Audio sharing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117896469A (en) 2024-04-16

Family

ID=90652163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410298526.6A Pending CN117896469A (en) 2024-03-15 2024-03-15 Audio sharing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117896469A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination