CN113938746A - Network live broadcast audio processing method and device, equipment, medium and product thereof - Google Patents


Info

Publication number
CN113938746A
Authority
CN
China
Prior art keywords
audio data
signal
local
loopback
far
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111144000.5A
Other languages
Chinese (zh)
Other versions
CN113938746B (en)
Inventor
何鑫
苏嘉昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd filed Critical Guangzhou Huaduo Network Technology Co Ltd
Priority to CN202111144000.5A priority Critical patent/CN113938746B/en
Publication of CN113938746A publication Critical patent/CN113938746A/en
Application granted granted Critical
Publication of CN113938746B publication Critical patent/CN113938746B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to live network audio processing technology, and discloses a live network audio processing method together with a corresponding apparatus, device, medium, and product. The method comprises the following steps: acquiring far-end audio data while the live broadcast room is in a connected state, mixing the far-end audio data with local audio data into play-out audio data, and playing the play-out audio data; performing echo cancellation on the real-time audio data captured by the local machine, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo signal of the far-end audio data is cancelled and the echo signal of the local audio data is retained; delaying the local audio data by a local loopback delay value and mixing it with the intermediate audio data to obtain mixed audio data; and pushing a live stream containing the mixed audio data to the live broadcast room. The method effectively eliminates the double-talk phenomenon caused by multi-party connections during a live network broadcast, preserves call sound quality, and keeps the sounds of multiple sources synchronously aligned in the time domain.

Description

Network live broadcast audio processing method and device, equipment, medium and product thereof
Technical Field
The present application relates to a live webcast audio processing technology, and in particular, to a live webcast audio processing method, and a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.
Background
Echo cancellation is widely applied in voice communication: the transfer function of the echo path is estimated by an adaptive filtering algorithm, and the reference signal is filtered through that transfer function so that the echo can be subtracted. When both communicating parties talk at the same time (double talk), the linear relationship between the reference signal and the captured signal is destroyed, which disturbs the convergence of the adaptive filter coefficients; estimating the echo path by adaptive filtering under these conditions causes distortion, so the voice signal after echo cancellation suffers dropped words, stuttering, leaked echo, and similar defects.
In a live webcast scene, there are multiple sound sources. The anchor plays background music locally and pushes the music to the audience; if, in this scene, the anchor connects with a viewer, the anchor's device also pulls and plays the connected viewer's sound. The processing flow for this scenario is shown in fig. 1, where the dashed box "pull" indicates a process executed only while a connection is active.
In the prior art, the echo cancellation principle is as shown in fig. 2: during echo cancellation, the locally played sound serves as the reference signal. Because a music signal is continuous, the local playback never stops, so a double-talk condition arises whenever the anchor speaks, regardless of whether a stream from a connected viewer is being pulled. Because double talk occurs so frequently, the anchor's voice after echo cancellation suffers dropped words, leaked echo, and similar defects.
Fig. 3 shows measured results of echo cancellation using the prior art; from top to bottom the waveforms are the anchor signal, the music signal, the captured signal, and the echo-cancelled signal. The captured signal is a mixture of the echo of the anchor's speech captured by the terminal device and the room-reflected echo of the locally played music; the echo-cancelled signal is obtained by cancelling echo from the captured signal with the played music as the reference. The "captured signal" shows that whenever the anchor speaks and produces the "anchor signal", the capture is in a double-talk state. Comparing the "anchor signal" with the "echo-cancelled signal" reveals substantial signal loss, most notably dropped words, which cause stuttering. Leaked echo also appears in the "echo-cancelled signal", mainly around the 4-second mark of its time domain.
In view of this, a new technical scheme is needed to overcome the dropped words, stuttering, and echo leakage caused by frequent double talk when the anchor terminal locally plays music and pushes a stream during a live network broadcast.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a live network audio processing method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
In order to meet various purposes of the application, the following technical scheme is adopted in the application:
one of the objectives of the present application is to provide a live webcast audio processing method, which includes the following steps:
acquiring far-end audio data while the live broadcast room is in a connected state, mixing the far-end audio data with local audio data into play-out audio data, and playing the play-out audio data;
performing echo cancellation on the real-time audio data captured by the local machine, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo signal of the far-end audio data is cancelled and the echo signal of the local audio data is retained;
delaying the local audio data by a local loopback delay value and mixing it with the intermediate audio data to obtain mixed audio data;
and pushing a live stream containing the mixed audio data to the live broadcast room.
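Purely for illustration, the four steps above can be sketched per audio block as follows. The function names, the 0.5 mixing gains, and the stand-in `aec` callable are assumptions of this sketch, not part of the disclosure:

```python
import numpy as np

def process_frame(far_end, local_music, mic_capture, loopback_delay, aec):
    """One processing pass of the claimed method, per audio block."""
    # Step 1: mix far-end and local audio and play the result locally.
    play_out = 0.5 * far_end + 0.5 * local_music       # sent to the speaker

    # Step 2: cancel only the far-end echo from the capture; the echo of the
    # locally played music survives inside the intermediate audio.
    intermediate = aec(mic_capture, reference=far_end)

    # Step 3: delay the local audio by the loopback delay so it aligns with
    # its own echo inside the intermediate audio, then mix.
    delayed = np.concatenate([np.zeros(loopback_delay), local_music])
    mixed = intermediate + delayed[:len(intermediate)]

    # Step 4: `mixed` is what gets encoded into the pushed live stream.
    return play_out, mixed
```

With a trivial stand-in canceller `aec = lambda mic, reference: mic - reference`, the function shows how the local music is re-inserted at the measured lag.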
In a further embodiment, acquiring the far-end audio data while the live broadcast room is in a connected state, mixing it with the local audio data, and playing the result comprises the following steps:
acquiring the far-end live stream pushed by the server while the live broadcast room is in a connected state;
extracting the far-end audio data from the far-end live stream;
mixing the far-end audio data with the local audio data to obtain play-out audio data;
and converting the play-out audio data into a voice signal for playback.
In a further embodiment, performing echo cancellation on the real-time audio data captured by the local machine, using the far-end audio data as the reference signal, to obtain intermediate audio data comprises the following steps:
continuously collecting the real-time input voice signal from the local sound card to obtain real-time audio data;
applying a preset adaptive echo filtering algorithm, with the far-end audio data as the reference signal, to perform echo cancellation on the real-time audio data so as to cancel the echo signal corresponding to the far-end audio data;
and retaining, as the intermediate audio data, the echo signal corresponding to the local audio data in the real-time audio data.
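As an illustrative sketch of such an adaptive echo filtering algorithm, a normalized LMS (NLMS) filter is a common choice; the tap count, step size `mu`, and regularizer `eps` below are assumed values, and the patent does not prescribe any particular algorithm:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Normalized-LMS adaptive filter: estimates the echo path from the
    far-end reference `ref` to the capture `mic` and subtracts the predicted
    echo. The residual keeps everything NOT linearly related to `ref`,
    e.g. the echo of the locally played music and the anchor's own voice."""
    w = np.zeros(taps)            # adaptive estimate of the echo path
    x = np.zeros(taps)            # sliding window of recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x[1:] = x[:-1]
        x[0] = ref[n]
        e = mic[n] - w @ x        # residual = one intermediate audio sample
        w += mu * e * x / (x @ x + eps)   # NLMS coefficient update
        out[n] = e
    return out
```

Because only the far-end stream is fed in as `ref`, the local music echo passes through untouched, which is exactly the decoupling the method relies on.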
In an embodiment, delaying the local audio data by the local loopback delay value and mixing it with the intermediate audio data to obtain mixed audio data comprises the following steps:
obtaining the loopback delay value corresponding to the local device;
and controlling the local audio data to lag by the loopback delay value and mixing it with the intermediate audio data to obtain the mixed audio data.
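A minimal sketch of the lagged mix; the sample-domain buffers, the function name, and the equal-gain mix are assumptions of this illustration:

```python
import numpy as np

def mix_with_loopback_delay(local, intermediate, delay_samples, gain=0.5):
    """Delay the local audio (e.g. background music) by the measured loopback
    delay so that it lines up with its own echo still present in the
    intermediate audio, then mix the two streams sample by sample."""
    delayed = np.concatenate([np.zeros(delay_samples), local])
    delayed = delayed[:len(intermediate)]
    return gain * delayed + gain * intermediate
```

Because the retained music echo and the delayed clean music are now time-aligned, they superimpose rather than smear, which is what protects the music's sound quality in the pushed stream.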
In a further embodiment, obtaining the loopback delay value corresponding to the local device comprises the following steps:
injecting a preset loopback identification signal into the play-out audio data at a first moment, the loopback identification signal being a high-frequency signal outside the human hearing band and comprising a plurality of single-frequency signals arranged at equal intervals in the frequency domain;
detecting whether the loopback identification signal is present in the real-time audio data, and determining the second moment at which it is detected;
determining the loopback delay value from the difference between the first moment and the second moment;
and storing the loopback delay value for direct use later.
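The measurement can be sketched as follows, assuming a 48 kHz sample rate and marker tones at 18-19.5 kHz (assumed values; the disclosure only requires equally spaced single-frequency components outside the audible band). For brevity this sketch locates the second moment by cross-correlation rather than by the per-frame detection described in the next embodiment:

```python
import numpy as np

FS = 48000
# single-frequency markers near the top of the band, equally spaced (assumed)
MARKER_FREQS = [18000, 18500, 19000, 19500]

def make_marker(duration=0.05, fs=FS):
    """Build the inaudible multi-tone loopback identification signal."""
    t = np.arange(int(duration * fs)) / fs
    return sum(np.sin(2 * np.pi * f * t) for f in MARKER_FREQS) / len(MARKER_FREQS)

def estimate_loopback_delay(capture, fs=FS):
    """If `capture` starts at the first moment (the instant the marker was
    mixed into the play-out audio), the lag of the cross-correlation peak is
    the loopback delay in samples; divide by `fs` for seconds."""
    marker = make_marker(fs=fs)
    corr = np.correlate(capture, marker, mode="valid")
    return int(np.argmax(np.abs(corr)))
```

The delay value would then be cached, per the last step above, so later sessions on the same device can mix without re-measuring.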
In a deepened embodiment, detecting whether the loopback identification signal is present in the real-time audio data comprises the following steps:
tracking the noise signal of the real-time audio data along the time domain, and transforming it to the frequency domain to obtain the corresponding noise energy spectrum data;
locating the peak position of each frequency point from the voice energy spectrum data mapped from the voice frames of the real-time audio data;
calculating, for each voice frame, the probability that the loopback identification signal is present, from the voice energy and the noise energy at each frequency point;
and judging that the loopback identification signal is detected when the presence probabilities of several consecutive voice frames satisfy a preset condition.
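A sketch of the frame-wise detection under assumed parameters; the frame size, SNR threshold, noise-smoothing factor, and three-consecutive-frames condition are all illustrative choices, and the hard threshold stands in for the presence-probability test of the embodiment:

```python
import numpy as np

FS = 48000
FRAME = 1024
# FFT bins of the assumed marker tones (18-19.5 kHz)
MARKER_BINS = [int(f * FRAME / FS) for f in (18000, 18500, 19000, 19500)]

def detect_marker(frames, snr_threshold=8.0, min_consecutive=3, alpha=0.95):
    """Return the index of the first frame at which the loopback marker is
    judged present: each frame's power spectrum is compared with a
    recursively tracked noise spectrum, the marker bins must all clear the
    SNR threshold, and detection fires only after several consecutive hits."""
    noise = None
    hits = 0
    for i, frame in enumerate(frames):
        spec = np.abs(np.fft.rfft(frame * np.hanning(FRAME))) ** 2
        if noise is None:
            noise = spec.copy()               # initial noise-floor estimate
        present = all(spec[b] > snr_threshold * (noise[b] + 1e-12)
                      for b in MARKER_BINS)
        if not present:                       # adapt floor only when absent
            noise = alpha * noise + (1 - alpha) * spec
        hits = hits + 1 if present else 0
        if hits >= min_consecutive:
            return i - min_consecutive + 1
    return -1                                 # marker not detected
```

Freezing the noise tracker while the marker is present keeps the floor from rising toward the tones, which would otherwise suppress the detection.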
In an extended embodiment, before performing echo cancellation on the real-time audio data captured by the local machine with the far-end audio data as the reference signal, so as to cancel the echo signal corresponding to the far-end audio data and obtain the intermediate audio data, the method comprises the following step:
detecting the echo-cancellation switch of the live-broadcast-room user logged in on the local machine, and starting the subsequent steps only when the switch is in the activated state; otherwise, the subsequent steps are not executed.
Another object of this application is to provide a live network audio processing apparatus, comprising:
a stream playing module, configured to acquire far-end audio data while the live broadcast room is in a connected state, mix the far-end audio data with local audio data into play-out audio data, and play the play-out audio data;
an echo cancellation module, configured to perform echo cancellation on the real-time audio data captured by the local machine, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo signal of the far-end audio data is cancelled and the echo signal of the local audio data is retained;
a loopback correction module, configured to delay the local audio data by the local loopback delay value and mix it with the intermediate audio data to obtain mixed audio data;
and a live stream pushing module, configured to push a live stream containing the mixed audio data to the live broadcast room.
The computer device comprises a central processing unit and a memory, wherein the central processing unit is used for calling and running a computer program stored in the memory to execute the steps of the live network audio processing method.
A computer-readable storage medium, which stores a computer program implemented according to the live network audio processing method in the form of computer-readable instructions, and when the computer program is called by a computer, executes the steps included in the method.
A computer program product, provided to meet another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the live network audio processing method described in any embodiment of the present application.
Compared with the prior art, the application has the following advantages:
First, in the present application, echo cancellation is performed on the real-time audio data captured by the local machine while the live network connection is active. In the echo-cancellation stage the local audio data and the far-end audio data are decoupled: only the far-end audio data is used as the reference signal, so only the echo produced by locally playing out the far-end audio is cancelled, while the echo produced by playing the local audio is retained, yielding the intermediate audio data. The echo of the far-end audio data therefore cannot interfere with the local audio data and provoke "double talk" in the subsequent mixing stage, and the audio stream formed by mixing the intermediate audio data with the local audio data and pushed to the live broadcast room does not suffer dropped words, stuttering, or leaked echo from local sound sources.
Second, the application uses the loopback delay value of the local machine to delay-compensate the local audio data, mixing it with the intermediate audio data at the lag specified by the loopback delay value to obtain the mixed audio data. Because the echo signal corresponding to the local audio data still remains in the intermediate audio data, after delay compensation that echo is synchronously aligned and superimposed with the local audio data itself; it therefore causes no interference with the local audio data, or at least no perceptible interference, effectively protecting the sound quality of the local audio data.
In addition, the application suits application scenarios of an online-call nature, including but not limited to live network video and karaoke. In these scenarios, the audio stream produced by the processing of the application lets a listener hear multiple sound sources playing synchronously and fluently; from the listener's perspective, the singer's voice stays synchronized with the lyrics and the background music in the video, improving the user experience.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a general functional block diagram of a technical architecture for processing audio data at a terminal device, which is also applicable in various embodiments of the present application;
fig. 2 is a schematic flow chart illustrating a process of performing echo cancellation on real-time audio data at a terminal device in the prior art, in which both a music echo and a viewer echo are cancelled during the echo cancellation process;
fig. 3 is a waveform diagram of various voice signals obtained by actual measurement after processing a voice signal in a live network connection state by applying the prior art, wherein the waveform diagram is a main broadcast signal, a music signal, a collected signal and a signal after echo cancellation from top to bottom;
fig. 4 is a schematic flowchart of an exemplary embodiment of a live audio processing method according to the present application;
fig. 5 is a schematic flowchart illustrating a process of obtaining a far-end live stream and a mixed audio play of local audio data in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a process of performing echo cancellation on real-time audio data according to an embodiment of the present application;
fig. 7 is a schematic flowchart of performing delay compensation according to a loopback delay value in an embodiment of the present application;
fig. 8 is a schematic flow chart illustrating a process of obtaining a loopback delay value by itself in an embodiment of the present application;
fig. 9 is a schematic diagram of the basic principle of calculating the loopback delay value of the audio signal, where T1 to T4 represent different times, the acquisition thread is responsible for audio data acquisition, the playing thread is responsible for audio data conversion and playing, and the anchor indicates an external input sound source, which can correspondingly acquire real-time audio data;
FIG. 10 is a flow diagram illustrating a process of detecting a looped-back identification signal by tracking a noise signal in real-time audio data according to an embodiment of the present application;
fig. 11 is a flowchart illustrating a procedure of performing echo cancellation on real-time audio data at a terminal device according to the present application, wherein only audience echo is cancelled, but music echo is not cancelled during the echo cancellation procedure;
fig. 12 is a schematic block diagram of delay compensation of playing music of local audio data by applying a loopback delay value;
fig. 13 shows, from top to bottom, waveform diagrams before and after delay compensation is applied in the same scene using the technical scheme of the present application;
fig. 14 is a waveform diagram of various actually measured voice signals obtained after processing related voice signals by applying the technical scheme of the present application in a state that a main broadcasting user is not connected with other users, where the waveform diagram is, from top to bottom, a main broadcasting signal, a music signal, a signal after echo cancellation before improvement, and a signal after echo cancellation after improvement, respectively;
fig. 15 is a waveform diagram of various actually measured voice signals obtained after processing related voice signals by applying the technical scheme of the present application in a state that a main broadcasting user is connected with other users, where the waveform diagram includes, from top to bottom, a main broadcasting signal, an audience signal, a music signal, a signal after echo cancellation before improvement, and a signal after echo cancellation after improvement, respectively;
fig. 16 is a functional block diagram of an exemplary embodiment of a live audio processing device of the present application;
fig. 17 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any elements and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by names such as "server", "client", and "service node" is essentially an electronic device with the performance of a personal computer: a hardware device having the necessary components stated by the von Neumann principle, such as a central processing unit (comprising an arithmetic unit and a controller), memory, input devices, and output devices. A computer program is stored in the memory; the central processing unit calls the program from external storage into internal memory and runs it, executing the instructions in the program and interacting with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
Those skilled in the art will appreciate that, although the various methods of the present application are described based on the same concept so that they share content with one another, each may be performed independently unless otherwise specified. Likewise, every embodiment disclosed herein proceeds from the same inventive concept; therefore, identically expressed concepts, and concepts whose expressions differ only for convenience, should be understood equivalently.
Unless a mutual-exclusion relationship between related technical features is stated explicitly, the embodiments disclosed herein can be constructed flexibly by cross-combining those related technical features, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or remedy its deficiencies. Those skilled in the art will appreciate variations thereof.
The live network audio processing method of the present application can be applied to terminal devices in an offline environment, and to terminal devices in scenarios where a server supports instant communication, including but not limited to instant messaging, live network video, online customer service, karaoke, and other exemplary application scenarios; as a basic audio processing technology it has a wide range of application. The method can be programmed as a computer program product and deployed in a terminal device to run.
Referring to fig. 4 and fig. 1, in an exemplary embodiment of the present invention, a live audio processing method includes the following steps:
step S1100, acquiring far-end audio data in a live broadcast room connection state, mixing the far-end audio data with local audio data into external audio data, and playing:
In a live broadcast room of network video live broadcast, a connection can be started between two or more users. These users may all be anchor users, i.e., the owner users of their respective live room instances, or the anchor user of one live room instance may establish a connection with audience users in his live room. Taking the example where one anchor user initiates a multi-user connection with other anchor users and/or audience users in the live broadcast room to enter the multi-user connection state: for that anchor user, through the data communication links established with the other, opposite users, the live streams of the opposite parties are pulled from the server and played on the local terminal device; meanwhile, the anchor user's own live stream is pushed through the server to each opposite party, and is usually also synchronously pushed to the terminal devices of other audience users in the live broadcast room for playing. Therefore, for the anchor user, any live stream of another user obtained via server push over the data communication link is a far-end live stream, and the live stream pushed by the anchor user himself is a near-end live stream.
The live stream generally includes a video stream and an audio stream, and the two streams of media can be separated and extracted at the terminal device for decoding and outputting respectively. Correspondingly, the audio data extracted from the far-end live stream constitutes far-end audio data, and the audio data in the locally-generated near-end live stream is near-end audio data.
The far-end audio data is relatively simple: it is generated by processing and mixing the real-time audio data collected by a far-end user on that user's local terminal device, and when it reaches the local terminal device of the anchor user it can be regarded as a single sound source. When there are multiple connected users, there are theoretically multiple far-end sound sources, but in the present application, when echo cancellation is performed later, the multiple far-end connected users may also be regarded as one aggregated far-end sound source for centralized echo cancellation, which is not expanded here for the moment.
The near-end audio data is generated, for the terminal device of the anchor user, by mixing audio data produced by a plurality of sound sources. For example, when the local terminal device is playing background music, it forms a local playing sound source; the background music may be a local music file, or audio data streamed from a remote media server. In short, for the local terminal device, audio data corresponding to the local playing sound source, that is, local audio data, will be generated. In addition, if an input device such as the microphone of the local terminal device is in an operating state, the sound card of the local terminal device collects the externally input voice signal to generate corresponding audio data, that is, the real-time audio data continuously collected by the local terminal device. It can thus be seen that when the anchor user plays music and turns on the voice input device, the local terminal device obtains audio data corresponding to these two sound sources. If the anchor is in a connected state with a remote user, the remote user can also be regarded as at least one far-end sound source.
Referring to fig. 1, by starting two threads, namely a playing thread and an acquisition thread, the local terminal device handles the output and the collection of the sound signal respectively.
In the playing thread, the local terminal device calls up the audio data of the local playing sound source and the far-end audio data for mixing to obtain the external audio data, which is converted into a voice signal and then output. It can be understood that, when all the sound sources coexist, the played sound at least includes the background sound corresponding to the audio data of the local sound source and the sound corresponding to the far-end audio data of the far-end sound source. The process of converting the mixed audio data into a voice signal and outputting it is within the scope known to those skilled in the art, and is therefore omitted.
In the acquisition thread, the local terminal device continuously obtains real-time audio data collected by the sound card and converted from the external voice signal. When the external audio data is played through a loudspeaker such as a sound box, the real-time audio data may theoretically encapsulate various component signals, including the speaker's voice signal, the echo signal looped back after the local audio data is played, the echo signal looped back after the far-end audio data is played, and the like. According to conventional practice, the real-time audio data is subjected to echo cancellation processing and then mixed with the local audio data; the resulting mixed audio data is contained in a live stream as a near-end audio stream and pushed to each far-end user.
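As a toy illustration of the playing thread's mixing step (not the patent's actual implementation; function names, the float PCM frame format, and the hard-clipping strategy are all assumptions), the far-end and local audio data can be combined into the external audio data roughly as follows:

```python
def mix_frames(far_end, local, gain_far=1.0, gain_local=1.0):
    """Mix two equal-length PCM frames (lists of float samples in [-1, 1])."""
    assert len(far_end) == len(local)
    mixed = [gain_far * f + gain_local * l for f, l in zip(far_end, local)]
    # Simple hard clipping keeps samples in the valid range after summing.
    return [max(-1.0, min(1.0, s)) for s in mixed]

# Playing thread: external audio = mix(far-end, local), then sent to the speaker.
far_end_frame = [0.5, -0.5, 0.25, 0.0]
local_frame = [0.25, 0.25, -0.25, 0.0]
external_audio = mix_frames(far_end_frame, local_frame)
```

A real implementation would mix fixed-size frames in a callback driven by the audio device, but the per-sample addition shown here is the essential operation.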
The above process disclosed in the present application is mainly exemplified by the main broadcasting user side of the live broadcasting room, and it can be understood that other remote users in the connection state may also be adapted to the same, and both sides may be symmetrically arranged and all apply the technical solutions of the embodiments of the present application. In this regard, those skilled in the art will appreciate.
Step S1200, performing echo cancellation on the real-time audio data collected by the local device by using the far-end audio data as a reference signal, to obtain intermediate audio data, wherein an echo signal of the far-end audio data is cancelled and an echo signal of the local audio data is retained:
In the echo cancellation link of the acquisition thread shown in fig. 1, the real-time audio data is processed differently from the prior art. Mainly, in the present application, an adaptive filtering algorithm is applied with only the far-end audio data as the reference signal: echo cancellation is performed on the real-time audio data, the echo signal generated by loopback after the far-end audio data is played out on the local terminal device is correspondingly filtered out of the real-time audio data, and the echo signal generated by loopback of the local audio data is retained because the local audio data is not referenced.
The echo cancellation is realized by an adaptive filtering algorithm. Its basic principle is to perform parameter identification of the unknown echo channel with an adaptive filter: based on the correlation between the loudspeaker signal and the echoes it generates, an audio signal model of the audio data causing the echo is established to simulate the echo path; through the adjustment of the adaptive filter, the impulse response of the model approximates the real echo path, and the echo cancellation function is then realized by subtracting the estimated value from the voice signal received by the microphone, i.e., the real-time audio data. It can be seen that when the audio signal model is established to simulate the echo path, the chosen reference signal has a decisive influence on the correlation exploited by echo cancellation. In particular, if multiple echo signals are present in the real-time audio data, which of them are cancelled depends on the provided reference signal. In the present application, the far-end audio data is used as the reference signal, so ultimately the echo signal caused by the far-end audio data is eliminated. Under this principle, almost all existing adaptive filtering algorithms can be adapted to cancel the echo signal of the far-end audio data from the real-time audio data.
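To make the decisive role of the reference signal concrete, below is a minimal NLMS (normalized least-mean-squares) adaptive filter sketch in which only the far-end signal is supplied as the reference; it therefore estimates and subtracts only the echo correlated with the far-end signal, leaving any local-audio echo untouched. This is an illustrative toy under assumed names and parameters, not the algorithm mandated by the application:

```python
def nlms_cancel(mic, reference, taps=4, mu=0.5, eps=1e-8):
    """Subtract the reference-correlated echo from the mic signal (NLMS)."""
    w = [0.0] * taps                     # adaptive filter coefficients
    out = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples (zero-padded at the start).
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))   # estimated echo
        e = mic[n] - y                             # error = echo-cancelled output
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out

# Toy echo path: the mic picks up the reference delayed by one sample at half gain.
ref = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0] * 8
mic = [0.0] + [0.5 * r for r in ref[:-1]]
residual = nlms_cancel(mic, ref)
```

In this toy the residual shrinks toward zero as the filter converges on the far-end echo; a local-audio echo added to `mic` would pass through unchanged, since it is uncorrelated with the reference.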
The adaptive filtering algorithm can adopt any of various algorithms known in the prior art; it is only necessary to constrain the corresponding reference signal to be the voice signal of the far-end audio data.
Incidentally, in 2007, Valin, Jean-Marc proposed a scheme for adjusting the learning rate in frequency-domain echo cancellation, aimed at optimizing for double-talk:
Valin, Jean-Marc. "On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk." IEEE Transactions on Audio, Speech & Language Processing 15.3 (2007): 1030-1034.
In the above solution, Valin uses a dynamically calculated learning factor to handle double-talk. Specifically, when double-talk occurs, the correlation between the estimated echo signal and the error signal after adaptive filtering is small, so the adaptive learning factor is small and the adaptive filter coefficients are updated slowly. This approach avoids filter coefficient divergence under double-talk, and echo can be cancelled well once the adaptive filter coefficients have converged well; however, the effect is not ideal while the adaptive filter coefficients have not yet converged well. For example, in the live broadcast scenario mentioned in the present application, where local accompaniment playback produces the voice signal of the local playing sound source and the anchor's singing produces the speaker voice signal, double-talk is present almost constantly, so the filter coefficients update slowly before good convergence is reached, which can cause dropped words, stuttering, echo leakage, and the like.
After the technical solution of the present application is applied, with the far-end audio data as the reference signal, in the same scenario the filter converges rapidly, the double-talk situation is greatly improved, and dropped words, stuttering, echo leakage and the like no longer occur.
As for the far-end audio data, in some embodiments, when there are a plurality of far-end audio data corresponding to a plurality of online users, all the far-end audio data may be processed as the same far-end audio data.
After the echo cancellation processing in this step, intermediate audio data is obtained, in which the echo signal generated after the local audio data is played still remains. This is intended by the technical solution of the present application: the echo signal is to be retained into the mixing link of the acquisition thread and superimposed with the local audio data after delay processing, so as to protect the sound quality of the voice signal of the local audio data.
Step S1300, mixing the local audio data with the intermediate audio data after superimposing the local loopback delay value, to obtain mixed audio data:
Referring to fig. 1, in this step, the loopback delay value of the local terminal device, obtained in advance, may be used to perform delay compensation on the local audio data, so that the local audio data is mixed with the intermediate audio data in the time domain according to the loopback delay value. In this way, the voice signal encapsulated by the local audio data and the echo signal encapsulated by the intermediate audio data are synchronously aligned in the time domain, which prevents the echo signal of the local audio data, as far as possible, from causing undesirable interference to the local audio data, and protects the sound quality of the voice signal of the local audio data. After mixing, mixed audio data is obtained, which encapsulates the voice signal corresponding to the local audio data, the speaker's voice signal (if any), and the echo signal generated after the local audio data is played, all three aligned in the time domain.
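The delay-compensated mixing described here can be sketched as follows; treating the loopback delay as a whole number of samples and using lists of float samples are simplifying assumptions of this illustration:

```python
def delay_samples(signal, delay):
    """Lag a signal by `delay` samples, padding the front with silence."""
    return ([0.0] * delay + list(signal))[:len(signal)]

def mix_with_loopback_delay(local, intermediate, loopback_delay):
    """Mix delay-compensated local audio with the intermediate audio data."""
    lagged = delay_samples(local, loopback_delay)
    return [a + b for a, b in zip(lagged, intermediate)]

# Example: the intermediate data holds a half-amplitude echo of the local audio
# arriving 2 samples late; after compensation the two line up in the time domain.
local = [1.0, -1.0, 0.5, 0.0, 0.0, 0.0]
echo = delay_samples([0.5 * s for s in local], 2)   # what the mic looped back
mixed = mix_with_loopback_delay(local, echo, 2)
```

Because the lagged local signal and its retained echo are aligned, the echo reinforces rather than smears the local audio in the mix.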
The loopback delay value can be inherent data calculated in advance for the local terminal device and directly called, or it can be calculated by the present application. For the latter, corresponding embodiments will be given in subsequent examples to further enrich the inventive idea of the present application.
Step S1400, pushing the live stream containing the audio mixing audio data to the live broadcasting room:
after the audio mixing audio data is generated, the audio mixing audio data can be used as an audio stream in the live stream according to the processing logic of a live broadcast room at a terminal, and the audio mixing audio data is pushed to the online users in the online state and audience users in the live broadcast room. After receiving the live stream, the corresponding user extracts the audio stream from the live stream to play, so that the user can perceive that each sound source is synchronous in time domain.
In a further exemplary application scenario, in a network live broadcast room, the terminal device of the anchor user plays a song including its accompanying background music; while the background music is played, the corresponding lyrics are also displayed in the video stream of the terminal device. The anchor user in the live broadcast room holds a microphone and sings along with the lyrics, and the audio data corresponding to the background music, i.e., the local audio data, is not only played locally, but also transmitted to the server supporting the operation of the live broadcast room and pushed to the connected users and the audience users in the live broadcast room for listening. During this process, because a connected user speaks occasionally, corresponding far-end audio data is generated and mixed with the local audio data into the external audio data for playing.
The live stream pushed to the live broadcast room contains the mixed audio data generated by the acquisition thread of the local terminal device, transmitted as the audio stream within the live stream. Because the echo signal corresponding to the far-end audio data has been removed from the mixed audio data, and the echo signal caused by the local audio data is, after delay compensation, used to protect the voice signal of the local audio data, the audio stream of the live stream overcomes the double-talk problem: when the audio stream is played by a far-end user, the voice signals corresponding to all sound sources are synchronized; in the singing scenario, the anchor user's voice while singing is also synchronized with the subtitle lyrics in the video stream; and even if other connected users speak and generate far-end audio data, the audio stream pushed by the local terminal device will not, when played, exhibit dropped words, stuttering, echo leakage, and the like.
According to an optional implementation example of the exemplary embodiment, an echo cancellation function switch settable by the anchor user may be provided in the application program of the terminal device; by setting this switch, the user determines whether to enable the double-talk optimization of the technical solution of the present application. Accordingly, the state of the echo cancellation function switch of the user logged into the local live broadcast room is detected; when the state is activated, the subsequent steps of the present application are executed, otherwise they are not.
Through the principle disclosure of the exemplary embodiment, it can be understood that the technical solution of the present application can achieve the positive effects far superior to the prior art, including but not limited to the following aspects:
Firstly, in the present application, in the live network connection state, echo cancellation is performed on the real-time audio data collected by the local machine. In the echo cancellation link, the local audio data and the far-end audio data are decoupled: only the far-end audio data is selected as the reference signal in the echo cancellation process, so as to cancel, from the real-time audio data, the echo signal generated by the far-end audio data being played out locally, while the echo signal generated by playing the local audio data is retained to obtain the intermediate audio data. Therefore, the echo signal of the far-end audio data cannot interfere with the local audio data and cause "double-talk" in the subsequent mixing stage, and the audio stream formed by mixing the intermediate audio data with the local audio data and pushed to the live broadcast room will not exhibit dropped words, stuttering, or echo leakage from the local sound sources.
Secondly, the present application uses the loopback delay value corresponding to the local machine to perform delay compensation on the local audio data, so that it is mixed with the intermediate audio data at the lag specified by the loopback delay value, obtaining the mixed audio data. Because the echo signal corresponding to the local audio data still exists in the intermediate audio data, this echo signal is synchronously aligned and superimposed with the delay-compensated local audio data; consequently, it does not interfere with the local audio data, or at least produces no obvious interference, realizing effective protection of the sound quality of the local audio data.
In addition, the present application can be applied to application scenarios with the nature of online conversation, including but not limited to network video live broadcast, karaoke, and the like. In such scenarios, based on the audio stream obtained after the processing of the present application, a listener hears the multiple sound sources playing synchronously and fluently; from the listener's perspective, the singer's vocals are perceived as staying synchronized with the lyrics and the background music in the video, thereby improving user experience.
Referring to fig. 5, in a further embodiment, the step S1100 of obtaining the far-end audio data in the live broadcast connection state, mixing the far-end audio data with the local audio data, and playing the mixture includes the following steps:
step S1110, acquiring a remote live stream pushed by the server in the live broadcasting room connection state:
in an application scenario of live network video, a main broadcast user and other users start connection to enter a connection state, and both the main broadcast user and the other users can acquire live streams of the other users and transmit the live streams generated by both the main broadcast user and the other users to the other users under the support of a server supporting the operation of a live broadcast room.
Step S1120, extracting the far-end audio data from the far-end live stream:
for the local terminal device, the live stream needs to be parsed and output, so the remote live stream can be parsed in advance to obtain the video stream and the audio stream therein, and output correspondingly.
The audio stream encapsulates far-end audio data generated by a far-end user, and the far-end audio data can also be generated by applying the technical scheme of the application in the terminal equipment of the far-end user.
Step S1130, mixing the remote audio data with the local audio data to obtain external audio data:
As can be further understood in combination with the workflow of the playing thread in fig. 1, the far-end audio data is mixed with the local audio data corresponding to the background music being played by the local terminal device, so as to obtain the corresponding external audio data.
Step S1140, converting the external audio data into a voice signal, and playing:
and finally, converting the externally played audio data into a voice signal according to the traditional audio playing technology, and then playing the voice signal through a loudspeaker.
This embodiment further illustrates, through the disclosure of the network video live broadcast scenario, the various beneficial effects obtained by the present application. It can be understood that in a network video live broadcast scenario, the probability of double-talk among the communicating parties is higher; after the technical solution of the present application is applied, the adverse factors caused by double-talk are eliminated, so that the voice quality of each party's communication is further improved, the communication quality of multi-party communication is effectively guaranteed, and user experience is improved.
Referring to fig. 6, in a further embodiment, the step S1200 of performing echo cancellation on the locally acquired real-time audio data by using the far-end audio data as a reference signal to obtain intermediate audio data includes the following steps:
Step S1210, continuously collecting real-time input voice signals from a local sound card to obtain real-time audio data:
The collection of the voice signal of an externally input sound source is generally realized by the sound card of the local terminal device: the voice signal is continuously collected by the sound card, undergoes analog-to-digital conversion to form corresponding voice frames, and is assembled into corresponding audio data, as will be known to those skilled in the art.
Step S1220, applying a preset adaptive echo filtering algorithm, with the far-end audio data as a reference signal, and performing echo cancellation processing on the real-time audio data to cancel an echo signal corresponding to the far-end audio data:
Referring to the introduction of the exemplary embodiment of the present application, echo cancellation processing is applied to the real-time audio data using a preset adaptive echo filtering algorithm, for example a conventional AEC algorithm or one implemented with neural network models. In this processing, the far-end audio data in the far-end live stream is referenced as the reference signal for echo cancellation, and only the echo signal generated after the far-end audio data is played locally is cancelled. When multiple far-end live streams exist, the far-end audio signals of all the far-end live streams can be regarded as one and the same far-end audio signal for echo processing.
In a programming implementation, a unified algorithm is adopted. Owing to the adaptive mechanism, when no far-end audio data exists, the reference signal in the algorithm is 0, so the problem of cancelling far-end echo simply does not arise, which further embodies the advantage of adaptivity.
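A hedged sketch of the reference construction discussed above: multiple far-end streams are summed into one aggregated reference, and with no connected far-end user the reference degenerates to all zeros (the function name and frame format are assumptions of this illustration, not the patent's code):

```python
def build_reference(far_end_streams, frame_len):
    """Sum all far-end streams into one reference frame for echo cancellation."""
    if not far_end_streams:
        return [0.0] * frame_len      # no far end: the reference signal is 0
    ref = [0.0] * frame_len
    for stream in far_end_streams:
        for i, s in enumerate(stream[:frame_len]):
            ref[i] += s               # treat all far-end users as one source
    return ref

ref_two = build_reference([[0.2, 0.4], [0.1, -0.4]], 2)
ref_none = build_reference([], 2)
```

Feeding this single summed reference to the adaptive filter is what lets multiple connected users be handled as one "aggregated far-end sound source".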
Step S1230, retaining an echo signal corresponding to the local audio data in the real-time audio data as the intermediate audio data:
After the echo cancellation processing, in the obtained intermediate audio data, the echo signal corresponding to the local audio data in the real-time audio data is still retained; thus the hidden danger of double-talk is eliminated, and this echo signal can still be used to protect the sound quality of the local audio data.
This embodiment introduces the echo cancellation process in depth, highlighting the flexibility of applying the adaptive filtering algorithm: whether or not the current terminal device is in a connected state, under the action of the adaptive mechanism, it can be ensured that the audio stream pushed by the local terminal device attains better sound quality.
Referring to fig. 7, in an embodiment, in step S1300, the step of mixing the local audio data with the intermediate audio data after superimposing the local loopback delay value to obtain mixed audio data includes the following steps:
step S1310, obtaining a loopback delay value corresponding to the local device:
The loopback delay value of a terminal device is usually determined by its hardware, and the loopback delay caused by hardware accounts for a large proportion; therefore, the loopback delay value of a terminal device can be calculated in advance and directly called in this step.
Step S1320, controlling the local audio data to lag according to the loopback delay value and mixing it with the intermediate audio data, to obtain mixed audio data:
Referring to fig. 1, the local audio data is lagged in the time domain according to the loopback delay value and then mixed with the intermediate audio data obtained after the echo cancellation processing, whereby the mixed audio data is obtained.
This embodiment further discloses the superposition process of the loopback delay value and details the corresponding implementation scheme, facilitating concrete implementation by those skilled in the art.
Referring to fig. 8, in a further embodiment, the step S1310 of obtaining a loopback delay value corresponding to a local device includes the following steps:
step S2100, presetting a loopback identification signal to the outgoing audio data at a first time, where the loopback identification signal is a high-frequency signal outside a hearing band of a human ear, and includes a plurality of single-frequency signals, and each single-frequency signal is set at equal intervals in a frequency domain:
In order to detect the delay value generated by loopback after the local audio data is played out by the terminal device, namely the loopback delay value, a loopback identification signal is preset into the local audio data in the service logic of the playing thread; then, in the acquisition thread, the time taken for the echo signal formed by playing the loopback identification signal to appear in the real-time audio data is detected, so that this duration can be determined as the loopback delay value and used on the acquisition side to correct the audio data and realize the alignment of the audio data of multiple sound sources.
Because the loopback delay value mainly depends on the hardware performance of the terminal equipment, and the loopback delay value of the same terminal equipment is relatively fixed, after the loopback delay value is obtained, the loopback delay value can be stored locally in the terminal equipment and can be directly called subsequently.
The loopback identification signal is custom-defined by the present application and constructed as a signal with a certain regularity and uniqueness, so that it is easy to distinguish from the signal content corresponding to the local audio data. In the present application, the loopback identification signal is defined as a high-frequency signal outside the frequency band corresponding to the hearing range of the human ear. The frequencies the human ear can perceive lie between 20 Hz and 20000 Hz, with individual variation, and out-of-band signals are not perceptible to the human ear. Therefore, selecting a high-frequency signal outside the human hearing band as the loopback identification signal fully considers the requirement of sound quality control: it neither damages the voice information content of any sound source nor disturbs the user.
In a preferred embodiment of the present application, the loopback identification signal is configured to include a plurality of single frequency signals, and each single frequency signal is equally spaced in the frequency domain, for example, two adjacent single frequency signals may be spaced by one, two, or three sampling resolution units. By adopting the mode, the detection is relatively easy, the corresponding detection algorithm has high operation efficiency and occupies little time.
In an alternative embodiment, the single-frequency signals of the loopback identification signal may also be set at geometrically spaced intervals, and the loopback identification signal may similarly take other forms of structure, as long as it can subsequently be detected by a corresponding algorithm in the service logic of the acquisition side.
A more specific and alternative variant embodiment suitable for use in the present application for constructing the loop-back identification signal is given below:
firstly, constructing the loop-back identification signal:
In this embodiment, the loopback identification signal consists of several single-frequency signals at high frequency, for example three single-frequency signals, the frequency point of each lying outside the hearing band perceivable by the human ear, as follows:
ref0(t) = α0 · [ sin(2π·f0·t) + sin(2π·f1·t) + sin(2π·f2·t) ]
where f0, f1, and f2 are the frequencies of the three single-frequency signals of the loopback identification signal, with:
f1=f0+2Δf
f2=f0+4Δf
Δf is the frequency resolution; that is, with the current audio sampling rate Ts and the number of DFT (discrete Fourier transform) points being N:
Δf = Ts / N
α0 is the amplitude of the loopback identification signal; taking 16-bit quantized audio as an example, a reference value for α0 is 8192. The duration of the single-frequency signal is typically 200 ms.
In a preferred embodiment, considering the existence of the audio loop-back path, in order to better detect the added single-frequency signal during the detection at the acquisition side, a mute signal ref _ silence (t) may be added for a period of time, for example, 2 seconds, before the loop-back identification signal is added at the playback side, so as to delay the start of the detection at the acquisition side. To this end, in the service logic of the playing side, the loopback identification signal can be expressed as:
ref(t) = ref_silence(t) for 0 ≤ t < t_mute, and ref(t) = ref0(t - t_mute) for t ≥ t_mute, where t_mute is the duration of the mute segment (for example, 2 seconds).
Assuming the audio data of the local playing sound source to be played by the terminal device is render(t), render(t) and the loopback identification signal ref0(t) are mixed and played; the moment at which the addition of the loopback identification signal is completed during real-time playing is correspondingly recorded as the first time.
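Under example parameter choices (f0 = 21 kHz, sampling rate Ts = 48 kHz, N = 4096, a 2 s mute prefix and a 200 ms tone; these specific values are assumptions beyond the text's reference figures), the play-side signal ref(t) could be synthesized as:

```python
import math

Ts = 48000          # sampling rate (the text's Ts)
N = 4096            # DFT size
delta_f = Ts / N    # frequency resolution, here 11.71875 Hz
f0 = 21000.0        # base frequency chosen above the audible band (assumption)
f1 = f0 + 2 * delta_f
f2 = f0 + 4 * delta_f
alpha0 = 8192       # reference amplitude for 16-bit quantized audio

def loopback_id_signal(duration_s=0.2, mute_s=2.0):
    """ref(t): mute_s seconds of silence, then the three-tone marker."""
    silence = [0.0] * int(mute_s * Ts)
    n_tone = int(duration_s * Ts)
    tone = [alpha0 * (math.sin(2 * math.pi * f0 * n / Ts)
                      + math.sin(2 * math.pi * f1 * n / Ts)
                      + math.sin(2 * math.pi * f2 * n / Ts))
            for n in range(n_tone)]
    return silence + tone

sig = loopback_id_signal()
```

In use, this signal would be mixed sample-by-sample with render(t) in the playing thread; with 16-bit output, the mixed result would additionally need clipping to the valid sample range.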
Step S2200, detecting whether the loop back identification signal exists in the real-time audio data, and determining a second time when the loop back identification signal is detected:
Because the loopback identification signal constructed in the present application is distinctive and easier to detect, the loopback delay value can be determined efficiently and accurately. Consequently, after the mixed audio data obtained by the acquisition thread is converted into a voice signal and played, the synchronization and harmony between the different sound sources are clearly perceptible to the ear, and the sound quality is better.
In this embodiment, in order to detect the loopback identification signal, the local audio data is played out loud, so that after the acquisition thread collects the real-time audio data, the loopback identification signal is sought in the echo signal of the local audio data contained within the real-time audio data. Of course, as mentioned above, once the loopback delay value has been determined and stored, it may be called directly later without additional calculation.
In an extended embodiment, in order to improve the success rate of detecting the loopback identification signal, the implementation environment of the present application may be detected in advance, and specifically, the following steps may be performed:
1) detecting whether the terminal device is simultaneously in an audio data acquisition state and a play-out state, where the play-out state means that the local terminal device has enabled a loudspeaker (such as a speaker box) from which echo signals can conveniently be captured, rather than playing through a device such as an earphone;
2) detecting whether the sampling performed by the terminal device runs at a preset sampling rate, for example 44.1 kHz or 48 kHz;
3) detecting whether the terminal device has an audio loopback path corresponding to the local audio data; if the audio loopback path exists, the loopback identification signal is preset and then detected accordingly.
Through the above preliminary detection, when it is confirmed that all corresponding conditions are satisfied, namely that the terminal device is simultaneously in the acquisition state and the play-out state, samples at the preset sampling rate, and forms an audio loopback path precisely through the play-out, the process of detecting the loopback identification signal can be carried out.
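As a minimal sketch, the three pre-checks above can be combined into a single gate. The function and parameter names are illustrative, and the accepted sample rates are the examples given in the text:

```python
def loopback_detection_allowed(capturing, playing_out_loud, sample_rate_hz, has_loopback_path):
    """Gate for starting loopback-identification-signal detection:
    1) capture and speaker play-out are both active,
    2) sampling runs at a preset rate (44.1 kHz or 48 kHz in the example),
    3) an audio loopback path for the local audio data exists."""
    return (capturing and playing_out_loud
            and sample_rate_hz in (44100, 48000)
            and has_loopback_path)
```

Only when this gate passes would the playback side add the identification signal and the acquisition side begin detection.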
After the loopback identification signal is played out by the play-out thread along with the local audio data, the acquisition thread starts to detect the loopback identification signal in real time. It can be understood that detecting the loopback identification signal requires designing algorithms matched to its characteristics; although the specific implementations of such algorithms may vary, each must accurately identify the characteristics of the loopback identification signal and confirm its presence.
After the loop-back identification signal has been detected by means of these corresponding algorithms, the time at which it was detected can be determined and marked as the second moment.
Step S2300, determining the loopback delay value according to the difference between the first time and the second time:
As can be seen from fig. 9, the anchor's speaker sound is emitted at time T2 and collected at time T3. The first time (T1) is the moment the playback side adds the loopback identification signal to the local audio data to be output, and the second time (T3) is the moment the acquisition side detects the loopback identification signal. The interval between the moment the real-time audio data is collected (T3) and the moment the acquisition thread performs audio mixing (T4) is very small and can be ignored. Therefore, the difference between the second time T3 and the first time T1 is the actual time taken by the loopback identification signal to traverse the audio loopback path, and this difference may be determined as the loopback delay value.
Step S2400, storing the loopback delay value for subsequent direct call:
After the loopback delay value is obtained, it is stored in the local terminal device for subsequent calls.
In this embodiment, a configuration manner of the loopback identification signal and various alternative embodiments thereof are disclosed in depth by way of example. The loopback identification signal is defined by constructing a plurality of single-frequency signals arranged at equal intervals in the frequency domain, so that it presents a more distinctive characteristic and is easier to identify. Because the loopback identification signal lies outside the band perceptible to the human ear, it causes no audible interference with the sound of the sound sources. Moreover, the configuration fully considers the requirements of audio loopback: a mute signal is prepended to the loopback identification signal so that the single-frequency tones lag behind it, giving the acquisition side sufficient time to wait for the loopback identification signal to appear. This further improves the detection success rate and allows the loopback delay value to be determined efficiently, guiding the acquisition thread to mix the local audio data and its echo signal in alignment and thereby safeguarding the sound quality of the local audio data.
Referring to fig. 10, in a further embodiment, the step S2200 of detecting whether the loop back identification signal exists in the real-time audio data includes the following steps:
step S2210, tracking a noise signal of the real-time audio data along a time domain, transforming the noise signal to a frequency domain, and obtaining corresponding noise energy spectrum data:
As mentioned above, the loopback identification signal is configured as a high-frequency signal and is therefore susceptible to interference from high-frequency noise. To prevent high-frequency noise from adversely affecting its detection, the noise signal in the real-time audio data must be tracked on the acquisition side. Those skilled in the art can realize noise tracking by means of various common algorithms; in the present application an MCRA-family algorithm, especially the IMCRA algorithm, is recommended for per-frequency-point noise tracking. The IMCRA algorithm, which tracks the minimum value of a frequency point over the time domain, is known to those skilled in the art.
So that the energy of the noise signal can be referred to later, the noise signal in the real-time audio data needs to be transformed into the frequency domain to obtain corresponding noise energy spectrum data. The common approach to calculating frequency-domain energy uses the FFT (fast Fourier transform), which computes the energy values of all frequency points across the whole frequency domain; this embodiment can therefore apply the FFT to obtain the noise energy spectrum data corresponding to the noise signal.
In an embodiment optimized on this basis, considering that the loopback identification signal is flexibly customized in the foregoing embodiment and contains only a few single-frequency signals, only the energy values of the frequency points added on the playing side need to be calculated, for example the 3 frequency points disclosed above. In this case the FFT is needlessly expensive, its complexity being O(N log N). Therefore, in an optimized embodiment, the Goertzel algorithm is recommended for calculating the frequency-domain energy; its complexity is O(N), a substantial reduction. The estimated frequency-point noise is denoted λ(f_i), i = 0, 1, 2.
It will be appreciated from the disclosure herein that one skilled in the art may transform the noise signal from the time domain to the frequency domain in a variety of ways to obtain corresponding noise energy spectral data.
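A sketch of the Goertzel recurrence for a single frequency point, as recommended above; the function name and the choice of frame length are illustrative:

```python
import math

def goertzel_power(samples, sample_rate, freq):
    """Energy of a single frequency bin via the Goertzel recurrence, O(N)."""
    n = len(samples)
    k = round(n * freq / sample_rate)      # nearest DFT bin index
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # squared magnitude of the DFT bin
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2
```

Evaluating this only at the three detection frequencies costs O(N) per bin, versus O(N log N) for a full FFT of each speech frame.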
Step S2220, according to the voice energy spectrum data mapped by the voice frame of the real-time audio data, positioning the peak position of each frequency point:
accordingly, each speech frame in the real-time audio data is also transformed from the time domain to the frequency domain using a fourier transform algorithm to obtain speech energy spectrum data for determining its corresponding energy value. On the basis, the loopback recognition signal can be detected according to the noise energy spectrum data and the voice energy spectrum data.
Adapted to the loopback identification signal of the foregoing exemplary configuration, when ref_s0(t) is added on the playback side, the acquisition side starts detection. For the speech energy spectrum data mapped from a speech frame of the real-time audio data, the acquisition side first judges whether the current frequency point is a peak, the peak indicator being denoted P_peak(i):
P_peak(i) = 1, if E(f_i) > E(f_i − Δf) and E(f_i) > E(f_i + Δf); P_peak(i) = 0, otherwise.
Here E(f_i) is the energy value of the current frequency point calculated by the Goertzel algorithm; because the added loopback identification signals are spaced Δf apart, the energy at the currently detected frequency should be a peak relative to the frequency points Δf above and below it.
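Using a Goertzel-style energy estimator as a black box, the peak test P_peak(i) can be sketched as follows; `frame_energy` is a hypothetical callable bound to one speech frame, not an interface from the patent:

```python
def p_peak(frame_energy, freq_hz, delta_f_hz):
    """P_peak = 1 when the energy at the detection frequency exceeds the
    energies measured delta_f below and above it, else 0."""
    e = frame_energy(freq_hz)
    if e > frame_energy(freq_hz - delta_f_hz) and e > frame_energy(freq_hz + delta_f_hz):
        return 1
    return 0
```

In practice the same estimator would be evaluated at each of the three detection frequencies and at their ±Δf neighbours.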
Step S2230, calculating the existence probability of the loopback identification signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point:
The existence probability P_f(i) of the loopback identification signal is calculated from the energy of each frequency point and the energy of the background noise using the following formula:
(equation image not reproduced: P_f(i) increases with the frequency-point signal-to-noise ratio log E(f_i) − log λ(f_i) and saturates at 1 once E(f_i) reaches the threshold E_max(f_i))
Here E_max denotes the energy threshold at which the loopback identification signal is considered present in the current frame; that is, when the energy at a frequency point exceeds this preset threshold, the corresponding speech frame is judged with high confidence to contain the loopback identification signal. The threshold is generally taken as E_max(f_i) = 1.8 * E(f_i) and can be adjusted flexibly according to the actual situation. The term log E(f_i) − log λ(f_i) can be understood as the signal-to-noise ratio of the current frequency point.
When a frame satisfies P_peak(i) = 1, the loopback identification signal may be present; by combining the above two features, the existence probability of the loopback identification signal in the current frame is obtained:
P = Σ_{i=0..2} W_f(i) · P_peak(i) · P_f(i)
Here W_f(i) is the weight of each frequency point, subject to the requirement that Σ_{i=0..2} W_f(i) = 1. Generally W_f(i) = 1/3, i = 0, 1, 2, is taken. In particular, if a given device captures weak signals at high frequencies, the weights at those frequencies can be adjusted appropriately.
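The exact form of P_f(i) is given only in the patent's equation image, so the sketch below is one plausible realization consistent with the surrounding text (certainty above the threshold E_max, otherwise the log-domain SNR scaled into [0, 1]), combined with the peak indicator and weights W_f(i) that sum to 1. Treat the P_f formula here as an assumption:

```python
import math

def p_freq(e, noise, e_max):
    """Plausible P_f: 1 above the energy threshold; otherwise the bin SNR
    log E - log lambda, normalised by the threshold's SNR (assumption)."""
    if e >= e_max:
        return 1.0
    snr = math.log(e) - math.log(noise)
    ref = math.log(e_max) - math.log(noise)
    return max(0.0, min(1.0, snr / ref))

def frame_probability(energies, noises, e_maxes, peaks, weights=(1/3, 1/3, 1/3)):
    """Per-frame presence probability: weighted sum over the three frequency
    points of P_peak(i) * P_f(i), with the weights summing to 1."""
    return sum(w * pk * p_freq(e, lam, em)
               for w, pk, e, lam, em in zip(weights, peaks, energies, noises, e_maxes))
```

A frame whose three bins are all confident peaks thus yields a probability of 1, while a frame with no peaks yields 0 regardless of energy.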
Step S2240, when the existence probability of a plurality of continuous voice frames meets a preset condition, judging to detect the loopback identification signal:
In an exemplary embodiment, the playback side keeps the loopback identification signal for 200 ms in the time domain, so a time-domain factor is introduced during detection on the acquisition side, and the Inter-frame Probability Persistence (IPP) is defined as follows:
IPP = (1/T) · Σ_{j=0..T−1} P(n − j), i.e. the average of the presence probabilities P over the most recent T speech frames.
In the formula, T is the number of frames over which the inter-frame probability is maintained; for example, when T = 3 and IPP = 1, or when T = 6 and IPP ≥ 0.8, it is determined that the loopback identification signal is detected. The interval between the time the loopback identification signal is first detected, i.e. the second time, and the time ref_s0(t) was first added, i.e. the first time, is then the audio loopback delay.
Generally speaking, once the loopback identification signal has been detected, the playback side stops adding it; under ideal measurement conditions, the playback side needs to add only 3 frames of the loopback identification signal. With the exemplary algorithm described above, the measured detection error is within 20 ms.
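A minimal sketch of the IPP decision over a sliding window of per-frame probabilities; the window length and 0.8 threshold follow the exemplary figures above and are not fixed by the method:

```python
def ipp(frame_probs, t):
    """Inter-frame Probability Persistence: the mean of the most recent t
    per-frame presence probabilities."""
    window = frame_probs[-t:]
    return sum(window) / t

def detected(frame_probs, t=6, threshold=0.8):
    """Declare detection once enough frames are seen and IPP clears the
    threshold (exemplary values from the text)."""
    return len(frame_probs) >= t and ipp(frame_probs, t) >= threshold
```

The first frame for which `detected` becomes true marks the second time used in the delay calculation.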
This embodiment shows that detection of the loopback identification signal is carried out on the basis of noise energy and speech energy: the multiple single-frequency signals are detected by means of probability estimation, and inter-frame persistence is taken into account so that the detection is confirmed across several consecutive speech frames.
In a further extended embodiment, when the detection of the loopback identification signal has been attempted a preset number of times and every attempt has ended in failure, the following step may be performed: in response to the detection-failure event, reconstruct the loopback identification signal and start a second round of detection, where the reconstructed loopback identification signal is still a high-frequency signal outside the audible band of the human ear, but its frequencies are lower than those of the previously constructed loopback identification signal.
The frequencies of the reconstructed loopback identification signal are kept below the frequency points of the previous one mainly for a hardware reason: the energy that current terminal devices capture at very high frequencies is weak. For this reason, the frequency, and even the amplitude, of the loopback identification signal must be adjusted appropriately so that it can be detected by the service logic on the acquisition side.
As an example, the reconstructed loop back identification signal may be expressed as follows:
ref1(t) = Σ_{i=0..2} α1 · sin(2π · f_1i · t)
wherein:
f_1i = f_10 + i · Δf, i = 0, 1, 2
In general, α1 > α0 and f_1i < f_i. In practice, α1 = 10922 is generally taken, and the frequencies f_1i are around 19 kHz.
After reconstructing the new loopback identification signal, the detection procedure may be re-executed according to the procedure disclosed in the foregoing related embodiments, so as to successfully detect the new loopback identification signal by changing the loopback identification signal, thereby successfully determining the loopback delay value.
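The fallback strategy can be sketched as follows; `run_detection` is a hypothetical callable that returns a measured delay (for example in milliseconds) or None on failure, and the parameter values echo the examples above (a larger amplitude of 10922 and frequencies near 19 kHz for the retry):

```python
def measure_delay_with_fallback(run_detection, max_tries=3):
    """Try detection with the original high-frequency tones; after max_tries
    failures, rebuild the identification signal with lower frequencies and a
    larger amplitude, then try once more."""
    primary = {"base_freq_hz": 20000.0, "amplitude": 3641}    # assumed alpha_0
    fallback = {"base_freq_hz": 19000.0, "amplitude": 10922}  # alpha_1 > alpha_0
    for _ in range(max_tries):
        delay_ms = run_detection(primary)
        if delay_ms is not None:
            return delay_ms
    return run_detection(fallback)
```

Lowering the frequencies trades some inaudibility margin for capture energy on hardware whose microphones roll off near the Nyquist limit.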
In order to more fully reveal the advantages of the technical solution of the present application, a related description of the beneficial effects of the technical solution of the present application is continued:
firstly, experimental data obtained by using the loopback delay value calculation scheme provided by the application:
In a practical application scenario, the audio loopback delay is calculated in real time without any additional operation by the user, which solves the problem of aligning the voice and the accompaniment in remote karaoke and live-singing scenarios and improves the user experience of network live broadcasting. Once the technical solution of the present application calculates the audio loopback delay value in real time, the voice and the accompaniment can be aligned much more strictly, surpassing the experience of other similar products. The following table compares the audio loopback delay values calculated 5 times by the technical solution of the present application on 4 mobile terminal devices with manual measurement results:
(table image not reproduced: for each of the 4 mobile terminal devices, the 5 calculated audio loopback delay values alongside the manually measured delay)
In the table, the Android models 0P × R × N × 3 and V × xpI × 6 were both subjected to capture and playback tests through the Java APIs under the media-volume channel. As can be seen from the table, the detection error of the present application is controlled within 20 ms.
Secondly, the advantages obtained by applying the echo cancellation technique of the present application are explained as follows:
In the relevant application scenario provided by the present application, the final signal to be streamed by the anchor user is the anchor's voice plus the locally played music after delay compensation. The processing of the anchor's voice and of the music is therefore decoupled, each handled separately to ensure the best quality of both, implemented as follows:
1) protecting the anchor sound:
If the anchor's voice forms double-talk with the playing music, the sound quality suffers. To prevent double-talk from occurring frequently, the reference signal for echo cancellation in the present application does not contain the locally played music, i.e. the local audio data. Double-talk can then occur only when a connected user and the anchor speak simultaneously, which safeguards the convergence of the adaptive filter. The optimized processing flow of the acquisition thread is shown in fig. 11.
Referring to fig. 11, in one sub-scenario the anchor user pushes a live stream while playing background music on the local terminal device without being connected to any user; the played signal is then only the music signal. In this scenario, the echo cancellation reference signal of the present application is 0, which is equivalent to echo cancellation not operating, so the anchor's voice suffers no loss. When the anchor user pushes a live stream, plays background music, and is connected with other users, the played signal contains both the music signal and the connected users' voices, i.e. the far-end audio data.
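The reference-signal policy described above amounts to a simple selection rule; this sketch uses illustrative names rather than the patent's actual interfaces:

```python
def aec_reference(far_end_frame, local_music_frame, connected):
    """Reference for echo cancellation: only the far-end (connected users')
    audio. The locally played music is deliberately excluded so that its echo
    survives cancellation; with no connected user the reference is silence and
    the canceller effectively passes the capture through."""
    del local_music_frame  # never part of the reference in this scheme
    if connected:
        return far_end_frame
    return [0.0] * len(far_end_frame)
```

Feeding this reference to the adaptive filter means double-talk can only arise between the anchor and a connected user, as the text notes.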
2) Protecting the music signal:
After the present scheme is adopted, the speech signal in the intermediate audio data after echo cancellation comprises the anchor's voice and the music echo. Owing to spatial reflection, the music echo is distorted relative to the original signal, and this distortion can be compensated by mixing the music echo with the played music signal (the local audio data). The processing flow is shown in fig. 12.
With reference to fig. 12, because the terminal device both plays the music and collects it, an audio loopback delay exists. If the delay-compensation error exceeds 50 ms, the audience will hear two copies of the music after mixing, which seriously degrades the listening experience. In the unconnected state, the mixing combines only the anchor's voice and the music, so the error requirement for delay compensation is not sensitive; in the connected state, the delay compensation requires an error of less than 50 ms. Referring to fig. 13, after applying the technical solution of the present application for delay compensation, the difference between the compensated mixed music signal and the original music signal is small, so the music quality of the pushed stream can be ensured.
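A minimal sketch of the delay-compensated mixing step, assuming the measured loopback delay has already been converted to a sample count:

```python
def delay_compensated_mix(local_music, intermediate, delay_samples):
    """Lag the locally played music by the measured loopback delay, then mix
    it with the echo-cancelled capture so both copies of the music align."""
    delayed = [0.0] * delay_samples + list(local_music)
    n = len(intermediate)
    delayed = (delayed + [0.0] * n)[:n]   # pad/trim to the capture length
    return [a + b for a, b in zip(delayed, intermediate)]
```

If `delay_samples` is accurate to within the 50 ms bound stated above, the delayed local music lines up with its echo in the intermediate audio data and reinforces rather than doubles it.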
Third, description of experimental data compared to the prior art:
When the anchor side is broadcasting live, playing music, and not connected to any other user, the comparison before and after anchor-voice processing is shown in fig. 14. Fig. 14 shows, from top to bottom, the anchor signal, the music signal, the signal after echo cancellation before the improvement of this embodiment, and the signal after echo cancellation with the improvement applied. The comparison shows that before the improvement, cancelling the echo of the anchor signal causes obvious word dropping, whereas the improved speech signal is consistent with the original anchor signal, so no speech is lost during processing.
When the anchor side is broadcasting live, playing music, and connected with other users, the comparison before and after anchor-voice processing is shown in fig. 15. Fig. 15 shows, from top to bottom, the anchor signal, the audience signal, the music signal, the signal after echo cancellation before the improvement of this embodiment, and the signal after echo cancellation with the improvement applied. Here, the pre-improvement echo-cancellation reference signal is the superposition of the music signal and the echo of the audience signal. As can be seen from the figure, the pre-improvement signal drops a word at 6 s, losing audio, and a significant residual echo appears at 8 s; the improved echo-cancelled signal quality is clearly better than before the improvement.
Referring to fig. 16, a webcast audio processing apparatus provided by the present application, adapted for functional deployment of the webcast audio processing method of the present application, comprises: a pull-stream playing module 1100, an echo cancellation module 1200, a loopback correction module 1300, and a live push-stream module 1400. The pull-stream playing module 1100 is used for acquiring far-end audio data in the live-room connected state, mixing it with the local audio data into external audio data, and playing it; the echo cancellation module 1200 is configured to perform echo cancellation on the real-time audio data collected by the local device, using the far-end audio data as the reference signal, to obtain intermediate audio data in which the echo signal of the far-end audio data is cancelled and the echo signal of the local audio data is retained; the loopback correction module 1300 is configured to mix the local audio data, lagged by the local loopback delay value, with the intermediate audio data to obtain mixed audio data; the live push-stream module 1400 is configured to push a live stream containing the mixed audio data to the live room.
In a further embodiment, the pull-stream playing module 1100 comprises: a connection pull sub-module for acquiring the far-end live stream pushed by the server in the live-room connected state; an audio extraction sub-module for extracting the far-end audio data from the far-end live stream; a multi-source mixing sub-module for mixing the far-end audio data with the local audio data to obtain external audio data; and a play output sub-module for converting the external audio data into a speech signal for playing.
In a further embodiment, the echo cancellation module 1200 comprises: the real-time acquisition submodule is used for continuously acquiring real-time input voice signals from the local sound card to obtain real-time audio data; the echo filtering submodule is used for applying a preset self-adaptive echo filtering algorithm, taking the far-end audio data as a reference signal and carrying out echo cancellation processing on the real-time audio data so as to cancel echo signals corresponding to the far-end audio data; and the middle acquisition submodule is used for reserving an echo signal corresponding to the local audio data in the real-time audio data as the middle audio data.
In an embodied embodiment, the loopback correction module 1300 comprises: a delay calculation sub-module for acquiring the loopback delay value corresponding to the local device; and a delay compensation sub-module for controlling the local audio data, lagged by the loopback delay value, to be mixed with the intermediate audio data to obtain mixed audio data.
In a further embodiment, the delay calculation sub-module comprises: the loopback presetting unit is used for presetting loopback identification signals to the outgoing audio data at a first moment, wherein the loopback identification signals are high-frequency signals outside a human ear hearing frequency band and comprise a plurality of single-frequency signals, and the single-frequency signals are arranged at equal intervals in a frequency domain; the loopback detection unit is used for detecting whether the loopback identification signal exists in the real-time audio data or not and determining a second moment when the loopback identification signal is detected; the loopback calculation unit is used for determining the loopback delay value according to the difference value between the first time and the second time; and the loopback storage unit is used for storing the loopback delay value for subsequent direct calling.
In a deepened embodiment, the loopback detection unit includes: the noise tracking subunit is used for tracking a noise signal of the real-time audio data along a time domain, transforming the noise signal to a frequency domain and obtaining corresponding noise energy spectrum data; the peak positioning subunit is used for positioning the peak position of each frequency point according to the voice energy spectrum data mapped by the voice frame of the real-time audio data; the probability estimation subunit is used for calculating the existence probability of the loopback identification signal in each voice frame according to the voice energy and the noise energy corresponding to each frequency point; and the signal detection subunit is used for judging to detect the loopback identification signal when the existence probability of a plurality of continuous voice frames meets a preset condition.
In an extended embodiment, the apparatus further comprises, preceding the echo cancellation module 1200: a state detection module for detecting the echo-cancellation function switch of the user logged into the local live room; when the switch is in the activated state the echo cancellation module 1200 is driven to work, and otherwise the work of the echo cancellation module 1200 is terminated.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. As shown in fig. 17, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions, when executed by the processor, can make the processor implement a live network audio processing method. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may store computer readable instructions, and when executed by the processor, the computer readable instructions may cause the processor to execute the live network audio processing method of the present application. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 17 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 16, and the memory stores program codes and various data required for executing the modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data required for executing all modules/sub-modules in the webcast audio processing device of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.
The present application also provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the live audio processing method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the live audio processing method according to any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, in the connected state the present application decouples the played music from the audience sound: the reference signal for echo cancellation contains only the audience sound, which avoids frequent double-talk, while the music echo is mixed after local music delay compensation to protect the music quality, thereby improving the played-out audio effect in the connected state.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for those skilled in the art, several modifications and decorations can be made without departing from the principle of the present application, and these modifications and decorations should also be regarded as the protection scope of the present application.

Claims (10)

1. A live network audio processing method is characterized by comprising the following steps:
acquiring far-end audio data in a live broadcast room connection state, mixing the far-end audio data with local audio data into external audio data, and playing the external audio data;
echo cancellation is carried out on the real-time audio data collected by the local machine by taking the far-end audio data as a reference signal, so as to obtain intermediate audio data, wherein the echo signal of the far-end audio data is cancelled, and the echo signal of the local audio data is reserved;
mixing the local audio data with the intermediate audio data after the local audio data is superposed with a local loopback delay value to obtain mixed audio data;
and pushing the live broadcast stream containing the audio mixing audio data to the live broadcast room.
2. The live network audio processing method according to claim 1, wherein acquiring the far-end audio data in the live-room connected state, mixing it with the local audio data, and playing it comprises the following steps:
acquiring a remote live broadcast stream pushed by a server in a live broadcast room connection state;
extracting remote audio data from the remote live stream;
mixing the far-end audio data with the local audio data to obtain external audio data;
and converting the voice signal according to the external audio data to play.
3. The live network audio processing method according to claim 1, wherein performing echo cancellation on the locally collected real-time audio data with the far-end audio data as a reference signal to obtain the intermediate audio data comprises the following steps:
continuously collecting the voice signal input in real time from the local sound card to obtain the real-time audio data;
applying a preset adaptive echo filtering algorithm, with the far-end audio data as the reference signal, to perform echo cancellation on the real-time audio data so as to cancel the echo signal corresponding to the far-end audio data;
and retaining the echo signal corresponding to the local audio data in the real-time audio data as the intermediate audio data.
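The claim leaves the "preset adaptive echo filtering algorithm" open; one common choice is a normalized LMS (NLMS) adaptive filter. A minimal pure-Python sketch, assuming a linear echo path (function name and parameters are illustrative assumptions):

```python
import math

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-6):
    """Normalized LMS adaptive filter: estimate the echo path from the
    far-end reference and subtract the estimated echo from the mic signal."""
    w = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))      # estimated echo
        e = mic[n] - y                                # echo-cancelled sample
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        residual.append(e)
    return residual

# Demo: the microphone hears only a delayed, attenuated echo of the
# far-end reference, so the residual should decay toward zero.
ref = [math.sin(0.05 * n) for n in range(3000)]
mic = [0.0] * 3 + [0.8 * r for r in ref[:-3]]
residual = nlms_echo_cancel(mic, ref)
```

Because the filter adapts only toward the far-end reference, any signal uncorrelated with it, such as the local audio's echo, passes through in the residual, which is exactly the behavior claim 3 requires.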
4. The live network audio processing method according to claim 1, wherein delaying the local audio data by the local loopback delay value and mixing it with the intermediate audio data to obtain the mixed audio data comprises the following steps:
obtaining the loopback delay value corresponding to the local device;
and lagging the local audio data by the loopback delay value and mixing it with the intermediate audio data to obtain the mixed audio data.
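The lagged mix in this claim can be sketched as follows; the function name and the zero-padding policy are assumptions for illustration:

```python
def mix_with_delay(local, intermediate, delay_samples):
    """Lag the local audio by the measured loopback delay so that it
    lines up with its own echo retained in the intermediate data,
    then mix the two streams sample by sample."""
    lagged = [0.0] * delay_samples + list(local)
    n = max(len(lagged), len(intermediate))
    lagged = lagged + [0.0] * (n - len(lagged))
    inter = list(intermediate) + [0.0] * (n - len(intermediate))
    return [a + b for a, b in zip(lagged, inter)]
```

For example, `mix_with_delay([1.0, 1.0], [0.5, 0.5, 0.5, 0.5], 2)` lags the local audio by two samples before summing, yielding `[0.5, 0.5, 1.5, 1.5]`.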
5. The live network audio processing method according to claim 4, wherein obtaining the loopback delay value corresponding to the local device comprises the following steps:
injecting a preset loopback identification signal into the outgoing audio data at a first time instant, wherein the loopback identification signal is a high-frequency signal outside the hearing band of the human ear and comprises a plurality of single-frequency signals equally spaced in the frequency domain;
detecting whether the loopback identification signal is present in the real-time audio data, and determining a second time instant at which the loopback identification signal is detected;
determining the loopback delay value from the difference between the first time instant and the second time instant;
and storing the loopback delay value for subsequent direct invocation.
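The delay measurement can be simulated with near-ultrasonic tones detected by the Goertzel algorithm. The tone frequencies, sampling rate, frame size, and threshold below are illustrative assumptions, not values fixed by the claim:

```python
import math

FS = 48000                       # assumed sampling rate
TONES = [19000, 19500, 20000]    # equally spaced tones above typical hearing

def goertzel_power(frame, freq, fs=FS):
    """Energy of a single frequency in a frame (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / fs)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def make_marker(n):
    """n samples of the multi-tone loopback identification signal."""
    return [sum(math.sin(2.0 * math.pi * f * t / FS) for f in TONES) / len(TONES)
            for t in range(n)]

def find_marker(signal, frame=480, thresh=1.0):
    """Index of the first frame in which every marker tone is present."""
    for start in range(0, len(signal) - frame, frame):
        win = signal[start:start + frame]
        if all(goertzel_power(win, f) > thresh for f in TONES):
            return start
    return -1

# Simulate a 960-sample loopback: the marker injected at sample 0 shows
# up again in the capture 960 samples later.
t0, loop = 0, 960
captured = [0.0] * loop + make_marker(2400)
t1 = find_marker(captured)
delay_samples = t1 - t0          # frame-granular loopback delay estimate
```

Requiring all tones at once, rather than a single frequency, makes a false trigger from ordinary program material unlikely, which is presumably why the claim uses several equally spaced single-frequency signals.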
6. The live network audio processing method according to claim 5, wherein detecting whether the loopback identification signal is present in the real-time audio data comprises the following steps:
tracking the noise signal of the real-time audio data in the time domain, and transforming the noise signal into the frequency domain to obtain corresponding noise energy spectrum data;
locating the peak position of each frequency bin in the speech energy spectrum data mapped from the speech frames of the real-time audio data;
calculating the existence probability of the loopback identification signal in each speech frame from the speech energy and the noise energy corresponding to each frequency bin;
and determining that the loopback identification signal is detected when the existence probabilities of a plurality of consecutive speech frames satisfy a preset condition.
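A toy sketch of the frame-wise decision described in this claim. The sigmoid-like probability mapping and the thresholds are assumptions, since the claim only requires "a preset condition" over several consecutive frames:

```python
def presence_probability(speech_energy, noise_energy, eps=1e-12):
    """Map the per-bin speech-to-noise energy ratio to a presence
    probability in [0, 1); the mapping itself is an assumption here."""
    snr = speech_energy / (noise_energy + eps)
    return snr / (1.0 + snr)

def detect(frame_probs, min_frames=5, p_min=0.8):
    """Declare the loopback identification signal detected once
    min_frames consecutive frames all reach probability p_min."""
    run = 0
    for p in frame_probs:
        run = run + 1 if p >= p_min else 0
        if run >= min_frames:
            return True
    return False
```

For example, `detect([0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9])` returns `True` (five consecutive high-probability frames), while a single low-probability frame in the middle of a run resets the counter, suppressing transient false positives.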
7. The live network audio processing method according to any one of claims 1 to 6, comprising, before the step of performing echo cancellation on the locally collected real-time audio data with the far-end audio data as a reference signal to obtain the intermediate audio data, the following step:
detecting an echo cancellation function switch of the live broadcast room user logged in on the local machine, and performing the subsequent steps only when the echo cancellation function switch is in an activated state.
8. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke and execute a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 7, which, when invoked by a computer, performs the steps of the corresponding method.
10. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 7.
CN202111144000.5A 2021-09-28 2021-09-28 Network live broadcast audio processing method and device, equipment, medium and product thereof Active CN113938746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111144000.5A CN113938746B (en) 2021-09-28 2021-09-28 Network live broadcast audio processing method and device, equipment, medium and product thereof


Publications (2)

Publication Number Publication Date
CN113938746A true CN113938746A (en) 2022-01-14
CN113938746B CN113938746B (en) 2023-10-27

Family

ID=79277184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111144000.5A Active CN113938746B (en) 2021-09-28 2021-09-28 Network live broadcast audio processing method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN113938746B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579015A (en) * 2022-09-23 2023-01-06 恩平市宝讯智能科技有限公司 Big data audio data acquisition management system and method
CN116168712A (en) * 2023-02-23 2023-05-26 广州趣研网络科技有限公司 Audio delay cancellation method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102811310A (en) * 2011-12-08 2012-12-05 苏州科达科技有限公司 Method and system for controlling voice echo cancellation on network video camera
US20130002797A1 (en) * 2010-10-08 2013-01-03 Optical Fusion Inc. Audio Acoustic Echo Cancellation for Video Conferencing
US20130294611A1 (en) * 2012-05-04 2013-11-07 Sony Computer Entertainment Inc. Source separation by independent component analysis in conjunction with optimization of acoustic echo cancellation
CN110138650A (en) * 2019-05-14 2019-08-16 北京达佳互联信息技术有限公司 Sound quality optimization method, device and the equipment of instant messaging
CN110335618A (en) * 2019-06-06 2019-10-15 福建星网智慧软件有限公司 A kind of method and computer equipment improving non-linear inhibition
CN110970045A (en) * 2019-11-15 2020-04-07 北京达佳互联信息技术有限公司 Mixing processing method, mixing processing device, electronic equipment and storage medium
CN111372121A (en) * 2020-03-16 2020-07-03 北京文香信息技术有限公司 Echo cancellation method, device, storage medium and processor
CN111402910A (en) * 2018-12-17 2020-07-10 华为技术有限公司 Method and equipment for eliminating echo
CN113192526A (en) * 2021-04-28 2021-07-30 北京达佳互联信息技术有限公司 Audio processing method and audio processing device



Also Published As

Publication number Publication date
CN113938746B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
US8538037B2 (en) Audio signal decorrelator, multi channel audio signal processor, audio signal processor, method for deriving an output audio signal from an input audio signal and computer program
US8126161B2 (en) Acoustic echo canceller system
CN110970045B (en) Mixing processing method, mixing processing device, electronic equipment and storage medium
CN113938746A (en) Network live broadcast audio processing method and device, equipment, medium and product thereof
US20190206417A1 (en) Content-based audio stream separation
EP3040990B1 (en) Audio processing method and audio processing apparatus
US20100217590A1 (en) Speaker localization system and method
Herre et al. Acoustic echo cancellation for surround sound using perceptually motivated convergence enhancement
CN108141502A (en) Audio signal processing
CN111489760A (en) Speech signal dereverberation processing method, speech signal dereverberation processing device, computer equipment and storage medium
JP2014052654A (en) System for extracting and changing reverberant content of audio input signal
JPH09503590A (en) Background noise reduction to improve conversation quality
US8194851B2 (en) Voice processing apparatus, voice processing system, and voice processing program
CN110956976B (en) Echo cancellation method, device and equipment and readable storage medium
CN108076239B (en) Method for improving IP telephone echo
JP5034607B2 (en) Acoustic echo canceller system
US20210144499A1 (en) Inter-channel level difference based acoustic tap detection
CN113891152A (en) Audio playing control method and device, equipment, medium and product thereof
US11380312B1 (en) Residual echo suppression for keyword detection
Reindl et al. Analysis of two generic wiener filtering concepts for binaural speech enhancement in hearing aids
Bispo et al. Hybrid pre-processor based on frequency shifting for stereophonic acoustic echo cancellation
Romoli et al. Multichannel acoustic echo cancellation exploiting effective fundamental frequency estimation
WO2021120795A1 (en) Sampling rate processing method, apparatus and system, and storage medium and computer device
JPS62239631A (en) Stereo sound transmission storage system
US20230290329A1 (en) Acoustic signal cancelling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant